Customizing Anomaly Detection for Individual Metrics (Advanced)

Moogsoft includes a default set of anomaly detection engines and supported metrics. Each supported metric has a set of anomaly-detection settings that you can configure as needed.

Viewing and editing metric settings

You can view and edit metric settings in two places:

  • Choose Data Config > Integrations and go to the integration: collector, CloudWatch, etc. Then click Configuration and drill down to the metric of interest.

  • Choose Observe > Metrics, select the metric of interest, and then configure the anomaly detection settings in the data point table on the right.

Before you begin

Before you change the default anomaly-detection settings for any specific metric, you should clearly identify and understand the following:

  • The criteria you want to define for normal and anomalous behavior for the metric of interest,

  • The detection engines and the configuration settings described on this page, and

  • The effects of changing the default anomaly-detection behavior for the metric of interest.

It is also good practice to monitor the affected metric closely after you change the detector to ensure that you are getting the detection behavior you want.

Anomaly Detectors

Moogsoft includes the following detectors.

Adaptive detector

The Adaptive detector identifies anomalies based on a statistical calculation against a median absolute deviation, which varies over time and determines the high and low thresholds. This detector is useful for metrics where performance does not deviate widely under normal conditions. For example, you might want to observe a specific server and detect sudden spikes or drops in CPU utilization that indicate possible problems with the OS, platform, or mission-critical apps running on that server.

Most supported metrics in Moogsoft use the Adaptive detector by default.
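The idea behind a median-absolute-deviation (MAD) threshold can be sketched as follows. This is a minimal illustration only, not Moogsoft's implementation: the function names, the default of 3 deviations, and the sample CPU values are assumptions made for the example, and the real detector may scale or smooth the MAD differently.

```python
from statistics import median


def mad_thresholds(samples, deviations=3.0):
    """Return (low, high) thresholds centered on the median of `samples`.

    The spread is the median absolute deviation (MAD) of the samples.
    Hypothetical helper for illustration; not a Moogsoft API.
    """
    center = median(samples)
    mad = median([abs(x - center) for x in samples])
    return center - deviations * mad, center + deviations * mad


def is_anomalous(value, samples, deviations=3.0):
    """True when `value` falls outside the MAD-based thresholds."""
    low, high = mad_thresholds(samples, deviations)
    return value < low or value > high


# Recent CPU utilization readings (percent) that stay in a narrow band.
cpu = [41, 43, 40, 42, 44, 41, 43, 42, 40, 42]

print(mad_thresholds(cpu))    # (39.0, 45.0)
print(is_anomalous(95, cpu))  # True: a sudden spike
print(is_anomalous(43, cpu))  # False: within the normal band
```

Because the thresholds are recomputed from recent samples, they move as the metric's normal behavior shifts, which is what distinguishes this approach from a fixed threshold.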

Threshold detector

The Threshold detector identifies anomalies based on a fixed upper and/or lower threshold. This detector is useful for metrics where you know the thresholds for normal and anomalous behavior for a specific host or platform, and these thresholds do not change over time. For example, suppose you want to identify anomalies in the amount of free physical memory on a specific server. You might define a lower bound of 10%, to signal the server might be running out of memory; and an upper bound of 90%, to indicate possible problems with the mission-critical apps running on that server.

Note

Few metrics use the Threshold detector by default. The criteria for normal and anomalous behavior can be highly variable and dependent on the specific metric, monitored host, and other factors. You should carefully consider the thresholds for the metric and host of interest to avoid false positives and false negatives in the anomaly-detection behavior.
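A fixed-threshold check like the free-memory example above can be sketched in a few lines. The function name and parameters are hypothetical and only illustrate the logic; passing just one bound mirrors the Vector setting described later on this page.

```python
def threshold_anomaly(value, low=None, high=None):
    """Return True when `value` is outside the configured fixed bounds.

    Pass only `low`, only `high`, or both. A bound left as None is ignored.
    Hypothetical helper for illustration; not a Moogsoft API.
    """
    below = low is not None and value < low
    above = high is not None and value > high
    return below or above


# Free physical memory, in percent, with the bounds from the example above.
print(threshold_anomaly(5, low=10, high=90))   # True: below the lower bound
print(threshold_anomaly(50, low=10, high=90))  # False: within bounds
print(threshold_anomaly(95, low=10))           # False: only the lower bound is checked
```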

Bitwise / false detectors

Bitwise and false detectors identify anomalies in a bitmask or Boolean metric. These detectors are provided to support metrics that evaluate whether a system is running correctly. You cannot configure these detectors or use them for custom metrics.

Anomaly detection settings

You can customize how Moogsoft detects anomalies and collects metric data for an individual metric. Not all of these options are configurable for all metrics.

  • Anomaly Detector: In some cases, you can change the anomaly engine for a specific metric.

  • Determine Severities: Choose whether to use calculated confidence to determine severity or set the number of deviations for each severity value.

  • Deviations: The number of standard deviations from the norm beyond which a data point is considered anomalous.

  • Hold for: The number of anomalous data points to observe before generating an anomaly event.

  • Reset hold for: When a metric is in an anomaly state, this setting determines the number of non-anomalous data points to hold for before resetting the metric severity to Clear.

  • Learning: The number of data points to collect before anomaly detection begins.

  • Minimum Deviation: The minimum possible deviation used to calculate anomalies, based on the historic range of values. This setting is useful for metric data sets with very small ranges.

  • Samples: The number of data points to keep in memory to determine the threshold ranges.

  • Stateful: Generate an anomaly only when the metric changes state (True) or for every anomalous data point (False).

  • Symmetric: By default, the Adaptive detector calculates upper and lower sigma values to determine anomalies above and below the metric mean. Enable this to calculate the same sigma value for anomalies in both directions.

  • Vector: (Threshold detector only) Consider Low Threshold or High Threshold only, or both High and Low thresholds to identify anomalies.

  • Window: The number of data points sent to Moogsoft and displayed before and after each anomaly.
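As one illustration of these settings, the Window value can be thought of as a slice around the anomalous data point. The sketch below is an assumption about the general idea, with a hypothetical function name and made-up values; it does not show how the collector actually packages data.

```python
def window_slice(points, anomaly_index, window=1):
    """Return the anomalous point plus `window` points on each side.

    Fewer points are returned when the anomaly is near either end of the
    series. Hypothetical helper for illustration; not a Moogsoft API.
    """
    start = max(0, anomaly_index - window)
    return points[start:anomaly_index + window + 1]


points = [10, 11, 12, 90, 13, 12]  # the value 90 at index 3 is the anomaly

print(window_slice(points, 3, window=1))  # [12, 90, 13]
print(window_slice(points, 3, window=2))  # [11, 12, 90, 13, 12]
```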

Hold for

The number of additional anomalous data points that the detector waits for before it generates an event. Suppose Hold for = 1. When a metric generates an anomaly, the detector holds for one more anomaly before it generates an event.

The Hold for value is 1 by default for most supported metrics. You might want to increase this number for a specific metric in the following corner case:

  • You want to reduce the "noise" for a metric that generates a lot of repeat anomalies. This can happen if a metric has a very short polling cycle and moves frequently between normal and anomalous range. This can cause the detector to generate a series of repetitive anomalies that say, in essence, "this metric is constantly switching between normal and anomalous range."

Note

If you increase the Hold for value, the detector might miss some anomalies that do not generate enough consecutive data points to trigger an anomaly event.

Consider the following metric, which switches between anomalous and normal states every 2 minutes or so. When hold-for and reset-hold-for are both set to the default of 1, this results in a spurt of anomalies.

hold-for-eq_1-v2.png

You might decide that this is normal behavior, and that you only want to generate anomalies when the metric is in anomalous state for 3 minutes or more. In this case, set hold-for and reset-hold-for to 2 or higher.

hold-for-eq_2-v2.png
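The interaction of hold-for and reset-hold-for can be sketched as a small state machine. This is an illustration under stated assumptions, not the detector's actual code: it assumes one data point per minute, assumes that an event fires on the (hold-for + 1)-th consecutive anomalous point, and assumes that reset-hold-for works the same way for returning to Clear. The function name and event labels are hypothetical.

```python
def hold_events(flags, hold_for=1, reset_hold_for=1):
    """Return a list of (index, "anomaly" | "clear") events.

    `flags` holds one boolean per data point: True means the point is
    anomalous. Hypothetical helper for illustration; not a Moogsoft API.
    """
    events = []
    in_anomaly = False
    run = 0  # consecutive points that disagree with the current state
    for i, flagged in enumerate(flags):
        if flagged == in_anomaly:
            run = 0  # the streak was broken, so start counting again
            continue
        run += 1
        needed = (reset_hold_for if in_anomaly else hold_for) + 1
        if run >= needed:
            in_anomaly = flagged
            run = 0
            events.append((i, "anomaly" if flagged else "clear"))
    return events


# Metric that alternates: 2 minutes normal, 2 minutes anomalous.
flapping = [False, False, True, True, False, False, True, True, False, False]

print(hold_events(flapping, hold_for=1, reset_hold_for=1))
# [(3, 'anomaly'), (5, 'clear'), (7, 'anomaly'), (9, 'clear')]

print(hold_events(flapping, hold_for=2, reset_hold_for=2))
# []

# A sustained anomaly still gets through with the higher setting.
sustained = [False, True, True, True, True, False, False, False]
print(hold_events(sustained, hold_for=2, reset_hold_for=2))
# [(3, 'anomaly'), (7, 'clear')]
```

Under these assumptions, raising both values to 2 suppresses the repetitive events from the flapping metric while still reporting an anomaly that lasts 3 minutes or more.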

Stateful

When True, generate an anomaly only when the metric changes state: when it enters an anomalous state, when its value changes significantly while in an anomalous state, or when it returns to a normal state.

When False, generate an anomaly for every anomalous data point.

The Stateful setting is True by default. You might want to set this to False for a specific metric in the following corner cases:

  • You are sending anomaly data to Moogsoft and you need to consider every anomalous data point for your AIOps analyses.

  • You are receiving anomaly notifications using the Slack integration, and a metric of interest has a very long polling cycle. You might want to set Stateful to False for this metric and thereby ensure that you continue receiving Slack notifications until the metric performance returns to normal.
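The difference between the two Stateful values can be sketched as follows. This is a simplified illustration with a hypothetical function name: it models only the transitions into and out of the anomalous state, and it does not model the case where the value changes significantly while the metric is already anomalous.

```python
def reported_points(flags, stateful=True):
    """Return the indices of the data points that would be reported.

    `flags` holds one boolean per data point: True means anomalous.
    stateful=True reports only state changes; stateful=False reports
    every anomalous point. Hypothetical helper; not a Moogsoft API.
    """
    reported = []
    previous = False  # assume the metric starts in a normal state
    for i, flagged in enumerate(flags):
        if stateful:
            if flagged != previous:
                reported.append(i)
        elif flagged:
            reported.append(i)
        previous = flagged
    return reported


flags = [False, True, True, True, False, False, True, False]

print(reported_points(flags, stateful=True))   # [1, 4, 6, 7]
print(reported_points(flags, stateful=False))  # [1, 2, 3, 6]
```

With Stateful set to False, the long anomalous stretch at indices 1 through 3 produces a report for every data point, which is the behavior the Slack example above relies on.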

Minimum deviation

The minimum possible deviation used to calculate anomalies, based on the historic range of values. This setting is useful for metric data sets with very narrow ranges.

Consider the following metric, where the value remains within 1.0 and 1.05 nearly all the time. If the range of values is very narrow, even tiny deviations can result in "false-positive" anomalies.

minimum-deviation-OFF.png

With the minimum deviation set to 0.3, only values outside of 0.3 times the mean (plus or minus) are considered for anomaly detection.

minimum-deviation-ON.png
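One way to picture the effect of a minimum deviation is as a floor that a data point must clear before it can be flagged. The sketch below is an assumption about the general idea, not Moogsoft's calculation: it treats the setting as a fraction of the mean, as in the 0.3 example above, and requires a point to exceed both the adaptive band and the floor. The function name, the default of 3 deviations, and the sample values are hypothetical.

```python
from statistics import mean, median


def is_anomalous_with_floor(value, samples, deviations=3.0, minimum_deviation=0.0):
    """Flag `value` only if it clears both the MAD band and the floor.

    Hypothetical helper for illustration; not a Moogsoft API.
    """
    center = median(samples)
    mad = median([abs(x - center) for x in samples])
    distance = abs(value - center)
    # Assumed interpretation: the floor is a fraction of the historic mean.
    floor = minimum_deviation * abs(mean(samples))
    return distance > deviations * mad and distance > floor


# A metric that stays between 1.0 and 1.05 nearly all the time.
narrow = [1.0, 1.02, 1.01, 1.03, 1.05, 1.0, 1.02, 1.04, 1.01, 1.03]

print(is_anomalous_with_floor(1.10, narrow))                         # True: tiny band, tiny deviation flagged
print(is_anomalous_with_floor(1.10, narrow, minimum_deviation=0.3))  # False: below the floor
print(is_anomalous_with_floor(1.50, narrow, minimum_deviation=0.3))  # True: clears the floor
```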

Deviations

The number of standard deviations from the norm beyond which a data point is considered anomalous. You might want to change this setting in the following cases:

  • A metric changes frequently and over a wide range, which causes the detector to flag non-anomalous data points as "false-positive" anomalies. In this case you might want to increase the number of deviations.

  • A metric changes very little and a data point outside the norm, even by a small amount, indicates an anomaly. In this case you might want to decrease the number of deviations.

Note

Changing the number of deviations can affect anomaly detection dramatically. As with any change to an anomaly detector, you should closely monitor the metric after you apply the change to ensure that you are getting the detection behavior you want.

Consider the following metric, where the number of deviations is set to 4.

deviations-eq-4.png

If we lower the number of deviations to 2, more data points are now considered anomalous.

deviations-eq-2.png

If we raise the number of deviations to 8, fewer data points are now considered anomalous.

deviations-eq-8.png
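The effect of this setting can be illustrated with a small calculation. This is a sketch only: it uses a plain mean and standard deviation over a fixed baseline rather than the detector's own running statistics, and the function name and data values are made up for the example.

```python
from statistics import mean, pstdev


def count_anomalies(baseline, points, deviations):
    """Count the points farther than `deviations` sigmas from the baseline mean.

    Hypothetical helper for illustration; not a Moogsoft API.
    """
    center = mean(baseline)
    spread = pstdev(baseline)
    return sum(1 for p in points if abs(p - center) > deviations * spread)


baseline = [48, 52, 50, 49, 51, 50, 47, 53]  # mean 50, sigma about 1.87
points = [50, 55, 58, 62, 70]                # new data points to evaluate

for deviations in (2, 4, 8):
    print(deviations, count_anomalies(baseline, points, deviations))
# 2 4
# 4 3
# 8 1
```

The same five data points yield four, three, or one anomaly depending only on the number of deviations, which is why a change to this setting warrants close monitoring afterwards.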

Watch a video on Anomaly Detection Models in Moogsoft

  • Moogsoft anomaly detection works out of the box with no configuration. But let’s take a moment to learn how it works so that, should you need to tweak the default settings, you can do so to produce the result you want.

  • Moogsoft starts detecting anomalies as soon as the source data is ingested.

    That said, it needs to learn how your data behaves before defining what’s normal and what requires attention.

    During the learning period, Express still generates anomalies but does not assign a severity. All alerts are assigned the UNKNOWN severity level (purple). The default correlation settings will bundle them into incidents, but the best practice is to ignore the clustering at this stage. You may want to edit the scope in your default correlation engine to filter out the alerts with the Unknown severity level.

  • There are several different anomaly detection mechanisms in Moogsoft Express, and it chooses the one that is optimal for each metric type. But should you desire, you can override the default detection model and pick a different one. Let’s spend a few moments learning about each type.

  • The Adaptive Detector is a dynamic thresholding mechanism and is used for most data. The Adaptive Detector performs a statistical analysis of the data against the running median absolute deviation and determines the high and low thresholds. It does not follow any set pattern; rather, it adjusts to changes when a new pattern emerges and quickly figures out the new norm.

    In this example, notice this (point to the upper left yellow dot) is considered an anomaly while these are not, because they happen during a peak time. (show a flat line) If you used a fixed threshold, you would not be able to do this.

  • Let’s see the adaptive detector at work in the environment we just set up.

    This one uses the adaptive detector, so the anomaly threshold will change as the data distribution changes over time.

  • The Bitmask and False engines are for processing binary data. Bitmask handles integer data, while False handles Boolean text.

    For metrics that have a clear on or off, up or down status, Moogsoft Express will select these engines.

  • Since the anticipated value is either true or false, Express is using the false detector to find anomalies. And here’s the one that uses bitmask…

  • If you prefer, static thresholding is available too. This is a simple evaluation of data against hard-coded threshold values. But what’s notable is that Moogsoft offers not just a high threshold but a low threshold too!

  • Lastly, the model detector is another dynamic thresholding mechanism like the adaptive detector. But rather than constantly calculating the norm, it creates a model from the observed pattern. Here’s a scenario. Let’s say we are monitoring a service that streams movies. As more people tend to use the service in the evenings, the norm in the evenings can be twice as high as in the morning. This fluctuation is expected, so you want to adjust the threshold for that time frame.

    For the metrics like this, you can create a model based on the past data, and use it as a baseline for new incoming data.

  • So now we know how Moogsoft identifies anomalies. Now, let’s talk about how it rates the level of each anomaly. For each anomaly, Moogsoft assigns a color-coded severity level, so an anomaly that is very far from the norm will have the red critical label. As the level of anomaly lessens, the severity label will change to orange, yellow, and blue. And of course, you already know that purple means “unknown” because Moogsoft is still learning from the incoming data.

  • Also, aside from the severity levels, Moogsoft assigns a confidence value to each anomaly. This is unrelated to how severe the anomaly is; it is about how certain Moogsoft is that each detected anomaly is really an anomaly.

  • For rule-based detectors like the threshold, false, and bitmask detectors, the confidence level will always be 1, which means we are 100% confident that the given data should be considered an anomaly. For other detectors, Express assigns a confidence value between 0 and 1 by comparing the given anomaly against all the other anomalies for that metric. So the closer the value is to 1, the more confident you can be that the detected anomaly truly requires your attention.

    Now you know the anomaly detection models in Moogsoft. 

    Thanks for watching!

Watch how to Analyze Data in Moogsoft

  • Vector allows you to set the upper threshold, lower threshold, or both upper and lower thresholds.

    For example, if you don’t care how low the error count is, you can set only the upper threshold, for when the errors get close to the error budget value.

  • Or, you can set the lower threshold only for something like memfree. You don’t care if a lot of memory is free, so you only set the lower threshold for when the free memory value goes below a certain level.

  • Of course you can set both upper and lower thresholds. For example, the CPU usage for our EC2 environment is typically around 50%. Naturally you want to watch for high CPU usage, but if the usage goes too low, then most likely your processes died, and that’s also a problem.

  • Next, hold for is the number of data points to observe before generating an anomaly.

    For example, suppose there’s a spike in your metrics and one data point falls outside the norm. If your hold for value is set to 1 or above, then the collector will wait for the next data point to see if it is also an anomaly before sending the data to Moogsoft. (point at the first pink dot) In this example the next data point went back to the norm, so the collector will not process this further. (point at the second pink dot) In this example, the second data point stayed outside the norm, so at this point, the collector will handle this as an anomaly.

    If you want the first anomalous data point to be processed as an anomaly right away, then set the hold for value to

  • You may be wondering how best to determine the hold for value, so here’s a best practice.

    Consider the polling period and SLA.

    For example, suppose your polling period is every 5 minutes and your hold for value is set to 1. Then it could take almost 10 minutes from when the problem starts before the anomaly is reported. Does this meet your SLA? If so, great. If that’s too long, then you might want to change the hold for value to 0.

  • Stateful lets you choose whether you want to only report on the anomaly that changes the state of the metric.  

    For example, if you set the stateful value to true, these two data points that show a significant change in state will be considered anomalies.

  • If you choose false, then all of these will be processed as anomalies.

  • Window is where you can define how many data points should be included when a collector sends the anomaly data to Moogsoft.

    For example, if you set it to 1, 1 data point before and 1 after will be sent to Moogsoft.

  • Here’s a common question we get - how big should the sample size be?

    If the sample size is too small, you may not catch the typical fluctuations.

  • In general, a bigger sample size gives a better calculation of the standard deviation, so you will end up getting fewer false positives.

  • However, if the data doesn’t vary too wildly, making the sample size too large makes everything “normal”. Also, the larger the sample size is, the more memory it takes.

  • In general, set the learning period pretty close to the sample period. Now you know some of the key metric attributes in Moogsoft to help you analyze data. Thanks for watching!

  • Suppose you are monitoring the usage of a terabyte disk with little activity. The memory usage never changes, so the metric just shows almost a flat line.

    Then the median absolute deviation never varies from your standard deviation.

    So your standard deviation goes down to zero and your threshold basically just disappears.

  • As a result, any change at all ends up being considered an anomaly. How can we avoid this?

  • So that's where acceptable delta comes in. For values that very rarely change, rather than trying to calculate the standard deviation, set the acceptable delta to set the threshold manually.

    Acceptable delta sits on top of the adaptive detector.
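The flat-line problem described above, and the way a manually set delta avoids it, can be sketched as follows. This is an illustration under assumptions, not the product's logic: it treats the acceptable delta as an absolute band that applies whenever the adaptive band is narrower, and the function name, the default of 3 deviations, and the disk values are hypothetical.

```python
from statistics import median


def flat_line_anomaly(value, samples, deviations=3.0, acceptable_delta=0.0):
    """Flag `value` when it leaves the wider of the MAD band and the delta.

    Hypothetical helper for illustration; not a Moogsoft API.
    """
    center = median(samples)
    mad = median([abs(x - center) for x in samples])
    # With a flat history the MAD is 0, so the adaptive band collapses.
    band = max(deviations * mad, acceptable_delta)
    return abs(value - center) > band


disk_used_gb = [512.0] * 20  # usage that never changes

print(flat_line_anomaly(512.1, disk_used_gb))                        # True: any change is flagged
print(flat_line_anomaly(512.1, disk_used_gb, acceptable_delta=5.0))  # False: within the delta
print(flat_line_anomaly(530.0, disk_used_gb, acceptable_delta=5.0))  # True: outside the delta
```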