Metric anomaly detection

Moogsoft Express includes a default set of anomaly detection engines and supported metrics. Each supported metric has a set of anomaly-detection settings that you can configure as needed.

Viewing and editing metric settings

You can view and edit the settings for individual metrics from the Collectors and Integrations pages:

  • Linux Server metrics: Go to the Collectors page, then drill down to the metric of interest: managed_object_name > metric_name.

  • AWS Integration metrics: Go to Integrations > AWS CloudWatch, click Configuration, and then drill down to the metric of interest.

Before you begin

Before you change the default anomaly-detection settings for any specific metric, you should clearly identify and understand the following:

  • The criteria you want to define for normal and anomalous behavior for the metric of interest,

  • The detection engines and the configuration settings described on this page, and

  • The effects of changing the default anomaly-detection behavior for the metric of interest.

Anomaly detectors

Express uses the following anomaly detectors.

Adaptive detector

The Adaptive detector identifies anomalies based on a statistical calculation against a median absolute deviation, which varies over time and determines the high and low thresholds. This detector is useful for metrics whose values rarely leave their usual range and do not deviate widely under normal conditions. For example, you might want to observe a specific server and detect sudden spikes or drops in CPU utilization that indicate possible problems with the OS, platform, or mission-critical apps running on that server.

Most supported metrics in Express use the Adaptive detector by default.
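
The following Python sketch illustrates the general idea of deriving high and low thresholds from a median absolute deviation (MAD). It is not the Express implementation: the function names, the multiplier of 3 deviations, and the sample values are assumptions chosen for illustration only.

```python
from statistics import median


def mad_thresholds(samples, deviations=3.0):
    """Return (low, high) thresholds around the median of the samples.

    `deviations` is a hypothetical multiplier; Express may compute
    its thresholds differently.
    """
    center = median(samples)
    # Median absolute deviation: the median distance from the center.
    mad = median(abs(x - center) for x in samples)
    return center - deviations * mad, center + deviations * mad


def is_anomalous(value, samples, deviations=3.0):
    """Flag a value that falls outside the MAD-based thresholds."""
    low, high = mad_thresholds(samples, deviations)
    return value < low or value > high


# Hypothetical recent CPU utilization readings, in percent.
cpu_history = [21.0, 23.0, 22.0, 24.0, 22.0, 23.0, 21.0, 22.0]
low, high = mad_thresholds(cpu_history)
spike_detected = is_anomalous(95.0, cpu_history)
```

Because the thresholds are recomputed from recent samples, they move as the metric's typical behavior changes, which is what distinguishes this approach from a fixed threshold.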

Threshold detector

The Threshold detector identifies anomalies based on a fixed upper threshold, a fixed lower threshold, or both. This detector is useful for metrics where you know the thresholds for normal and anomalous behavior for a specific host or platform, and these thresholds do not change over time. For example, suppose you want to identify anomalies in the amount of free physical memory on a specific server. You might define a lower bound of 10% to signal that the server might be running out of memory, and an upper bound of 90% to indicate possible problems with the mission-critical apps running on that server.

Note

Few metrics use the Threshold detector by default, because the criteria for normal and anomalous behavior are highly variable and depend on the specific metric, monitored host, and other factors. You should carefully consider the thresholds for the metric and host of interest to avoid false positives and false negatives in the anomaly-detection behavior.
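
As a rough illustration of a fixed-bound check like the free-memory example above, the Python sketch below tests a value against an optional lower and upper bound. The function name and return values are hypothetical and do not reflect how Express implements or exposes the Threshold detector.

```python
def threshold_anomaly(value, low=None, high=None):
    """Return "low", "high", or None for a value checked against fixed bounds.

    Passing only `low` or only `high` checks a single side, which is
    roughly what the Vector setting controls for this detector.
    """
    if low is not None and value < low:
        return "low"
    if high is not None and value > high:
        return "high"
    return None


# Free physical memory, in percent, with the bounds from the example.
low_result = threshold_anomaly(7.5, low=10, high=90)
high_result = threshold_anomaly(95, low=10, high=90)
normal_result = threshold_anomaly(50, low=10, high=90)
```

With only `low=10` supplied, a reading of 95 is not flagged, because the upper side is not being checked.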

Bitwise/False detectors

Bitwise and False detectors identify anomalies in a bitmask or boolean metric. These detectors are provided to support metrics that evaluate whether a system is running correctly. You cannot configure these detectors or use them for custom metrics.

Anomaly detection settings

You can customize how Express detects anomalies and collects metric data for an individual metric. Not all of these options are configurable for all metrics.

  • Anomaly Detectors: In some cases, you can change the anomaly detector for a specific metric.

  • Vector: (Threshold detector only) Whether to identify anomalies using the Low Threshold only, the High Threshold only, or both thresholds.

  • Determine Severities: Choose whether to use calculated confidence to determine severity or set the number of deviations for each severity value.

  • Hold For: The number of anomalous data points to observe before generating an anomaly event.

  • Learning: The number of data points to collect before anomaly detection begins.

  • Acceptable Delta: The percentage of deviation from the median absolute deviation that is tolerated without generating an anomaly.

  • Stateful: Generate an anomaly only when the metric changes state (True) or for every anomalous data point (False).

  • Samples: The number of data points to keep in memory to determine the threshold ranges.
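
To show how the Learning and Samples settings could interact, here is a minimal Python sketch of a rolling window. The class and attribute names are hypothetical, and the behavior is an assumption based only on the descriptions above, not on the Express implementation.

```python
from collections import deque


class RollingWindow:
    """Keep the most recent `samples` data points for threshold calculation.

    Detection is considered ready only after `learning` data points
    have been observed. Both names mirror the settings above but are
    otherwise illustrative.
    """

    def __init__(self, learning, samples):
        self.learning = learning
        self.window = deque(maxlen=samples)  # oldest points drop off
        self.seen = 0

    def add(self, value):
        self.window.append(value)
        self.seen += 1

    @property
    def ready(self):
        return self.seen >= self.learning


window = RollingWindow(learning=3, samples=5)
for reading in [1, 2]:
    window.add(reading)
ready_after_two = window.ready
for reading in [3, 4, 5, 6, 7, 8]:
    window.add(reading)
```

After eight readings the window holds only the last five, so older behavior no longer influences the threshold ranges.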

Hold For

The number of consecutive anomalous data points that the detector holds for before it generates an event. For example, suppose Hold For = 1. When a metric produces an anomalous data point, the detector holds for one more anomalous data point before it generates an event.

The Hold For window is 1 by default for most supported metrics. You might want to increase this number for a specific metric in the following corner case:

  • You want to reduce the "noise" for a metric that generates a lot of repeat anomalies. This can happen if a metric has a very short polling cycle and moves frequently between the normal and anomalous ranges. The detector then generates a series of repetitive anomalies that say, in essence, "this metric is constantly switching between normal and anomalous range."

Note

If you increase the Hold For window, the detector might miss some anomalies that do not generate enough consecutive data points to trigger an anomaly event.
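
The trade-off described in this section can be sketched in Python as follows. This is an illustration under one assumed reading of Hold For: an event fires once a streak of consecutive anomalous data points reaches `hold_for + 1`. The function name and the sample data are hypothetical.

```python
def events_with_hold(flags, hold_for=1):
    """Return the indexes at which an anomaly event would be generated.

    `flags` is a sequence of booleans, one per data point, where True
    means the data point was anomalous. An event fires when the current
    streak of anomalous points reaches hold_for + 1 (an assumption).
    """
    events = []
    streak = 0
    for index, anomalous in enumerate(flags):
        streak = streak + 1 if anomalous else 0
        if streak == hold_for + 1:
            events.append(index)
    return events


# A metric that flaps between the normal and anomalous ranges.
flapping = [False, True, False, True, True, True, False, True]

no_hold = events_with_hold(flapping, hold_for=0)
default_hold = events_with_hold(flapping, hold_for=1)
long_hold = events_with_hold(flapping, hold_for=3)
```

In this sample, no hold produces three events, the default hold of 1 produces one, and a hold of 3 produces none, because no streak of anomalous points lasts long enough.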

Stateful

When True, generate an anomaly only when the metric changes state: when it enters an anomalous state, when its value changes significantly while in an anomalous state, or when it returns to a normal state.

When False, generate an anomaly for every anomalous data point.

The Stateful setting is True by default. You might want to set this to False for a specific metric in the following corner cases:

  • You are sending anomaly data to Moogsoft AIOps and you need to consider every anomalous data point for your AIOps analyses.

  • You are receiving anomaly notifications using the Slack integration, and a metric of interest has a very long polling cycle. You might want to set Stateful to False for this metric and thereby ensure that you continue receiving Slack notifications until the metric performance returns to normal.
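
The difference between the two Stateful modes can be sketched in Python as shown below. This is only an illustration: the function name, the event labels, and the `significant_change` parameter are hypothetical, since this page does not define what counts as a significant change.

```python
def emit_events(points, stateful=True, significant_change=10.0):
    """Return the events generated for a series of (value, anomalous) points.

    stateful=True: emit on entering an anomalous state, on a large
    change while anomalous, and on returning to normal.
    stateful=False: emit for every anomalous data point.
    """
    events = []
    in_anomaly = False
    last_reported = None
    for value, anomalous in points:
        if not stateful:
            if anomalous:
                events.append(("anomaly", value))
            continue
        if anomalous and not in_anomaly:
            events.append(("entered", value))
            last_reported = value
        elif anomalous and abs(value - last_reported) >= significant_change:
            events.append(("changed", value))
            last_reported = value
        elif not anomalous and in_anomaly:
            events.append(("cleared", value))
        in_anomaly = anomalous
    return events


# Hypothetical data points: (value, was it flagged as anomalous?)
points = [(50, False), (95, True), (96, True), (120, True), (55, False)]

stateful_events = emit_events(points, stateful=True)
stateless_events = emit_events(points, stateful=False)
```

With Stateful set to False, every anomalous data point produces an event, which matches the Slack use case above where notifications should keep arriving until the metric returns to normal.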