Anomaly detectors

APEX AIOps Incident Management metric policies include the following detectors.

  • Adaptive detector — Useful for metrics with consistent ranges in normal conditions.

  • Threshold detector — Useful for metrics with fixed thresholds for normal vs. anomalous behavior.

Adaptive detector

The Adaptive detector identifies anomalies statistically. It calculates high and low thresholds from the median absolute deviation of the metric, and these thresholds vary over time. This detector is useful for metrics whose values do not deviate widely under normal conditions. For example, you might want to observe a specific server and detect sudden spikes or drops in CPU utilization that indicate possible problems with the OS, platform, or mission-critical apps running on that server.

The adaptive detector gathers data for metrics over time and calculates a range of values that describe average, or normal, operational parameters. Values outside this range, either high or low, are considered anomalies and produce alerts.
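The following Python sketch illustrates the general idea of deriving a normal range from a median and a median absolute deviation (MAD). It is not the product's actual algorithm: the function names, the multiplier `k`, and the sample window are assumptions chosen only for illustration.

```python
from statistics import median

def mad_bounds(history, k=3.0):
    """Return (low, high) thresholds for a window of recent metric values.

    k is a hypothetical sensitivity multiplier, not a product setting.
    """
    center = median(history)
    # Median absolute deviation: the typical distance from the median.
    mad = median([abs(x - center) for x in history])
    return center - k * mad, center + k * mad

def is_anomaly(value, history, k=3.0):
    """True when value falls outside the range calculated from history."""
    low, high = mad_bounds(history, k)
    return value < low or value > high

# Hypothetical CPU utilization samples (percent) from a steady server.
cpu_history = [41.0, 43.0, 40.0, 42.0, 44.0, 42.0, 41.0, 43.0]

print(mad_bounds(cpu_history))        # (39.0, 45.0)
print(is_anomaly(95.0, cpu_history))  # True: sudden spike
print(is_anomaly(43.0, cpu_history))  # False: within the normal range
```

Recomputing the bounds over a sliding window of recent values is one way the thresholds could change over time as the metric's behavior shifts.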

Metric policies that use an adaptive detector can also include a "confidence zone," which acts as a guardrail for anomaly detection. On some systems, many data values fall outside the adaptive detector's calculated normal range but do not warrant an alert. For example, a system that runs at 3% CPU utilization most of the time but occasionally jumps to 25% could be operating normally. The confidence zone prevents alert creation for data points that are statistically anomalous but still within an acceptable operating range.
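As a rough illustration of how a confidence zone might interact with the calculated range, the sketch below suppresses alerts for values that are outside the adaptive range but inside an acceptable operating range. The function name, parameters, and numeric bounds are hypothetical and do not reflect the product's configuration fields.

```python
def should_alert(value, adaptive_low, adaptive_high, zone_low, zone_high):
    """Alert only when value is anomalous AND outside the confidence zone.

    All parameter names are illustrative assumptions.
    """
    outside_adaptive = value < adaptive_low or value > adaptive_high
    inside_zone = zone_low <= value <= zone_high
    return outside_adaptive and not inside_zone

# Assumed adaptive range for a server that usually sits near 3% CPU,
# with an assumed confidence zone of 0% to 30%.
adaptive_range = (2.0, 4.0)
confidence_zone = (0.0, 30.0)

print(should_alert(25.0, *adaptive_range, *confidence_zone))  # False
print(should_alert(85.0, *adaptive_range, *confidence_zone))  # True
```

In this sketch the jump to 25% is outside the adaptive range but inside the zone, so no alert is created, while 85% is outside both and does produce one.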

Threshold detector

The Threshold detector identifies anomalies based on a fixed upper threshold, a fixed lower threshold, or both. This detector is useful for metrics where you know the thresholds for normal and anomalous behavior for a specific host or platform, and where these thresholds do not change over time. For example, suppose you want to identify anomalies in the amount of free physical memory on a specific server. You might define a lower bound of 10% to signal that the server might be running out of memory, and an upper bound of 90% to indicate possible problems with the mission-critical apps running on that server.
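The free-memory example can be sketched as a simple fixed-bound check. This is a minimal illustration, not the product's implementation: the function name and return values are assumptions, and either bound can be omitted to model a one-sided threshold.

```python
from typing import Optional

def threshold_anomaly(value: float,
                      lower: Optional[float] = None,
                      upper: Optional[float] = None) -> Optional[str]:
    """Return "low" or "high" when value breaches a fixed bound, else None."""
    if lower is not None and value < lower:
        return "low"
    if upper is not None and value > upper:
        return "high"
    return None

# Free physical memory (percent) with the bounds from the example above.
print(threshold_anomaly(5.0, lower=10.0, upper=90.0))   # low
print(threshold_anomaly(95.0, lower=10.0, upper=90.0))  # high
print(threshold_anomaly(50.0, lower=10.0, upper=90.0))  # None
```

Unlike the adaptive approach, the bounds here never change, so the check needs no history of past values.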