Configure metric policies
Metric policies define how your metric data is handled (dropped, stored, or used for anomaly detection) and make it possible to generate anomalies for relevant metric values. You can choose which metrics to monitor and define the parameters that indicate anomalous conditions which generate alerts.
You can define the high and low boundaries for individual metrics using a threshold detector so that anomalies occur when sampled data values are above or below the defined thresholds. Alternatively, you can select an adaptive detector in the policy, which directs APEX AIOps Incident Management to observe your data over time to determine what constitutes usual operating parameters, and to create anomalies when data values are outside those parameters.
You can also choose how to handle sampled data. Configuration settings allow you to create policies which drop sampled data with particular characteristics (for example, data from a system used in a pre-production environment, or from one undergoing relocation or configuration changes). You can use metric policy filters to drop all data associated with specific metrics, or to differentiate between unwanted data and other data you prefer to keep. Note that a small amount of data is stored for metrics which are always dropped, so a limited sample is available for review, if needed.
Add a metric policy
After you have set up metrics ingestion, you can add metric policies.
There is always one enabled metric policy—the Default Metric Policy. The Default Metric Policy stores metric data and performs anomaly detection using a default configuration. To customize anomaly detection for your environment, you can change settings in the Default Metric Policy, or create additional metric policies. When there are other policies, the Default Metric Policy determines how metric data which does not match any other metric policy is handled.
If you are not sure how to start creating your own metric policies, consider creating a policy for each metric with data you would like to track. Use the metric name as the scope, select an adaptive detector, and be sure to select Detect Anomalies & Store for each. You can always adjust the scope or create additional policies after you collect some initial data samples. You can also observe how the Default Metric Policy detects anomalies and make changes to it, or create a similar policy and see how changes impact anomaly detection.
Note
After configuring a policy, carefully monitor the metric data to ensure that the policy is producing the intended effect.
Navigate to Correlate & Automate > Metric Policies.
Click Create new policy.
Enter the appropriate information in the Name and optional Description fields.
It is helpful to choose descriptive names for your policies because policy names appear on the Alerts and Incidents pages for the alerts and incidents they are responsible for.
Define the metric this policy applies to in the Scope Query field.
The simplest scope is the name of a metric, as in this example:
metric = <my_metric_name>
You can also construct a more complex query to identify the specific metric data that this policy affects. You can use any metric fields in your metric scope:
metric = my_metric_name AND source = 10.0.0.5 AND tags.category = "cisco switch"
You can also use other metric attributes, like the source or managed object:
source = my_server
Note
Check the validity of scope queries by pasting the query in a metrics chart. If a chart displays for the query, then it is a valid scope for a metric policy. A valid scope query includes at least one item in this list:
key
managed object
mar
metric
source
tags
If you leave the scope blank, the policy applies to all metrics that are evaluated by it.
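To make the scope examples above concrete, the following Python sketch shows one way to think about how a scope made of equality conditions joined by AND selects metric data. It is an illustration only: the parse_scope, lookup, and matches_scope helpers, the dictionary layout of a metric datum, and the simplified parsing are all assumptions made for this example. They do not represent the Incident Management query engine or its full query syntax.

```python
def parse_scope(query):
    """Parse 'field = value AND field = value' into (field, value) pairs.

    Hypothetical, simplified parser: supports only equality joined by AND.
    """
    conditions = []
    if not query.strip():
        return conditions  # blank scope: no conditions, so every datum matches
    for clause in query.split(" AND "):
        field, _, value = clause.partition("=")
        conditions.append((field.strip(), value.strip().strip('"')))
    return conditions


def lookup(datum, field):
    """Resolve dotted fields such as 'tags.category' in a nested dict."""
    current = datum
    for part in field.split("."):
        if not isinstance(current, dict) or part not in current:
            return None
        current = current[part]
    return current


def matches_scope(datum, query):
    """Return True when the datum satisfies every condition in the scope."""
    for field, expected in parse_scope(query):
        actual = lookup(datum, field)
        if actual is None or str(actual) != expected:
            return False
    return True


# Hypothetical metric datum, laid out as a nested dict for illustration.
datum = {
    "metric": "my_metric_name",
    "source": "10.0.0.5",
    "tags": {"category": "cisco switch"},
}
in_scope = matches_scope(
    datum,
    'metric = my_metric_name AND source = 10.0.0.5 AND tags.category = "cisco switch"',
)
```

In this sketch a blank scope yields no conditions, so every datum matches, which mirrors the blank-scope behavior described above.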
In the Action section, select how to handle the metric data within the scope of this policy by selecting Detect Anomalies & Store, Store Only, or Do Not Store.
Detect Anomalies & Store: Use this policy to monitor the metric data for anomalies and to create incidents based on anomalous events.
Store Only: Save the metric data without examining it for anomalies. You may want to select this option if you would like to view the metric, but do not want to create alerts and incidents for the data.
There are no further options for creating a metric policy if you select Store Only. No alerts will be created for matching metrics when you save and enable the policy.
Do Not Store: Do not store the data for this metric. No anomaly detection will be performed.
There are no further options for creating a metric policy if you select Do Not Store. No alerts will be created for matching metrics when you save and enable the policy.
A temporary summary of the dropped metrics is stored for troubleshooting purposes. It is not viewable through the Incident Management interface.
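The three actions can be summarized as a small lookup. The Python sketch below is a mental model only: the Action enum, the effects and choose_action helpers, and the idea of keying policies by metric name are assumptions made for this example, not part of the product. The fallback to Detect Anomalies & Store reflects the Default Metric Policy behavior described earlier, where data that matches no other policy is stored and checked for anomalies.

```python
from enum import Enum


class Action(Enum):
    DETECT_AND_STORE = "Detect Anomalies & Store"
    STORE_ONLY = "Store Only"
    DO_NOT_STORE = "Do Not Store"


def effects(action):
    """Return (data_stored, anomaly_detection_performed) for an action.

    Do Not Store still keeps a temporary troubleshooting summary, which
    this simplified model does not represent.
    """
    return {
        Action.DETECT_AND_STORE: (True, True),
        Action.STORE_ONLY: (True, False),
        Action.DO_NOT_STORE: (False, False),
    }[action]


def choose_action(metric_name, policy_actions, default_action=Action.DETECT_AND_STORE):
    """Pick the action for a metric; unmatched metrics fall to the default.

    policy_actions is a hypothetical dict of metric name -> Action, a
    simplification of policies whose scope is a single metric name.
    """
    return policy_actions.get(metric_name, default_action)


# Hypothetical policies: one metric is stored only, one is dropped.
policy_actions = {
    "cpu_idle": Action.STORE_ONLY,
    "preprod_latency": Action.DO_NOT_STORE,
}
stored, detected = effects(choose_action("cpu_idle", policy_actions))
```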
In the Detect Anomalies & Store section, choose how to perform anomaly detection on the data for this metric.
Select the detector to use with this data: Adaptive or Threshold.
Policies can either allow anomaly detection to "learn" what constitutes a normal range of values for a metric, or use manually configured thresholds to define the normal range. Adaptive policies collect a configured number of data points before determining what constitutes acceptable operational values for the data within the policy scope. Alternatively, policies using a threshold detector rely on upper and lower threshold boundary values which you define, with the values outside of those boundaries potentially generating anomalies.
Do one of the following:
For policies using an adaptive detector:
In the Deviation field, enter the number of standard deviations a value can vary from the median before it is considered an anomaly.
In the Minimum Deviation section, enter the value to use to determine the thresholds regardless of calculated values. This setting increases the range of normal operational values.
In the Confidence Zone section, select whether to create anomalies for all values outside the observed normal range, or whether to include optional high and low threshold values and only create anomalies for values above or below those values (even if the values are outside the normally observed range of operation).
For example, you might use these values on a system that normally runs well below capacity but occasionally experiences periods of high usage. This could be a CPU that normally only runs at 4% of capacity, but is heavily used at times and reaches 40% of capacity. While 40% is much higher than 4%, it is still not a level which requires intervention. To avoid creating alerts in this scenario, you could create a low confidence zone value of 0 and a high value of 50.
In the Determine Severities section, either select Confidence to use automatically calculated values to determine the severity, or select Deviations to manually enter the number of deviations which determine each severity.
For policies using a threshold detector:
In the Confidence Zone section, enter the threshold values in the High and Low fields. Metric data values outside of these boundaries generate anomalies.
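The following Python sketch illustrates how the two detector types might evaluate a value. It is a rough model under stated assumptions, not the product's algorithm: the function names are hypothetical, the adaptive band is assumed to be the median plus or minus the configured number of population standard deviations, and Minimum Deviation is assumed to act as a floor on the half-width of that band. The confidence zone logic follows the CPU example above, where a value outside the observed normal range is only anomalous if it also falls outside the zone.

```python
import statistics


def adaptive_bounds(samples, deviation, minimum_deviation=0.0):
    """Return (low, high) for the assumed adaptive band around the median."""
    center = statistics.median(samples)
    sigma = statistics.pstdev(samples)
    # Assumption: Minimum Deviation widens the band when the calculated
    # spread is very small.
    half_width = max(deviation * sigma, minimum_deviation)
    return center - half_width, center + half_width


def is_anomalous_adaptive(value, samples, deviation, minimum_deviation=0.0,
                          zone_low=None, zone_high=None):
    """Adaptive check with optional confidence zone values."""
    low, high = adaptive_bounds(samples, deviation, minimum_deviation)
    if low <= value <= high:
        return False  # inside the observed normal range
    if zone_low is None and zone_high is None:
        return True  # no confidence zone: anything outside the band counts
    below = zone_low is not None and value < zone_low
    above = zone_high is not None and value > zone_high
    return below or above


def is_anomalous_threshold(value, low, high):
    """Threshold check: values outside the fixed boundaries are anomalous."""
    return value < low or value > high


# CPU usage that normally sits near 4% (illustrative sample values).
cpu_samples = [3, 4, 5, 4, 4, 3, 5, 4]
spike_without_zone = is_anomalous_adaptive(40, cpu_samples, deviation=3)
spike_with_zone = is_anomalous_adaptive(40, cpu_samples, deviation=3,
                                        zone_low=0, zone_high=50)
```

With these sample values, a reading of 40 is flagged when no confidence zone is set and is not flagged when the zone is 0 to 50, which matches the intent of the CPU scenario.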
In the optional Anomaly Detection Advanced Settings section, you can configure additional settings to fine-tune anomaly detection for your policy:
Note
Advanced settings are optional and can produce unintended consequences. If you choose to configure Advanced settings, be sure to carefully monitor anomaly detection for any affected metric series for unintended behavior.
To revert advanced settings to the defaults, click Reset Advanced Settings To Default Values.
Anomaly Generation: This setting controls how anomalies are created.
State changes only: Creates an anomaly for every data point where the anomaly changes state (new matching alerts are created with different severity values). The State changes only setting identifies fewer duplicate anomalies, improves performance, and cuts down on noise while still conveying the same information.
Every anomalous data point: Creates an anomaly for every anomalous data point. This setting creates numerous—and often identical—events for each alert.
Hold: This setting controls the number of additional anomalous data points that must occur before a data point is identified as an anomaly. Changing this value can be helpful if you prefer to avoid creating alerts for short-term anomalous conditions that do not indicate true issues or require attention. A value of 0 will result in an anomaly being generated as soon as an out-of-range value is detected, whereas a value of 2 will result in an anomaly being generated only after three anomalous values are detected.
Hold Reset: The number of data points in the normal range that must be received for the metric before resetting the metric severity to Clear.
Learning Threshold: The number of data points Incident Management must receive before starting to generate anomalies with severity values. Until this threshold is reached, the metric is in a learning period, and events generated during this time are assigned a severity of "unknown."
NOTE: This setting is only available for metric policies using an adaptive detector.
Threshold Range Samples: Configure the number of data points to use as the representative group to determine what constitutes normal threshold values.
NOTE: This setting is only available for metric policies using an adaptive detector.
Vector: Use this setting to create anomalies for metric values:
Only when the value is below the low threshold
Only when the value exceeds the high threshold
When the value is either below the low threshold or above the high threshold
Symmetric: When enabled (the default), this setting calculates the same sigma for upper and lower threshold anomalies. When disabled, the adaptive detector calculates upper and lower sigma values to determine anomalies above and below the metric mean.
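The interaction of Vector, Hold, Hold Reset, and the State changes only option can be easier to follow as a small state machine. The Python sketch below is a simplified model, not the product's implementation: the out_of_range function, the HoldTracker class, the state_change_events function, and the vector labels "low", "high", and "both" are hypothetical names. It also assumes that the anomalous and normal data points counted by Hold and Hold Reset are consecutive, and it tracks only two states (anomalous or clear) rather than individual severities.

```python
def out_of_range(value, low, high, vector="both"):
    """Apply the Vector choice: below only, above only, or either side."""
    below = value < low
    above = value > high
    if vector == "low":
        return below
    if vector == "high":
        return above
    return below or above


class HoldTracker:
    """Track whether a metric is currently considered anomalous."""

    def __init__(self, hold=0, hold_reset=1):
        self.hold = hold              # additional anomalous points to wait for
        self.hold_reset = hold_reset  # normal points needed to return to clear
        self.anomalous_run = 0
        self.normal_run = 0
        self.in_anomaly = False

    def update(self, is_out_of_range):
        if is_out_of_range:
            self.anomalous_run += 1
            self.normal_run = 0
            # hold=0 flags the first point; hold=2 flags the third.
            if self.anomalous_run > self.hold:
                self.in_anomaly = True
        else:
            self.normal_run += 1
            self.anomalous_run = 0
            if self.in_anomaly and self.normal_run >= self.hold_reset:
                self.in_anomaly = False
        return self.in_anomaly


def state_change_events(flags, hold=0, hold_reset=1):
    """Emit an event only when the tracked state changes."""
    tracker = HoldTracker(hold, hold_reset)
    events = []
    previous = False
    for index, flag in enumerate(flags):
        current = tracker.update(flag)
        if current != previous:
            events.append((index, "anomaly" if current else "clear"))
        previous = current
    return events


# Four out-of-range points, two normal points, then one more out-of-range point.
flags = [True, True, True, True, False, False, True]
events = state_change_events(flags, hold=2, hold_reset=2)
```

In this model, with Hold set to 2 and Hold Reset set to 2, the sequence produces one anomaly event at the third out-of-range point and one clear event at the second normal point. The final isolated out-of-range point produces nothing because it does not exceed the hold.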
When your policy is complete, click Save.