Concept explainer: Alert correlation - how it works
This video explains how alerts are correlated into incidents in APEX AIOps Incident Management, as well as the behavior of the correlation time window.
*Please note that Moogsoft is now part of Dell's IT Operations solution, APEX AIOps, and has been renamed APEX AIOps Incident Management. The UI in this video may differ slightly, but the content covered is still relevant.
In Incident Management, related alerts are grouped into an incident.
In this video, you will learn how to correlate your alerts into incidents in Incident Management.
Specifically, you will be able to explain the default correlation settings and how alerts are correlated into incidents. Also, you will be able to configure new clustering settings in the correlation engine.
The correlation engine is where you manage all alert clustering configurations.
You start out with one out-of-the-box clustering setting, so your alerts will be grouped into incidents without any configuration on your end.
This is the default correlation setting. The incident created by this correlation will have a dynamically composed description. The default description shows how many sources are affected, 'unique_count(source)', and the top three sources 'unique(source,3)', services 'unique(service,3)', and event classes 'unique(class,3)' involved in the incident.
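To make the description macros more concrete, here is a minimal Python sketch of what 'unique_count(source)' and 'unique(source,3)' could compute over the alerts in an incident. The function bodies, the dictionary-based alert shape, and the first-seen ordering are assumptions for illustration only; the video does not say how Incident Management ranks the "top three" values.

```python
def unique_count(alerts, field):
    """Count the distinct values of `field` across the alerts (assumed behavior)."""
    return len({alert[field] for alert in alerts})


def unique(alerts, field, limit):
    """Return up to `limit` distinct values of `field`.

    Assumption: values are kept in first-seen order. The product may
    rank them differently.
    """
    distinct = list(dict.fromkeys(alert[field] for alert in alerts))
    return distinct[:limit]


# Hypothetical alerts, used only to exercise the two helpers.
alerts = [
    {"source": "web-01", "service": "checkout", "class": "cpu"},
    {"source": "web-02", "service": "checkout", "class": "cpu"},
    {"source": "web-01", "service": "cart", "class": "memory"},
]

print(unique_count(alerts, "source"))  # 2
print(unique(alerts, "source", 3))     # ['web-01', 'web-02']
print(unique(alerts, "class", 3))      # ['cpu', 'memory']
```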
Scope defines which alerts are going to be evaluated by this correlation definition. Consider it like an entry filter. Because there is currently only one correlation definition, it evaluates ALL alerts. It compares the source field values, and alerts whose source field values are more than 45% similar are clustered into an incident.
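As a rough illustration of a 45% similarity threshold on the source field, the sketch below uses Python's standard `difflib.SequenceMatcher` as a stand-in similarity measure. This is an assumption: the video does not describe the similarity algorithm Incident Management actually uses, and the `qualifies` helper and host names are hypothetical.

```python
from difflib import SequenceMatcher


def similarity(a, b):
    """Similarity ratio between two strings, from 0.0 to 1.0 (stand-in measure)."""
    return SequenceMatcher(None, a, b).ratio()


def qualifies(alert_source, incident_sources, threshold=0.45):
    """True if the alert's source is more than `threshold` similar
    to any source already in the incident (assumed matching rule)."""
    return any(similarity(alert_source, s) > threshold for s in incident_sources)


incident_sources = ["web-server-01"]
print(qualifies("web-server-02", incident_sources))  # True
print(qualifies("db-x9", incident_sources))          # False
```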
The time window for correlation is automatically set.
You can set the correlation time window up to 24 hours. Let me show you how it works.
When an incident is created, the correlation engine starts a timer. Let’s say we keep the time window at 15 minutes.
If more qualifying alerts come in, they are added to the incident.
But we don’t want to keep adding alerts to the same incident forever. If you keep the incident open for new alert membership indefinitely, you’ll end up mixing multiple separate issues.
So, here’s how Incident Management handles the time window.
The key concept here is 50%. 50% of the default time window of 15 minutes is 7 minutes 30 seconds.
From the moment the incident is formed, qualifying alerts keep getting added to it throughout the 15-minute time window.
But the second half, from 7 minutes 30 seconds to 15 minutes, determines when the time window for this incident actually closes.
If no qualifying alert arrives during the second half, the window closes at the default 15-minute mark.
But if a qualifying alert comes in during the second half, say, at the 14-minute mark, it triggers Incident Management to extend the correlation window. For how long?
Again, the key is 50%. 50% of 15 minutes, which is 7 minutes 30 seconds, is added from the point this alert arrived. So now the time window is set to close at 21 minutes and 30 seconds.
Suppose another alert arrives within the new correlation window, at 16 minutes. The window extends again from the arrival time.
By how much? Again, 50% of the default window!
This continues until the window reaches a maximum of 3x the initial correlation time window. In this case, with a correlation time window of 15 minutes, the maximum is 45 minutes.
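The window behavior described above can be sketched as a small Python simulation. This is an illustrative model, not the product's implementation. It assumes that any qualifying alert arriving at or after the halfway point of the initial window, and before the current close time, moves the close time to the arrival time plus 50% of the configured window, capped at 3x the configured window. The function name `window_close_minute` is hypothetical.

```python
def window_close_minute(window_min, arrivals_min):
    """Return the minute (since the incident formed) at which the window closes.

    window_min:   configured correlation time window, in minutes
    arrivals_min: arrival times of qualifying alerts, in minutes
    """
    half = window_min * 0.5        # the 50% extension amount
    max_close = window_min * 3     # the window never exceeds 3x its initial size
    close_at = float(window_min)   # with no extensions, closes at the configured mark

    for t in sorted(arrivals_min):
        if t > close_at:
            break                  # window already closed; later alerts are ignored
        if t >= half:
            # Extend from the arrival time, never shrink, never pass the cap.
            close_at = min(max(close_at, t + half), max_close)
    return close_at


print(window_close_minute(15, []))                    # 15.0
print(window_close_minute(15, [14]))                  # 21.5
print(window_close_minute(15, [14, 16]))              # 23.5
print(window_close_minute(15, [14, 21, 28, 35, 42]))  # 45.0
```

The second call reproduces the example from the video: an alert at the 14 minute mark moves the close time to 21 minutes 30 seconds. The last call shows repeated extensions stopping at the 45 minute cap.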
You can also choose how many similar alerts are needed to create an incident. The default is one alert, so every alert that arrives will either form a new incident or be added to an existing incident.
Now you know how Incident Management correlates alerts into incidents. Thanks for watching!