Correlation: Alerts into Incidents

This topic describes the Correlation Engine and the advantages of Express correlation over traditional AIOps approaches.

The Correlation Engine

The Correlation Engine uses advanced algorithms to detect correlations between different alerts and cluster these alerts into incidents. You define correlations that make sense for your organization, even with no previous knowledge of your environment.

Each definition specifies the relevant data fields and the degree of similarity needed to correlate different alerts. Express then uses natural-language processing and other advanced algorithms, along with your definitions, to correlate new alerts with previous ones.

This approach is more robust and scalable than traditional AIOps approaches based on hard-coded rules and pattern matching, especially in complex or dynamic environments that rely on containers and microservices. A rules-based approach often leads to unpredictable results and a long, ad hoc list of simplistic and often contradictory rules. By contrast, most environments, even very complex ones, require only a handful of correlation definitions, because one definition can do the analytical work of hundreds or thousands of rules.

How correlation works

In real-world terms, an incident is a cluster of alerts that all relate to the same issue. For example, an incident might consist of the following alerts:

  1. webServB23, 09:00, spike in incoming requests to myApp

  2. webServB23, 09:01, spike in CPU load

  3. webServB23, 09:01, dip in available memory

  4. webServB23, 09:02, spike in response times for myApp

The Correlation Engine identifies correlations between data fields to cluster alerts into incidents. You can correlate based on source nodes, times, alert types, custom tags, and other relevant information. You can define targeted correlations with the clustering behavior you want. Each definition has the following:

  • A filter that specifies the set of alerts to consider for correlation.

  • A similarity test that lists the alert fields to compare and the degree of similarity required for each field. Two alerts are considered similar only if every field meets its similarity criterion.

  • A correlation time period, that is, the maximum time between two alerts for them to be considered correlated.

  • A description built from the fields and other data of interest in the component alerts.
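The four parts of a definition can be pictured as a small data structure. The following Python sketch is illustrative only and is not the product's API or algorithm: the class name, the alert field names ("source", "time", "text"), and the use of difflib's SequenceMatcher as a stand-in for the engine's similarity scoring are all assumptions made for this example.

```python
from dataclasses import dataclass
from difflib import SequenceMatcher
from typing import Any, Callable, Dict

Alert = Dict[str, Any]  # hypothetical alert shape: {"source", "time", "text"}


@dataclass
class CorrelationDefinition:
    """Hypothetical model of one correlation definition."""
    alert_filter: Callable[[Alert], bool]  # which alerts to consider
    similarity: Dict[str, float]           # field -> minimum similarity (0.0 to 1.0)
    window_seconds: int                    # maximum time between two alerts
    description: str                       # template filled from alert fields


def field_similarity(a: str, b: str) -> float:
    # Stand-in similarity measure; the real engine's scoring is not shown here.
    return SequenceMatcher(None, a, b).ratio()


def correlates(defn: CorrelationDefinition, a: Alert, b: Alert) -> bool:
    """Return True if two alerts pass the filter, the time window, and every field test."""
    if not (defn.alert_filter(a) and defn.alert_filter(b)):
        return False
    if abs(a["time"] - b["time"]) > defn.window_seconds:
        return False
    return all(
        field_similarity(str(a[field]), str(b[field])) >= threshold
        for field, threshold in defn.similarity.items()
    )


def describe(defn: CorrelationDefinition, alert: Alert) -> str:
    """Fill the description template from an alert's fields."""
    return defn.description.format(**alert)


# Example definition: correlate web-server alerts whose source matches exactly,
# if they arrive within five minutes of each other.
by_node = CorrelationDefinition(
    alert_filter=lambda alert: alert["source"].startswith("webServ"),
    similarity={"source": 1.0},
    window_seconds=300,
    description="Alerts on {source}",
)
```

In this sketch, a similarity threshold of 1.0 requires an exact match on the field; a lower threshold would let near-identical values (for example, two similarly worded alert texts) correlate as well.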

A common practice is to create a correlation definition that clusters alerts by source node, so each resulting incident relates to a single node and contains alerts that fall within the correlation time period. You can create more customized definitions to cluster alerts based on alert type, application, service, location, ops team, and so on. You can use any common data fields to cluster your alerts.
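Clustering by source node can be sketched in a few lines. The Python example below is a simplified illustration under stated assumptions, not the engine's actual algorithm: it assumes exact matching on a hypothetical "source" field and extends an incident whenever a new alert from the same node arrives within the correlation time period of that incident's most recent alert. The sample data is the four-alert example from this topic; the five-minute period is an arbitrary choice.

```python
from datetime import datetime, timedelta


def at(hhmm):
    # Parse a time such as "09:00"; the date part is irrelevant for this sketch.
    return datetime.strptime(hhmm, "%H:%M")


def cluster_by_source(alerts, max_gap):
    """Group alerts into incidents, one per source node and time window."""
    incidents = []
    latest = {}  # source -> most recent incident (a list of alerts) for that node
    for alert in sorted(alerts, key=lambda a: a["time"]):
        current = latest.get(alert["source"])
        if current and alert["time"] - current[-1]["time"] <= max_gap:
            # Same node and within the correlation time period: same incident.
            current.append(alert)
        else:
            # New node, or too long since the node's last alert: new incident.
            current = [alert]
            incidents.append(current)
            latest[alert["source"]] = current
    return incidents


EXAMPLE_ALERTS = [
    {"source": "webServB23", "time": at("09:00"), "text": "spike in incoming requests to myApp"},
    {"source": "webServB23", "time": at("09:01"), "text": "spike in CPU load"},
    {"source": "webServB23", "time": at("09:01"), "text": "dip in available memory"},
    {"source": "webServB23", "time": at("09:02"), "text": "spike in response times for myApp"},
]

incidents = cluster_by_source(EXAMPLE_ALERTS, timedelta(minutes=5))
```

With a five-minute period, the four example alerts land in a single incident. An alert from a different node, or one from webServB23 arriving more than five minutes after the last, would start a separate incident.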