Correlation Engine

 

Correlation is the process of clustering alerts into incidents based on similarities in their data. Express uses correlation definitions that specify which alert data fields determine whether an alert and an incident are correlated. You can cluster your alerts into the incidents you want, based on the specific needs of your organization. You don't need broad or deep knowledge of your organization or infrastructure to define a good correlation. All you need to know is how you want to cluster your alerts (for example, by node, service, or location) and which alert data fields contain the node, service, location, or other values of interest.

Correlation definitions

The Settings > Correlation Engine page in the UI lists the active definitions in your Express instance. Each definition specifies the following:

  • The description to apply to new incidents, based on alert data of interest.

  • The definition scope — that is, whether to consider all new alerts or filter out irrelevant alerts.

  • The alert data fields to consider for correlation, such as source or service.

    Note

    A definition can have multiple fields. Each field also has a similarity setting that defines how similar the field strings must be for an alert and an incident to be considered correlated; the strings don't need to match exactly. All correlation fields are described in Defining a correlation below.

  • The correlation time window — that is, how long an incident is a candidate for correlation using this definition.

Good practices for defining correlations

The key to defining good correlations for your organization is to think about how you want to organize your alerts into incidents. Ask the following questions:

  • Which data fields do you want to use to correlate your alerts into incidents? For example, you might want to use fields such as:

    • source: Correlate alerts into incidents based on the nodes where the performance-impacting events occurred.

    • service: Correlate alerts into incidents based on the apps or services that generated the alerts.

    • location: Correlate alerts into incidents based on the physical locations where the performance-impacting events occurred.

  • How many fields do you want to consider for correlation?

    You can include multiple fields in a definition. Using fewer fields results in fewer, more general incidents that each contain more alerts. Using more fields results in more incidents that each contain fewer alerts.

  • How similar do the corresponding values need to be for an alert and an incident to be correlated?
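
The effect of the number of fields can be sketched in a few lines of Python. This is only an illustration, not how Express groups alerts: the alert values are made up, the `group_alerts` helper is hypothetical, and it uses exact matching rather than a similarity threshold.

```python
from collections import defaultdict

# Hypothetical alerts; the field names mirror the examples above.
alerts = [
    {"source": "web01", "service": "login", "location": "NYC"},
    {"source": "web01", "service": "cart", "location": "NYC"},
    {"source": "web02", "service": "login", "location": "NYC"},
    {"source": "db01", "service": "login", "location": "SFO"},
]

def group_alerts(alerts, fields):
    """Group alerts whose values match exactly on every listed field.

    Simplification: exact matching stands in for the similarity setting.
    """
    groups = defaultdict(list)
    for alert in alerts:
        key = tuple(alert[field] for field in fields)
        groups[key].append(alert)
    return dict(groups)

by_location = group_alerts(alerts, ["location"])
by_location_and_service = group_alerts(alerts, ["location", "service"])

# One field yields fewer, larger groups; two fields yield more, smaller groups.
print(len(by_location), len(by_location_and_service))  # 2 3
```

With one field, the four alerts fall into two groups. Adding a second field splits them into three.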

Before you begin

Do the following:

  1. Set up your event ingestions.

  2. Examine your alerts to determine if they include the data that you want to use for correlation. If they do not, set up your alert enrichments to add this data.

Defining a correlation

To define a correlation, go to Settings > Correlation and click Add Correlation Definition. Each definition has the following settings:

  • Correlation Name

  • Incident description

    The description to use for all incidents generated from this correlation definition. These descriptions appear in the Incidents window. You can use macros to generate incident descriptions dynamically based on the member alerts, as described below.

  • Scope

    You can specify an alert filter to limit the scope of alerts to consider for a specific correlation.

  • Fields to correlate

    The set of alert data fields to consider for correlation, and the similarity required for a match between an alert and an incident.

  • Correlation time period

    The window for correlating alerts into the same incident.

Incident description

You can specify incident descriptions dynamically, based on the alert data in each incident. For example, suppose you are defining a correlation based on the custom_info.services alert field. You can then specify a description string such as:

Incident has occurred for cited(services) Services. unique(class,3) Affected by cited(check,2) checks

Given this string, a resulting description lists the most-cited services, up to three unique class values, and the two most-cited checks among the member alerts:

Incident has occurred for ShoppingCart, Online Store Services. Storage, Compute, Network Affected by Disk, CPU checks

Incident macros

You can use the following macros to generate incident descriptions:

  • count(alert-field): Return the count of alert-field citations, including duplicates.

  • unique_count(alert-field): Return the count of unique alert-field citations, excluding duplicates.

  • tolist(alert-field): Convert a list of elements to a comma-separated string of all elements, including duplicates.

  • unique(alert-field, N): Return a comma-separated string of N unique elements in a list, excluding duplicates.

  • top(alert-field): Return the top-cited item.

  • cited(alert-field, N): Return a list of the N most-cited items. If two or more items have the same cite count, the items are sorted alphabetically.
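
The behavior of these macros can be approximated in Python. This is a sketch under stated assumptions, not the Express implementation: each function takes the list of values that the member alerts hold for one field, `unique` is assumed to keep first-seen order, and the sample `services` list is made up.

```python
from collections import Counter

def count(values):
    """Number of citations, including duplicates."""
    return len(values)

def unique_count(values):
    """Number of distinct citations."""
    return len(set(values))

def tolist(values):
    """Comma-separated string of all elements, including duplicates."""
    return ", ".join(values)

def unique(values, n):
    """Comma-separated string of the first n distinct elements.

    Assumption: first-seen order. The source does not state the ordering.
    """
    return ", ".join(list(dict.fromkeys(values))[:n])

def cited(values, n):
    """The n most-cited items; ties are sorted alphabetically."""
    counts = Counter(values)
    ranked = sorted(counts, key=lambda item: (-counts[item], item))
    return ", ".join(ranked[:n])

def top(values):
    """The single most-cited item."""
    return cited(values, 1)

# Hypothetical values of one field across an incident's member alerts.
services = ["ShoppingCart", "Online Store", "ShoppingCart", "Search"]

print(count(services))         # 4
print(unique_count(services))  # 3
print(cited(services, 2))      # ShoppingCart, Online Store
print(top(services))           # ShoppingCart
```

Here ShoppingCart ranks first because it is cited twice. Online Store and Search are tied at one citation each, so the alphabetical tie-break puts Online Store ahead.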

Scope

If the correlation is relevant only to a subset of alerts, you can enter a search string to consider only alerts of interest.

Fields to correlate

The set of alert data fields to consider for correlation, and the similarity required for a match between an alert and an incident. An alert and an incident are considered correlated if all the data fields in this table meet the specified degree of similarity.

Good practices for defining correlations provides guidance about the specific fields you might want to consider for correlation.

Alert field similarity

Express uses two natural-language processing methods, the bag-of-words model and shingling, to calculate the text similarity between two fields. The following example illustrates how Express determines if two text fields are similar.

  1. A correlation definition specifies service as the one field to correlate, with a similarity threshold of 90%.

  2. The correlation engine receives an alert with service = loginver012. A candidate incident includes loginver011 as one of its services.

  3. To determine if the alert correlates with the incident, the engine does the following:

    1. Splits each string into a set of shingles based on the default shingle size, which is 2.

    2. Compares each 2-character sequence in the alert field with the corresponding sequence in the incident field:

      lo og gi in nv ve er r0 01 12
      lo og gi in nv ve er r0 01 11

    3. In this case, 9 out of the 10 shingles are identical, which is a similarity of 90%. Because the similarity threshold is 90%, this alert meets the correlation threshold. The correlation engine adds the new alert to the incident.
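
The steps above can be sketched in Python. This is a minimal illustration of the example as described, not the Express algorithm: the function names are hypothetical, shingles are compared position by position, strings of different lengths are scored against the longer shingle list, and an incident is assumed to hold a list of values per field.

```python
def shingles(text, size=2):
    """Return the overlapping character sequences of the given size."""
    return [text[i:i + size] for i in range(len(text) - size + 1)]

def similarity(alert_value, incident_value, size=2):
    """Fraction of shingle positions that hold identical shingles.

    Assumption: unmatched positions in the longer string count as misses.
    """
    a = shingles(alert_value, size)
    b = shingles(incident_value, size)
    longest = max(len(a), len(b))
    if longest == 0:
        # Both strings are shorter than one shingle; fall back to equality.
        return 1.0 if alert_value == incident_value else 0.0
    matches = sum(1 for x, y in zip(a, b) if x == y)
    return matches / longest

def is_correlated(alert, incident, thresholds):
    """True if every field meets its threshold against some incident value."""
    for field, threshold in thresholds.items():
        candidates = incident.get(field, [])
        value = alert.get(field, "")
        if not any(similarity(value, c) >= threshold for c in candidates):
            return False
    return True

alert = {"service": "loginver012"}
incident = {"service": ["billing01", "loginver011"]}

print(shingles("loginver012"))
print(similarity("loginver012", "loginver011"))            # 0.9
print(is_correlated(alert, incident, {"service": 0.90}))   # True
```

The `is_correlated` helper also reflects the rule that an alert and an incident are correlated only if all fields in the definition meet their required similarity.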

Correlation time period

The time period for correlating alerts into the same incident, starting from the incident creation time. When the correlation period ends, Express correlates alerts into a new incident.

The options are as follows:

  • Automatic: With this option, the correlation period is 15 minutes but can extend up to 30 minutes. Each extension is 3 minutes 45 seconds, which is one-fourth of the initial correlation period. The correlation period can extend as follows:

    • If a new alert arrives within 3 minutes 45 seconds of the expiration time (that is, from 11 minutes 15 seconds to 15 minutes after incident creation), the period extends by 3 minutes 45 seconds, to 18 minutes 45 seconds.

    • If another alert arrives from 15 minutes to 18 minutes 45 seconds after incident creation, the period extends by another 3 minutes 45 seconds, to 22 minutes 30 seconds.

    • This process can repeat until the period reaches 30 minutes, the maximum possible time period.

  • Manual: Select this option if you know the time window that you want to specify.
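
The Automatic extension rule can be modeled with a short Python sketch. It is an interpretation of the description above, not Express code: the function name is hypothetical, arrival times are given in seconds after incident creation, and an alert that arrives exactly at the expiration time is assumed to still count.

```python
BASE_PERIOD = 15 * 60          # initial correlation period: 900 seconds
EXTENSION = BASE_PERIOD // 4   # one-fourth of the period: 225 seconds
MAX_PERIOD = 30 * 60           # maximum correlation period: 1800 seconds

def automatic_expiry(arrivals):
    """Return the final expiration time, in seconds after incident creation.

    arrivals: arrival times of correlated alerts, in seconds after creation.
    """
    expiry = BASE_PERIOD
    for t in sorted(arrivals):
        if t > expiry:
            # The period already ended; later alerts go to a new incident.
            break
        if t >= expiry - EXTENSION:
            # The alert arrived in the last 3 minutes 45 seconds: extend.
            expiry = min(expiry + EXTENSION, MAX_PERIOD)
    return expiry

print(automatic_expiry([100]))        # 900  (no extension)
print(automatic_expiry([700]))        # 1125 (18 minutes 45 seconds)
print(automatic_expiry([700, 1000]))  # 1350 (22 minutes 30 seconds)
```

An alert at 700 seconds falls inside the final 225 seconds of the 900-second period, so the period extends to 1125 seconds. A second alert at 1000 seconds falls inside the new final window and extends it again.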