Correlate Alerts into Incidents

Correlation is the process of clustering alerts into incidents based on similarities in their data. Express uses correlation definitions that specify the data fields of interest to determine whether an alert and an incident are correlated. You can easily correlate alerts into the incidents you want, based on the specific needs of your organization. You don't need any broad or deep knowledge of your organization or infrastructure to define a good correlation. All you need to know is how you want to correlate your alerts (such as by node, service, or location) and the alert data fields that contain the relevant information.
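
Conceptually, a correlation definition pairs the fields you care about with how closely their values must match. The sketch below shows that shape as plain data; the keys and structure are illustrative assumptions, not Express's actual configuration schema:

```python
# Illustrative shape only: the keys below are assumptions, not
# Express's real configuration schema.
correlation_definition = {
    "name": "Same service, same location",
    "fields": [
        {"field": "service",  "similarity": 0.80},  # allow slight name variants
        {"field": "location", "similarity": 1.00},  # require an exact match
    ],
}
```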

Correlation example: how it works

This example illustrates how Express clusters alerts with just a few high-level bits of information from you.

You're a DevOps engineer responsible for setting up Express. You're using Datadog, AWS CloudWatch, and Moogsoft collectors to monitor your infrastructure. You go to the Correlation editor and create a new correlation.

You want to cluster application, system, cloud, and network alerts for the same service in the same location. When prompted for the alert fields to correlate, you select the following:

  • service field, with 80% similarity. The 80% similarity helps cluster alerts with slight variations on the same service name.

  • location field, with 100% similarity. You want to make sure that all the alerts in each incident are from the same location.

correlation-example-screenshot.png
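
To see what an 80% similarity threshold means in practice, here is a sketch using a generic string-similarity ratio. The actual similarity algorithm Express uses is not specified here, so Python's `difflib` is an illustrative stand-in:

```python
from difflib import SequenceMatcher

def meets_threshold(a: str, b: str, threshold: float) -> bool:
    """Return True when two field values are at least `threshold` similar."""
    return SequenceMatcher(None, a, b).ratio() >= threshold

# Slight variations on the same service name clear an 80% bar...
print(meets_threshold("custLogin_1.1", "cust-login_1.2", 0.8))   # True
# ...while a 100% threshold requires an exact match.
print(meets_threshold("us-east-1", "us-west-1", 1.0))            # False
print(meets_threshold("us-west-1", "us-west-1", 1.0))            # True
```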

You accept the defaults for the other options. You click Save and the correlation engine processes new alerts using your correlation:

  1. AWS CloudWatch sends an alarm to the Events API: inbound network traffic is high on cntnr23.

    1. The alarm passes through the enrichment and deduplication pipelines.

    2. The alarm arrives at the correlation engine as Alert 100 with class = Network, service = dbQuery_1.0, and location = us-east-1.

    3. Your correlation filter allows for class = Network, so this is a candidate for correlation. The engine creates a new incident based on this alert. As yet there are no correlations with any other alerts.

    v3-correlation-example-01.png
  2. AWS CloudWatch flags an anomaly and sends it to the Metrics API: a spike in 4xx (client error) responses at a specific endpoint.

    1. The anomaly passes through the pipeline and arrives at the correlation engine as Alert 101 with class = API.

    2. The correlation scope does not include class = API, so this alert is not a candidate for correlation.

  3. A Moogsoft collector flags an anomaly and sends it to the Metrics API: free memory is down on cntnr00.

    1. The anomaly passes through the pipeline and arrives at the correlation engine as Alert 102 with class = System, service = wsfe_1.0, and location = us-west-1.

    2. The alert passes your correlation filter, so the engine creates a new incident. Again there are no correlations based on your definitions.

    v3-correlation-example-02.png
  4. The anomaly detector flags another anomaly in the Datadog metrics: HTTP response times are spiking on cntnr01.

    1. The anomaly arrives at the correlation engine as Alert 103 with class = Application, service = custLogin_1.0, and location = us-west-1.

    2. The engine finds that Alerts 102 and 103 are correlated based on your definition: their services meet the 80% similarity threshold (custLogin_1.0 and wsfe_1.0), and their locations are both us-west-1.

    3. The engine adds Alert 103 to Incident 2.

    v3-correlation-example-03.png
  5. The anomaly detector flags two CloudWatch metrics for another container: CPU utilization and disk write operations are significantly down on containers running the customer-login service.

    1. Alert 104 arrives with class = Cloud, service = custLogin_1.1, and location = us-east-2.

    2. This alert passes the correlation filter but does not correlate with any open alerts. The engine creates Incident 3 with Alert 104.

    3. Alert 105 arrives with class = Cloud, service = cust-login_1.2, and location = us-east-2.

    4. The engine finds that Alerts 104 and 105 are correlated based on their similar service names and identical locations. The engine adds Alert 105 to Incident 3.

    v3-correlation-example-04.png
  6. Another 15 minutes go by. Express ingests more alerts and adds them to incidents based on your correlation. Another alert arrives from CloudWatch, noting that network traffic is still high on cntnr23.

    1. The alarm arrives at the correlation engine as Alert 117 with class = Network, service = dbQuery_1.0, and location = us-east-1.

    2. This alert matches your correlation filter. It also meets the service and location similarities with Alert 100 in Incident 1.

      However, the correlation time window for Incident 1 has expired. Each correlation has a configurable time window, which is set to 15 minutes by default.

    3. The engine creates a new incident and adds Alert 117 to it.

    v3-correlation-example-v5.png
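
Putting the pieces together, the engine behavior in this walkthrough (the scope filter, field similarity, and the 15-minute window) can be sketched roughly as follows. This is a hypothetical illustration, not Express's implementation: the similarity measure, the alert stream, and the rule that the window starts when an incident opens are all assumptions.

```python
from dataclasses import dataclass
from difflib import SequenceMatcher

SCOPE = {"Network", "System", "Cloud", "Application"}  # classes in scope
WINDOW = 15  # correlation time window, in minutes

@dataclass
class Incident:
    alert_ids: list
    service: str
    location: str
    opened_at: int  # minutes since the first alert, for window checks

def correlated(inc: Incident, alert: dict, now: int) -> bool:
    # Assumed rule: service at least 80% similar, location identical,
    # and the incident's correlation window still open.
    if now - inc.opened_at > WINDOW:
        return False
    service_ok = SequenceMatcher(None, inc.service,
                                 alert["service"]).ratio() >= 0.8
    return service_ok and inc.location == alert["location"]

def process(alerts: list) -> list:
    incidents = []
    for a in alerts:
        if a["class"] not in SCOPE:
            continue  # outside the correlation filter: not a candidate
        for inc in incidents:
            if correlated(inc, a, a["t"]):
                inc.alert_ids.append(a["id"])
                break
        else:  # no open incident matched, so open a new one
            incidents.append(
                Incident([a["id"]], a["service"], a["location"], a["t"]))
    return incidents

# Hypothetical alert stream (ids, services, and times are made up).
stream = [
    {"id": 1, "class": "Cloud", "service": "custLogin_1.1",
     "location": "us-east-2", "t": 0},
    {"id": 2, "class": "API", "service": "checkout_2.0",
     "location": "us-east-2", "t": 1},   # filtered out: class not in scope
    {"id": 3, "class": "Cloud", "service": "cust-login_1.2",
     "location": "us-east-2", "t": 5},   # similar service, same location
    {"id": 4, "class": "Cloud", "service": "custLogin_1.1",
     "location": "us-east-2", "t": 30},  # window expired: new incident
]
for n, inc in enumerate(process(stream), start=1):
    print(f"Incident {n}: alerts {inc.alert_ids}")
```

Run as-is, this prints `Incident 1: alerts [1, 3]` followed by `Incident 2: alerts [4]`: the out-of-scope alert is dropped, the similar service in the same location joins the open incident, and the late arrival opens a new incident because the window has closed. Real engines may also extend or reset the window as alerts arrive; here it is anchored to when the incident opened, matching the behavior described in step 6.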