Correlate Alerts into Incidents

Correlation is the process of clustering related alerts into incidents. Moogsoft uses correlation definitions that specify the data fields of interest to determine if an alert and incident are correlated. To define an effective correlation, you need to determine the following:

  1. How you want to correlate your alerts — such as by node, service, or location.

  2. The alert fields in your data that contain the relevant information.

Correlation example: how it works

This example illustrates how Moogsoft clusters alerts with just a few high-level bits of information from you.

You're a DevOps engineer responsible for setting up Moogsoft. You're using Datadog, AWS CloudWatch, and Moogsoft collectors to monitor your infrastructure. You go to the Correlation editor and create a new correlation.

You want to cluster application, system, cloud, and network alerts for the same service in the same location. When prompted for the alert fields to correlate, you select the following (a sketch of this matching logic appears after the screenshot below):

  • service field, with 80% similarity. The 80% similarity helps cluster alerts with slight variations on the same service name.

  • location field, with 100% similarity. You want to make sure that all the alerts in each incident are from the same location.

correlation-example-screenshot.png
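
To make the matching step concrete, here is a minimal sketch of the kind of comparison this correlation performs. It uses Python's difflib ratio as a stand-in for Moogsoft's own similarity metric; the function names, field names, and data shapes are illustrative assumptions, not the product's implementation.

```python
from difflib import SequenceMatcher

# Hypothetical thresholds matching the fields selected above.
CORRELATION_FIELDS = {"service": 0.80, "location": 1.00}

def similarity(a: str, b: str) -> float:
    """Rough stand-in for a string-similarity score between 0.0 and 1.0."""
    return SequenceMatcher(None, a, b).ratio()

def alerts_correlate(alert_a: dict, alert_b: dict) -> bool:
    """Two alerts correlate only if every configured field meets its threshold."""
    return all(
        similarity(alert_a.get(field, ""), alert_b.get(field, "")) >= threshold
        for field, threshold in CORRELATION_FIELDS.items()
    )

# Slight service-name variations (as in the walkthrough below) still correlate,
# because the service names exceed the 80% threshold and the locations are identical.
a = {"service": "custLogin_1.1", "location": "us-east-2"}
b = {"service": "cust-login_1.2", "location": "us-east-2"}
print(alerts_correlate(a, b))   # True
```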

You accept the defaults for the other options. You click Save, and the correlation engine processes new alerts using your correlation (a code sketch of the overall decision flow follows this walkthrough):

  1. AWS CloudWatch sends an alarm to the Events API: inbound network traffic is high on cntnr23.

    1. The alarm passes through the enrichment and deduplication pipelines.

    2. The alarm arrives at the correlation engine as Alert 100 with class = Network, service = dbQuery_1.0, and location = us-east-1.

    3. Your correlation filter allows for class = Network, so this is a candidate for correlation. The engine creates a new incident based on this alert. As yet there are no correlations with any other alerts.

    v3-correlation-example-01.png
  2. AWS CloudWatch flags an anomaly and sends it to the Metrics API: a spike in 4xx (bad requests) at a specific endpoint.

    1. The anomaly passes through the pipeline and arrives at the correlation engine as Alert 101 with class = API.

    2. The correlation scope does not include class = API, so this alert is not a candidate for correlation.

  3. A Moogsoft collector flags an anomaly and sends it to the Metrics API: free memory is down on cntnr00.

    1. The anomaly passes through the pipeline and arrives at the correlation engine as Alert 102 with class = System, service = wsfe_1.0, and location = us-west-1.

    2. The alert passes your correlation filter, so the engine creates a new incident. Again there are no correlations based on your definitions.

    v3-correlation-example-02.png
  4. The anomaly detector flags another anomaly in the Datadog metrics: HTTP response times are spiking on cntnr01.

    1. The anomaly arrives at the correlation engine as Alert 103 with class = Application, service = wsfe_1.1, and location = us-west-1.

    2. The engine finds that Alerts 102 and 103 are correlated based on your definition: their services meet the 80% similarity threshold (wsfe_1.0 and wsfe_1.1) and their locations are both us-west-1.

    3. The engine adds Alert 103 to Incident 2.

    v3-correlation-example-03.png
  5. The anomaly detector flags anomalies in two CloudWatch metrics: CPU utilization and disk write operations are significantly down on containers running the customer-login service.

    1. Alert 104 arrives with class = Cloud, service = custLogin_1.1, and location = us-east-2.

    2. This alert passes the filter but does not correlate with any open alerts. The engine creates Incident 3 with Alert 104.

    3. Alert 105 arrives with class = Cloud, service = cust-login_1.2, and location = us-east-2.

    4. The engine finds that Alerts 104 and 105 are correlated based on their similar service names and identical locations. The engine adds Alert 105 to Incident 3.

    v3-correlation-example-04.png
  6. Another 15 minutes go by. Moogsoft ingests more alerts and adds them to incidents based on your correlation. Another alert arrives from CloudWatch, noting that network traffic is still high on cntnr23.

    1. The alarm arrives at the correlation engine as Alert 117 with class = Network, service = dbQuery_1.0, and location = us-east-1.

    2. This alert matches your correlation filter. It also meets the service and location similarities with Alert 100 in Incident 1.

      However, the correlation time window for Incident 1 has expired. Each correlation has a configurable time window, which is set to 15 minutes by default.

    3. The engine creates a new incident and adds Alert 117 to it.

    v3-correlation-example-v5.png
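
If it helps to see the whole decision flow in one place, here is a highly simplified sketch of what the walkthrough above illustrates: a scope filter, a similarity check against alerts in open incidents, and a correlation time window. The field names, thresholds, and data structures are assumptions carried over from this example, not Moogsoft's internal implementation.

```python
import time
from difflib import SequenceMatcher

FIELDS = {"service": 0.80, "location": 1.00}                    # fields and thresholds from above
SCOPE_CLASSES = {"Network", "System", "Cloud", "Application"}   # example scope filter
WINDOW_SECONDS = 15 * 60                                        # default 15-minute window

incidents: list[dict] = []   # each incident: {"alerts": [...], "opened_at": <epoch seconds>}

def correlated(a: dict, b: dict) -> bool:
    """True when every configured field meets its similarity threshold."""
    return all(
        SequenceMatcher(None, a.get(f, ""), b.get(f, "")).ratio() >= t
        for f, t in FIELDS.items()
    )

def process_alert(alert: dict, now: float | None = None) -> str:
    """Scope-filter the alert, then correlate it into an open incident or seed a new one."""
    if now is None:
        now = time.time()

    # 1. Scope filter: out-of-scope alerts are never candidates (e.g. Alert 101, class = API).
    if alert.get("class") not in SCOPE_CLASSES:
        return "not a candidate"

    # 2. Try each incident whose correlation window is still open.
    for incident in incidents:
        if now - incident["opened_at"] > WINDOW_SECONDS:
            continue   # window expired, as with Alert 117 and Incident 1
        if any(correlated(alert, existing) for existing in incident["alerts"]):
            incident["alerts"].append(alert)
            return "correlated into an existing incident"

    # 3. No match: the alert seeds a new incident.
    incidents.append({"alerts": [alert], "opened_at": now})
    return "new incident"
```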

Watch a video

Watch a video on the Correlation Engine in Moogsoft.

Watch a video on Configuring a Correlation Engine in Moogsoft Observability Cloud.

  • In Moogsoft Express, related alerts are grouped into an incident.

  • In this video, you will learn how to correlate your alerts into incidents in Moogsoft Express.  

    Specifically, you will be able to explain the default correlation settings and how alerts are correlated into incidents.  Also, you will be able to configure new clustering settings in the correlation engine.

  • Correlation engine is where you manage all alert clustering configurations.  

    You start out with one out-of-the-box clustering setting, so your alerts will be grouped into incidents without any configuration on your end.

  • This is the default correlation setting.  The incident created by this correlation has a dynamically composed description.  The default description shows how many sources are affected, 'unique_count(source)', and the top three sources, 'unique(source,3)', services, 'unique(service,3)', and event classes, 'unique(class,3)', involved in the incident (the sketch after this bullet shows how these macros expand).

    Scope defines which alerts are evaluated by this correlation definition.  Consider it an entry filter.  Since right now there is only one correlation, it evaluates ALL alerts.  It compares the source field values, and alerts whose source values are more than 45% similar are clustered into an incident.

    The time window for correlation is automatically set.
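
As a rough illustration of what those description macros compute, the sketch below defines unique() and unique_count() helpers over an incident's alerts and composes a description from them. The macro names come from the video; the expansion logic and sample data are assumptions for illustration, not the product's template engine.

```python
def unique(alerts, field, limit=None):
    """Distinct values of `field` across an incident's alerts, in arrival order."""
    values = list(dict.fromkeys(a[field] for a in alerts if field in a))
    return values[:limit] if limit is not None else values

def unique_count(alerts, field):
    """Number of distinct values of `field` across an incident's alerts."""
    return len(unique(alerts, field))

# Hypothetical incident contents, reusing values from the example above.
incident_alerts = [
    {"source": "cntnr00", "service": "wsfe_1.0", "class": "System"},
    {"source": "cntnr01", "service": "wsfe_1.1", "class": "Application"},
]

description = (
    f"{unique_count(incident_alerts, 'source')} sources: "
    f"{', '.join(unique(incident_alerts, 'source', 3))}; "
    f"services: {', '.join(unique(incident_alerts, 'service', 3))}; "
    f"classes: {', '.join(unique(incident_alerts, 'class', 3))}"
)
print(description)
# 2 sources: cntnr00, cntnr01; services: wsfe_1.0, wsfe_1.1; classes: System, Application
```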

  • What do we mean by automatically set?  Express uses a flexible time window between 15 and 45 min.  Let me show you how it works.

  • When an incident is created, the correlation engine starts a timer.

  • If more qualifying alerts come in, they are added to the incident.

    But we don’t want to keep adding alerts to the same incidents forever.  If you keep the incident open for new alert membership for an indefinite amount of time, you’ll end up mixing multiple separate issues.

  • So here’s how Express handles the time window.

    Alerts keep getting added until the 11-minute, 15-second mark after the moment the incident formed.

  • But if no alerts get added between that point and the 15-minute mark, the window closes at the 15-minute mark.

  • If another qualifying alert does come in during that time frame, Express extends the time window by 3 minutes 45 seconds, setting the new time window to 18:45.

  • If yet another qualifying alert comes in before the new time limit, the window extends by 3 minutes 45 seconds again. This continues up to a maximum of 45 minutes, as the sketch below illustrates.

  • So, this is the default correlation definition.

  • Now, I’m sure you want to add your own correlation definitions that take care of your specific use case, so we are going to try that next.

  • Consider this use case.  You are implementing Moogsoft Express to make it easier for the infrastructure team to collaborate with the application team that supports the impacted services. The infrastructure team is organized by region, and each regional team breaks down into a few service categories.

    So, if the server that supports service A in the US suffers, you want this team (AMERICAS - service A) and this team (Application 1 - service A) to come together and investigate.

    Also, since the infrastructure teams are organized by location, you need to differentiate the events originated in different regions.  Just because service A is impacted, you don’t want to bundle alerts from the AMERICAS and EMEA clusters together.

  • You also noted that in the CMDB some of the values are not spelled consistently, like this for example.  But we want to treat these as the same service.

  • Based on the analysis so far, here’s how we want to correlate our alerts for this use case.  

    • Evaluate only the alerts whose class is Application or Infrastructure

    • Location - 100% match

    • Service - 80% match to accommodate the variation in spelling.

    Now let me show you how to set this up.

  • Provide a name that makes sense to other administrators too.

    What you put into the description field will be used as the incident description (switch to a sample incident screenshot and highlight the description area), so you want to make it as helpful as possible. We are going to use macros to insert dynamic information here. We want the description to mention the impacted service and the location. (As you enter the filter value:) Scope is basically an entry filter. In our use case, this correlation only applies to alerts whose class is Application or Infrastructure.

    We want to cluster alerts from the exact same location and the same service; to accommodate the variation in spelling, we’ll set the service similarity value at 80%. We’ll leave the rest at the defaults for now and save.

  • Now, let’s also modify the default correlation, so the alerts evaluated by the new correlation definition won’t also be evaluated by the default correlation and end up in two separate incidents.

  • We are going to add a scope filter and exclude the alerts whose class field values are Application or Infrastructure. (The sketch below shows both definitions side by side.)
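
To summarize the two definitions the video ends up with, here they are expressed as a small, hypothetical data structure with an ordered scope check. In the product you configure this in the Correlation Engine UI; the structure, the route() helper, and the sample alert values below are assumptions for illustration only.

```python
# Hypothetical representation of the two correlation definitions described above.
correlations = [
    {
        "name": "Service/location correlation",
        # Scope: only Application and Infrastructure alerts enter this definition.
        "scope": lambda alert: alert.get("class") in {"Application", "Infrastructure"},
        "fields": {"location": 1.00, "service": 0.80},   # similarity thresholds
    },
    {
        "name": "Default correlation",
        # Modified scope: exclude the classes the new definition already handles,
        # so the same alert cannot end up in two incidents.
        "scope": lambda alert: alert.get("class") not in {"Application", "Infrastructure"},
        "fields": {"source": 0.45},                      # default source similarity
    },
]

def route(alert: dict) -> str:
    """Return the name of the first correlation definition whose scope matches the alert."""
    for definition in correlations:
        if definition["scope"](alert):
            return definition["name"]
    return "unmatched"

print(route({"class": "Infrastructure", "service": "DB Query", "location": "San Francisco"}))
# -> Service/location correlation
print(route({"class": "Network", "source": "sf-dc-07"}))
# -> Default correlation
```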

  • Now let’s simulate what happens to the alerts with this new correlation definition.  Here’s the scenario.  The server in our San Francisco data center that supports a database query service has overheated and started to fail.  Of course no one knows that yet.

    The first event comes in… The event goes through enrichment to add a value to the service field, and gets deduplicated into an alert. Right now this is the first event of its kind, so it simply becomes an alert; of course, if another event with the same dedupe key value arrives, it will be deduplicated into this alert.

    Now it arrives at the correlation engine.

  • The class field value matches the scope, so it enters the correlation we just configured. Since this is the very first alert in this correlation, it becomes an incident on its own.

  • The next event arrives. It is also impacting the same service as the last event, but note that its class value is not Infrastructure or Application.

  • So it is filtered out by the correlation we set up. Instead, it goes into the default correlation and becomes a new, separate incident.

  • Here comes another event. It’s enriched, deduplicated, and...

  • Now, this meets the correlation criteria we’ve defined, so this alert is bundled together with the first alert and becomes part of incident 001. Since we set the service match percentage to 80% rather than 100%, it accommodates the difference in capitalization.

  • Now, this incident contains an infrastructure alert and an application alert on the same service in the same location, so the infrastructure team and the application team can come together and investigate.