Use case video: Power of alert correlation

Watch a video on Moogsoft Alert Correlation.

We’ll troubleshoot a sample incident to see the power of alert correlation.

Here’s a simple setup that we’ll be using for our demonstration. We have a few different sources that are sending data to Moogsoft.

[Image: Billing scenario data sources]

For starters, we’re ingesting metrics using the Metrics API. Moogsoft is data agnostic, meaning that these metrics can be of any type, and can originate from any source.

We’re also using Splunk for logs and analytics, AppDynamics for application performance monitoring, and Prometheus for database and infrastructure performance monitoring.
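
To make the "data agnostic" point concrete, here is a minimal sketch of what pushing a metric datapoint to an ingestion endpoint could look like. This is not shown in the video, and the endpoint URL, header name, and payload fields are placeholders and assumptions for illustration only; the actual Metrics API contract is defined in the Moogsoft API documentation.

```python
import time
import requests

# Hypothetical endpoint and credentials -- placeholders, not the real Moogsoft contract.
METRICS_ENDPOINT = "https://api.example-moogsoft-instance.com/v1/metrics"
API_KEY = "YOUR_API_KEY"

def send_metric(metric, value, source, tags=None):
    """Send a single metric datapoint. Field names here are illustrative assumptions."""
    payload = {
        "metric": metric,          # e.g. "packet_drops"
        "data": value,             # the measured value
        "source": source,          # host or device emitting the metric
        "time": int(time.time()),  # epoch seconds
        "tags": tags or {},
    }
    resp = requests.post(
        METRICS_ENDPOINT,
        json=payload,
        headers={"apiKey": API_KEY},
        timeout=10,
    )
    resp.raise_for_status()
    return resp

# Because ingestion is data agnostic, the same call works for any metric type:
send_metric("packet_drops", 37, "atl-switch-01", tags={"datacenter": "Atlanta"})
send_metric("db_query_latency_ms", 910, "atl-db-02", tags={"service": "Billing"})
```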

And here’s what we are monitoring.

We have a three-tier, customer-facing application called Billing that relies on databases in Atlanta and New York.

[Image: Billing scenario service]

One of the switches in the Atlanta data center experiences some degradation - maybe some packet loss - which impedes the communication between the application server and database.

[Image: Billing scenario faulty switch]

Since all three of these domains are monitored by different teams using different tools, we’ll see application alerts...

[Image: application alerts]

Database alerts...

[Image: database alerts]

And network alerts.

[Image: network alerts]

As an operator trying to resolve this problem, you often don’t have visibility into what other teams are seeing in their monitoring tools. So it takes time to synthesize your own analysis of what you can see with the insights from others.

That’s where Moogsoft comes in. Moogsoft identifies related alerts and groups them together, organizes them into relevant incidents, and presents these incidents with rich context, making it easy for us to identify the underlying problem.

[Image: alert correlation graphic]
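
To give a feel for the idea of grouping related alerts into incidents (this is a simplified illustration, not Moogsoft's actual correlation algorithm), here is a hypothetical sketch that groups alerts affecting the same service within a short time window:

```python
from dataclasses import dataclass, field

@dataclass
class Alert:
    source: str              # e.g. "AppDynamics", "Prometheus"
    service: str             # e.g. "Billing"
    description: str
    first_event_time: float  # epoch seconds

@dataclass
class Incident:
    service: str
    alerts: list = field(default_factory=list)

def correlate(alerts, window_seconds=300):
    """Group alerts for the same service that occur within `window_seconds`
    of the incident's first alert. Purely illustrative, not Moogsoft's algorithm."""
    incidents = []
    for alert in sorted(alerts, key=lambda a: a.first_event_time):
        for incident in incidents:
            first = incident.alerts[0].first_event_time
            if incident.service == alert.service and alert.first_event_time - first <= window_seconds:
                incident.alerts.append(alert)
                break
        else:
            incidents.append(Incident(service=alert.service, alerts=[alert]))
    return incidents
```

In the Billing scenario, the application, database, and network alerts would all land in one incident because they hit the same service within seconds of each other.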

Let me show you what I mean.

Here is the Incidents panel, where we’ll start our triage efforts.

[Image: Incidents panel]

And this is the incident you’d get for the scenario we just went over.

[Image: incident highlighted]

The service impacted by this incident is Billing...

[Image: impacted service highlighted]

And the description has been dynamically populated to tell us the critical information - in this case, the locations affected by the incident.

[Image: incident description highlighted]

And this incident contains 11 alerts, which likely came from all four data sources.

[Image: incident containing 11 alerts]

Sure enough, we got some application-related alerts from AppDynamics, database alerts via Prometheus and Splunk, infrastructure alerts from Prometheus, and network alerts from the Metrics API. This context helps us decide which teams in our organization we might want to engage.

[Image: alerts spanning four classes]

We are going to own this incident.

Now let’s examine the alerts.

We’ll sort these alerts by First Event Time so we can see how the issue started, and how it evolved.

[Image: sorting alerts by First Event Time]

You can see in the Manager column how this incident combines alerts from four different data sources.

[Image: Manager column]

And under event count, you can see how Moogsoft provides noise reduction by deduplicating events into alerts.

[Image: event count column]
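
Conceptually, deduplication collapses repeated events that describe the same condition into a single alert with a running event count. Here is a hypothetical sketch of that idea; the dedup key and field names are assumptions for illustration, not Moogsoft's internal logic:

```python
def deduplicate(events):
    """Collapse events that share a dedup key (source + check) into one alert,
    keeping a running event count. Illustrative only; field names are assumptions."""
    alerts = {}
    for event in events:
        key = (event["source"], event["check"])
        if key not in alerts:
            alerts[key] = {**event, "event_count": 1}
        else:
            alerts[key]["event_count"] += 1
            # Keep the most recent severity for the alert.
            alerts[key]["severity"] = event["severity"]
    return list(alerts.values())

events = [
    {"source": "atl-switch-01", "check": "packet_drops", "severity": "warning"},
    {"source": "atl-switch-01", "check": "packet_drops", "severity": "critical"},
    {"source": "atl-db-02", "check": "query_timeout", "severity": "critical"},
]
# Three events become two alerts; the switch alert carries event_count == 2.
print(deduplicate(events))
```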

Now let’s look at the individual alerts.

The very first alert came from a switch in the Atlanta data center. It looks like the packet drops went out of bounds.

[Image: packet drops alert]

Only a few seconds after this alert was created, we began receiving additional alerts about database timeouts...

[Image: database timeout alerts]

Application page load failures...

[Image: application page load failure alerts]

And various other issues.

So from this, it seems that the packet drop alert might be the root cause of this incident.

Let’s do some further investigation by looking at the metrics for this incident. We’ll only examine relevant data from the past hour.

[Image: metrics for the past hour]

For each of these metrics, the pink line represents the raw values of the metric at any given interval. The light gray band in the background represents the normal operating range of the metric, as calculated by Moogsoft. And these colored dots represent anomalies in the metric that Moogsoft has detected and classified, in terms of significance.

This anomaly was initially classified as a Warning, because it only deviated slightly from the normal operating range.

[Image: anomaly classified as Warning]

But as the metric continued to stray out of bounds, the anomaly was reclassified as Critical...

[Image: anomaly reclassified as Critical]

And was later cleared when the metric went back in bounds.

[Image: anomaly cleared]
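
The Warning, then Critical, then cleared lifecycle can be pictured as thresholding on how far a value strays from its normal operating band. Here is a simplified sketch of that idea, assuming a fixed band and a fixed multiplier rather than Moogsoft's own calculation:

```python
def classify(value, band_low, band_high, critical_factor=2.0):
    """Classify a datapoint against a normal operating band.
    The band and the critical_factor multiplier are illustrative assumptions."""
    if band_low <= value <= band_high:
        return "clear"  # back in bounds
    width = band_high - band_low
    deviation = (value - band_high) if value > band_high else (band_low - value)
    # Small excursions are warnings; larger ones are critical.
    return "critical" if deviation > critical_factor * width else "warning"

# Packet drops drifting out of bounds and then recovering
# (normal band assumed to be 0..8 drops per interval):
for v in (3, 9, 40, 4):
    print(v, classify(v, 0, 8))
# 3 clear, 9 warning, 40 critical, 4 clear
```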

This metric corresponds to the packet drops alert we talked about earlier. Let’s see if it’s really the underlying cause of this incident.

[Image: packet drops metric]

If we compare this graph to the graphs of the other alerts, we can see that all the anomalies started as soon as packet drops were detected...

[Image: anomalies starting when packet drops were detected]

And more importantly, all the anomalies were cleared as soon as packet drops were cleared.

[Image: anomalies clearing when packet drops cleared]

This is a pretty good indication that our hypothesis is correct. The packet drop issue seems to be causing all the other alerts.
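
The comparison we just did by eye can be expressed as a simple temporal check: every other anomaly started after the packet-drop anomaly began and cleared around the time it cleared. Here is a hypothetical sketch of that check; the window tuples and tolerance are illustrative assumptions:

```python
def is_likely_root_cause(candidate, others, tolerance=60):
    """Return True if every other anomaly starts after the candidate starts
    and clears within `tolerance` seconds of the candidate clearing.
    Windows are (start_epoch, clear_epoch) tuples; purely illustrative."""
    cand_start, cand_clear = candidate
    return all(
        start >= cand_start and abs(clear - cand_clear) <= tolerance
        for start, clear in others
    )

packet_drops = (1000, 2200)                  # switch anomaly window
other_alerts = [(1010, 2210), (1025, 2190)]  # database and application anomaly windows
print(is_likely_root_cause(packet_drops, other_alerts))  # True
```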

With this figured out, we know exactly who we should talk to about next steps. We’ll notify the team responsible for maintaining our network, and ask them to check on the faulty switch in the Atlanta data center.

Imagine how much time we’ve saved just now by having all the information from separate source systems in one place, grouped together based on relatedness. Now you know how Moogsoft can correlate alerts for you, for a faster mean time to recovery.

Thanks for watching!