Use case walkthrough: Deduplicate events to reduce noise

A busy service with multiple monitors can generate a flood of metrics, anomalies, and events. One issue might trigger a large number of repeat and duplicate events. Moogsoft Cloud analyzes every new piece of data — What is this? When did it happen? What is its severity? How often has it happened before? — and aggregates events for the same issue into alerts. Whenever it adds a new event, Moogsoft updates the alert fields — event count, last event time, severity — so the alert always contains the latest information about the underlying issue. This process removes the duplicate, repeat, and obsolete noise from the data stream.

One of the benefits of implementing Moogsoft is that you can reduce noise and focus on what matters. In this video, we’ll take a look at the noise reduction mechanism, using sample data from the real world.

We’ve extracted this sample data from an actual monitoring environment. The data is anonymized, but the events are real.

4_xls.png

The event data comes from three different sources: CloudWatch, a homegrown monitoring tool, and Sensu. These three tools have been monitoring a SaaS application environment.

5_diagram.png

Here we’ve got about 1700 events. Let’s feed this to Moogsoft and see what happens.

Moogsoft deduplicates events and forms alerts. For example, this event indicates that disk I/O is very high for this.dpcc_p2 server. Then another event arrives to tell you the condition has worsened.

9_2_events.png

These should not be considered separate problems, so Moogsoft deduplicates them into one alert.

10_Alert001.png

Let’s take a closer look. How exactly do we determine an event to be a duplicate?

In short, Moogsoft compares the dedupe_key values of the two events. If they match exactly, the two events are deduplicated into one alert. The dedupe_key value is a combination of the Source, Class, and Check fields in an event. If the incoming event includes the service field, that value is also used.
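The matching rule can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not Moogsoft's implementation: the function name `dedupe_key`, the lowercase field names, the `|` separator, and the sample values are all hypothetical. The only parts taken from the walkthrough are that the key combines Source, Class, and Check, plus service when the event has it, and that two events are duplicates only when their keys match exactly.

```python
from typing import Mapping

# Hypothetical field names; the walkthrough calls them Source, Class, and Check.
KEY_FIELDS = ("source", "class", "check")


def dedupe_key(event: Mapping[str, str]) -> str:
    """Build an illustrative dedupe key from an event's identifying fields."""
    parts = [event[field] for field in KEY_FIELDS]
    # The service value contributes only when the event carries one.
    if event.get("service"):
        parts.append(event["service"])
    # The "|" separator is an assumption made for this sketch.
    return "|".join(parts)


# Made-up sample events: same host and check, worsening severity.
first = {"source": "host-a", "class": "storage", "check": "disk_io", "severity": "warning"}
second = {"source": "host-a", "class": "storage", "check": "disk_io", "severity": "critical"}
other = {"source": "host-b", "class": "storage", "check": "disk_io", "severity": "critical"}

# Severity is not part of the key, so the worsening event still matches exactly.
same_issue = dedupe_key(first) == dedupe_key(second)
# A different source yields a different key, so it would form a separate alert.
different_issue = dedupe_key(first) == dedupe_key(other)
```

Keeping severity out of the key is what lets a warning, a critical, and later a clear event for the same check all land on the same alert.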

18_Reducing_Pager_Fatigue.jpg

In this example, instead of two events, you now have one alert in the critical state. Because of the deduplication, you will not be allocating separate resources to each event, and you will have more context for troubleshooting.

19_Reducing_Pager_Fatigue.jpg

Now consider this case: After the critical event, another event arrives with the same dedupe key value. But this time, the severity value of the event is Clear. Maybe an automated runbook kicked in and addressed the issue.

14_matching.png

Moogsoft dedupes this event into an existing alert. The status of the alert updates to Clear.
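The update behavior described so far can be sketched as follows. This is a simplified model under stated assumptions, not Moogsoft's code: the `Alert` class, the `apply_event` function, the key string, and the integer timestamps are hypothetical, and the sketch assumes the alert simply takes the severity of the most recent event.

```python
from dataclasses import dataclass
from typing import Dict


@dataclass
class Alert:
    dedupe_key: str
    severity: str
    event_count: int
    last_event_time: int


def apply_event(alerts: Dict[str, Alert], key: str, severity: str, time: int) -> Alert:
    """Fold one event into the alert that shares its dedupe key."""
    alert = alerts.get(key)
    if alert is None:
        # First event for this key: open a new alert.
        alert = Alert(key, severity, 1, time)
        alerts[key] = alert
    else:
        # Duplicate event: refresh the existing alert instead of opening a new one.
        alert.event_count += 1
        alert.last_event_time = time
        alert.severity = severity  # assumption: the latest event's severity wins
    return alert


alerts: Dict[str, Alert] = {}
key = "host-a|storage|disk_io"  # hypothetical dedupe key
for severity, time in [("warning", 100), ("critical", 160), ("clear", 220)]:
    apply_event(alerts, key, severity, time)

alert = alerts[key]
```

After the three events, the dictionary holds a single alert whose severity is `clear`, whose event count is 3, and whose last event time is that of the clear event.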

15_Clear_alert.png

Without deduplication, it would take manual correlation to figure out that the issue has resolved itself. Have you had a series of link-flapping events? With Moogsoft, all those up and down events would be consolidated into one alert.

So, after the 1700 events have been processed, here’s what we got. They are deduplicated into 29 alerts.
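To put the reduction in numbers, the short calculation below uses only the two figures quoted in this walkthrough (about 1700 events and 29 alerts). The variable names are illustrative, and because the event count is approximate, the results are approximate too.

```python
events = 1700  # approximate event count from the sample data
alerts = 29    # alerts remaining after deduplication

# Fraction of incoming items that no longer need individual attention.
reduction = (events - alerts) / events
# Average number of events folded into each alert.
events_per_alert = events / alerts

print(f"Noise reduction: {reduction:.1%}")
print(f"Average events per alert: {events_per_alert:.1f}")
```

That works out to a reduction of roughly 98 percent, with each alert standing in for close to 59 events on average.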

1_Reducing_Pager_Fatigue.jpg

This is the power of noise reduction. It helps you direct your focus on what actually matters.

But we don’t stop here. These 29 alerts are processed further before you come in. They are now correlated based on their relatedness.

Here, the 29 alerts are now clustered into 3 incidents.

2_Reducing_Pager_Fatigue.jpg

Moogsoft evaluates alerts for their relatedness, and clusters the related alerts into one incident. To learn more about the correlation mechanism, watch the “Correlation Engine in Moogsoft” video.

3_Reducing_Pager_Fatigue.jpg

Imagine that instead of getting paged 29 times, once for each alert, you now get paged 3 times. No more pager fatigue for your team!

4_Reducing_Pager_Fatigue.jpg

Let’s examine one of the incidents and verify Moogsoft has made a meaningful correlation. Let’s see if we can figure out what’s going on with this incident. Let’s go into the Situation Room. The Situation Room is where you can collaborate with your team on an incident.

5_Reducing_Pager_Fatigue.jpg

Looks like a team is already assigned to this one.

6_Reducing_Pager_Fatigue.jpg

Here, we can see the timeline of activities pertaining to this incident.

7_Reducing_Pager_Fatigue.jpg

We can zoom in to any particular area of interest to filter the activities.

8_Reducing_Pager_Fatigue.jpg

Looks like the group assigned is already reviewing this incident. Let’s do the same!

9_Reducing_Pager_Fatigue.jpg

We’ll look at the alerts in the incident. The incident has 10 alerts, which consist of 169 events.

11_Reducing_Pager_Fatigue.jpg

Let’s see what else we can learn from the alert details. We can see that the dpcc system is the one involved.

12_Reducing_Pager_Fatigue.jpg

We have a monitoring setting to generate events when there is no activity for a prolonged amount of time. The hosts in this HA pair are not generating data...

13_Reducing_Pager_Fatigue.jpg

...and it looks like a core service is down.

14_Reducing_Pager_Fatigue.jpg

The internal message queues are backed up.

15_Reducing_Pager_Fatigue.jpg

Here, it looks like we are having problems with slow database writes. It’s possible that the database problems caused the core processing service to fail.

16_Reducing_Pager_Fatigue.jpg

So, through Moogsoft’s deduplication and correlation functionality, instead of looking at 1700 events to find 169 related events, you are presented with a complete picture from the start. Thanks for watching!