Use case video: DevOps user in Moogsoft

Watch this video on the Moogsoft user experience.

In this video, we will step through the typical workflow of a Moogsoft user as they work through incidents.

Here comes a Slack message, notifying us that there's a critical incident requiring our attention.

Image1_Slack.png

So we log onto Moogsoft and find the incident.

These are the alerts that make up this incident. Some of these are alerts from a monitoring system, and these are alerts Moogsoft generated based on the metrics it is tracking.

We are going to own this incident.

Now we will start our investigation.

These alerts came in within a few seconds of each other. Let’s look at the details of this first alert.

Image2_SimilarAlerts.png

All attributes of this alert are now visible.

Image3_AllAtributes.png

And the alert's metric information is presented visually here.

Image4_VisuallyPresented.png

This way, it’s much easier to grasp how the whole incident unfolded over time.

According to this, the Volume Queue Length metric exceeded its threshold and triggered a warning alert.

Image5_volumequeue.png
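As a rough illustration of the behavior described above, a warning alert fires when a metric crosses its threshold. This is a minimal sketch only; the metric name, threshold value, and alert fields here are illustrative assumptions, not Moogsoft's actual API.

```python
# Hypothetical sketch: fire a warning alert when a metric exceeds its
# warning threshold. Field names and values are illustrative only.

def evaluate_metric(name, value, warning_threshold):
    """Return a warning alert dict if the metric crosses its threshold,
    otherwise None."""
    if value > warning_threshold:
        return {"metric": name, "value": value, "severity": "warning"}
    return None

# A Volume Queue Length of 12.4 against an assumed threshold of 5.0
# would produce a warning alert; a value of 1.0 would not.
alert = evaluate_metric("volume_queue_length", 12.4, warning_threshold=5.0)
```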

Then the CPU usage metric on our front-end server increased and triggered a warning alert. The activity on the back-end server fell, and we are seeing a critical backend connection error alert.

Image6_cpuusage.png

So, could this be the root cause that had a cascading effect, triggering the other alerts?

Image7_cascadingrootcause.png

We’ve confirmed that the disk I/O bottleneck, indicated by the increase in Volume Queue Length, is causing the incident. We have a runbook tool we can use to free up resources.

Image8_incidentcause.png

Currently, the time window we are seeing starts from the moment the first alert in the incident occurred. We want to see what happens to the metrics when we run the tool, so let's change the time frame. Now the metrics will update in real time.

Image9_timeframe.png

We’ve run the tool. With the runaway processes that were overloading I/O killed, the CPU load on the front-end web server is back to normal, and activity has resumed on the back-end server as well.

Nice, the anomaly has resolved, and the metrics are back within the normal range previously learned by the system.

Image10_anamolyresolved.png
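The "normal range previously learned by the system" can be pictured as a band derived from historical samples of the metric, with values outside the band flagged as anomalous. The sketch below assumes a simple mean-plus/minus-three-standard-deviations band; Moogsoft's actual detector is not shown here.

```python
import statistics

# Hypothetical sketch of a learned normal range: the band is
# mean +/- k standard deviations of historical metric samples.
# This illustrates the idea only, not Moogsoft's implementation.

def learn_normal_range(history, k=3.0):
    """Learn a (low, high) band from historical metric values."""
    mean = statistics.fmean(history)
    std = statistics.pstdev(history)
    return (mean - k * std, mean + k * std)

def is_anomalous(value, normal_range):
    """A value outside the learned band counts as an anomaly."""
    low, high = normal_range
    return not (low <= value <= high)

# With steady historical CPU-load samples, a spike to 50 is anomalous,
# while a reading of 10 sits back inside the learned band.
band = learn_normal_range([10, 11, 9, 10, 12, 10, 11])
```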

And the alerts are now fully cleared. Good job!

Image11_fullycleared.png

Now the alerts in our incident are all clear, as is the incident itself. The incident status has been changed to Resolved... and the case is closed!

Image12_incidentresolved.png

Just like that, we have resolved our first incident in Moogsoft. Now it’s your turn to experience this workflow yourself!

Thanks for watching!