
Use case walkthrough: User workflow in Moogsoft Cloud

In this video, we step through the typical workflow of a Moogsoft user as they investigate and resolve an incident.

Here comes a Slack message notifying us that there’s a critical incident requiring our attention.

1_User_Workflow_with_Sit_Room.jpg
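As an aside, a notification like this is typically wired up through a webhook or chat integration. Below is a minimal, hypothetical sketch of posting an incident notification to a Slack incoming webhook; the webhook URL, incident ID, and message text are placeholders for illustration, not Moogsoft’s actual Slack integration.

```python
import json
import urllib.request

# Hypothetical placeholder: a Slack incoming-webhook URL for your alerts channel.
SLACK_WEBHOOK_URL = "https://hooks.slack.com/services/T000/B000/XXXX"

def notify_critical_incident(incident_id: str, description: str) -> None:
    """Post a simple incident notification to Slack via an incoming webhook."""
    payload = {
        "text": f":rotating_light: Critical incident {incident_id}: {description}"
    }
    request = urllib.request.Request(
        SLACK_WEBHOOK_URL,
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )
    with urllib.request.urlopen(request) as response:
        response.read()  # Slack replies with "ok" when the message is accepted

# Illustrative call (the incident ID and text are made up):
# notify_critical_incident("INC-1234", "Back-end connection errors in production")
```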

So we click through to Moogsoft Cloud, which takes us to this incident’s Situation Room.

2_User_Workflow_with_Sit_Room.jpg

The Situation Room is where you and your team collaborate on an incident. This is the timeline for the incident. These sliders let you zoom in on a particular time range, and the list below filters to match the time frame you choose.

3_User_Workflow_with_Sit_Room.jpg

Let’s examine all of the alerts. These are the alerts that make up this incident. Some of them come from an external monitoring system.

4_User_Workflow_with_Sit_Room.jpg

And these are alerts generated by Moogsoft based on the metrics it is tracking.

5_User_Workflow_with_Sit_Room.jpg

We are going to own this incident.

6_User_Workflow_with_Sit_Room.jpg

Now we will start our investigation.

These alerts came in within a few seconds of each other. Let’s look at the details of this first alert.

7_User_Workflow_with_Sit_Room.jpg

Now we can see all of the alert’s attributes.

8_User_Workflow_with_Sit_Room.jpg

And the alert’s metric data is presented visually here.

9_User_Workflow_with_Sit_Room.jpg

Moogsoft shows you the relevant alerts in context and how they relate to each other. This makes it much easier to grasp how the whole incident unfolded over time.

According to this, the Volume Queue Length metric exceeded its threshold and triggered a warning alert.

10_User_Workflow_with_Sit_Room.jpg

Then the CPU usage metric on our front-end server increased and triggered a warning alert...

11_User_Workflow_with_Sit_Room.jpg

...activity on the back-end server fell...

12_User_Workflow_with_Sit_Room.jpg

...and we see a critical back-end connection error alert.

22_User_Workflow_with_Sit_Room.jpg

So, could this be the root cause that cascaded into the other alerts?

13_User_Workflow_with_Sit_Room.jpg

As you can see, looking at the key events in the context of the time-series data gives you an immediate understanding of how the incident unfolded.
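To make the idea concrete, here is a small illustrative sketch of the kind of threshold check that turns a metric datapoint, such as Volume Queue Length, into a warning or critical alert. The threshold values and the Alert structure are assumptions for the example, not Moogsoft’s implementation.

```python
from dataclasses import dataclass
from typing import Optional

# Illustrative thresholds only; real values depend on the metric and the
# baseline the monitoring system has learned for it.
WARNING_THRESHOLD = 8.0    # queue length considered "elevated"
CRITICAL_THRESHOLD = 20.0  # queue length considered "saturated"

@dataclass
class Alert:
    metric: str
    value: float
    severity: str

def evaluate(metric: str, value: float) -> Optional[Alert]:
    """Return an alert when a datapoint crosses a severity threshold."""
    if value >= CRITICAL_THRESHOLD:
        return Alert(metric, value, "critical")
    if value >= WARNING_THRESHOLD:
        return Alert(metric, value, "warning")
    return None  # within the normal range, no alert

# Example: a rising disk queue first produces a warning, then a critical alert.
for sample in (3.2, 9.5, 24.0):
    print(evaluate("volume_queue_length", sample))
```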

We’ve confirmed that the disk I/O bottleneck indicated by the increase in Volume Queue Length is causing the incident. We have a runbook tool we can use to free up resources.

14_User_Workflow_with_Sit_Room.jpg
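For illustration, a resource-cleanup runbook like this might boil down to something similar to the sketch below: sample per-process disk I/O and terminate anything consuming more than an assumed limit. The psutil library, the 50 MB/s threshold, and the termination policy are all assumptions for the example, not the actual runbook tool shown above.

```python
import time

import psutil  # third-party library for process inspection; assumed available

# Hypothetical limit: treat more than 50 MB/s of disk I/O as a runaway process.
IO_BYTES_PER_SEC_LIMIT = 50 * 1024 * 1024

def find_runaway_processes(sample_seconds: float = 2.0):
    """Sample per-process disk I/O (Linux/Windows) and return (process, rate)
    pairs that exceed the limit."""
    before = {}
    for proc in psutil.process_iter(["pid", "name"]):
        try:
            counters = proc.io_counters()
            before[proc.pid] = (proc, counters.read_bytes + counters.write_bytes)
        except (psutil.AccessDenied, psutil.NoSuchProcess):
            continue  # skip processes we cannot inspect

    time.sleep(sample_seconds)

    runaway = []
    for pid, (proc, total_before) in before.items():
        try:
            counters = proc.io_counters()
        except (psutil.AccessDenied, psutil.NoSuchProcess):
            continue
        rate = (counters.read_bytes + counters.write_bytes - total_before) / sample_seconds
        if rate > IO_BYTES_PER_SEC_LIMIT:
            runaway.append((proc, rate))
    return runaway

if __name__ == "__main__":
    for proc, rate in find_runaway_processes():
        print(f"Terminating {proc.info['name']} (pid {proc.pid}), ~{rate / 1e6:.0f} MB/s")
        proc.terminate()  # free the disk I/O the runaway process was consuming
```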

Currently, the time window we are viewing starts from the moment the first alert in the incident occurred. We want to see what happens to the metrics when we run the tool, so let’s change the time frame. Now the metrics will update in real time.

15_User_Workflow_with_Sit_Room.jpg

We’ve run the tool. With the runaway processes that were overloading disk I/O killed, the CPU load on the front-end web server is back to normal...

16_User_Workflow_with_Sit_Room.jpg

...and activity has resumed on the back-end server as well.

17_User_Workflow_with_Sit_Room.jpg

Nice! The anomaly has cleared, and the metrics are back within the normal range the system previously learned. Good job!

18_User_Workflow_with_Sit_Room.jpg
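The “normal range previously learned by the system” refers to a baseline built from each metric’s history. As a rough illustration of that idea, and not Moogsoft’s anomaly-detection algorithm, a learned normal range can be as simple as a rolling mean plus or minus a few standard deviations:

```python
import statistics
from collections import deque

class RollingBaseline:
    """Flags values outside mean +/- k standard deviations of a rolling window.
    The window size and k are illustrative, not learned parameters."""

    def __init__(self, window: int = 60, k: float = 3.0):
        self.values = deque(maxlen=window)
        self.k = k

    def is_anomalous(self, value: float) -> bool:
        if len(self.values) >= 2:
            mean = statistics.fmean(self.values)
            spread = statistics.stdev(self.values)
            anomalous = abs(value - mean) > self.k * spread
        else:
            anomalous = False  # not enough history to judge yet
        self.values.append(value)
        return anomalous

# Example: steady CPU readings build a baseline, then a spike is flagged.
baseline = RollingBaseline()
for reading in (22, 23, 21, 24, 22, 23, 95):
    print(reading, baseline.is_anomalous(reading))
```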

Now the alerts in our incident have all cleared, and so has the incident itself.

19_User_Workflow_with_Sit_Room.jpg

The incident status has been changed to resolved...

20_User_Workflow_with_Sit_Room.jpg

...and the case is closed!

21_User_Workflow_with_Sit_Room.jpg

Just like that, we have resolved our first incident in Moogsoft. Now it’s your turn to experience this workflow yourself!

Thanks for watching!