Key concepts

Before you start to set up Moogsoft Express, it is important to understand the following concepts and the end-to-end data pipeline: the raw data you ingest, the steps you take to enrich and correlate this data, and the final results you see in the web UI.

Figure 1. The Express Data Pipeline

Data Ingestion

The first step in the data pipeline is to ingest monitoring metrics and events of interest from your infrastructure. Express includes a set of inbound integrations that enable you to ingest the following types of data.

Event notifications from monitoring tools

The Events API can ingest event notifications from tools such as AppDynamics, New Relic, and DataDog. Most monitoring tools can post notifications via REST when they detect events such as outages, degradations, and significant spikes or dips in key performance indicators.

Time series metrics

Metric integrations enable you to import time series metrics that show how key performance indicators change over time. Express performs anomaly detection on time series metrics: for each metric, it identifies when that metric enters and exits an anomalous state. These anomalous states often indicate performance issues on the monitored asset.
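The enter/exit behavior described above can be illustrated with a minimal sketch. It assumes a fixed upper bound as the anomaly test, which is a simplification chosen for illustration and not a description of the detection algorithm Express uses. The function name, the `upper_bound` parameter, and the sample data are hypothetical.

```python
def anomaly_transitions(samples, upper_bound):
    """Return (timestamp, "enter" | "exit") pairs for a metric.

    samples is a list of (timestamp, value) pairs in time order.
    A value above upper_bound counts as anomalous (illustrative rule only).
    """
    transitions = []
    anomalous = False  # the metric starts in a normal state
    for timestamp, value in samples:
        now_anomalous = value > upper_bound
        if now_anomalous and not anomalous:
            transitions.append((timestamp, "enter"))
        elif anomalous and not now_anomalous:
            transitions.append((timestamp, "exit"))
        anomalous = now_anomalous
    return transitions


# Hypothetical CPU load samples for one monitored asset
cpu_load = [("12:00", 42), ("12:01", 80), ("12:02", 85), ("12:03", 40)]
transitions = anomaly_transitions(cpu_load, upper_bound=75)
print(transitions)
```

Only the state changes are reported, so the sample at 12:02 produces no output: the metric was already in an anomalous state.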

Express has the following metric integrations:

Collectors can detect performance anomalies directly at the source, before sending the data to Express. You can configure the metric collection and anomaly detection behavior on individual metrics as needed. You can also choose to collect metrics only, anomalies only, or both.

Events to alerts

Express converts each new event notification and metric anomaly into a generic JSON object that describes a specific performance-related event: what happened, when it happened, where it happened, and so on.

The next step is to add each new event to an alert. An event describes one specific occurrence; an alert describes a set of events that all relate to the same issue. For example, an alert might be "High CPU load on server 23" and consist of the following events:

  1. server 23, 12:00: CPU load = 72%

  2. server 23, 12:01: CPU load = 80%

  3. server 23, 12:02: CPU load = 67%

Express adds new events to alerts as follows:

  1. A new JSON event object arrives at the ingestion endpoint.

  2. Express generates a dedupe key for the event based on the source, service, and check fields in the event.

  3. Express compares the new event with each open alert using the dedupe key.

    • If the dedupe key matches an open alert, Express increments the alert's event count field and updates the severity field based on the new event.

    • If the dedupe key does not match any open alert, Express creates a new alert and adds the event to it.
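The steps above can be sketched in a few lines. This is an illustration under stated assumptions, not the Express implementation: the key is modeled as a tuple of the source, service, and check fields named above, the alert field names (`event_count`, `severity`, `events`) are hypothetical, and the severity update simply takes the severity of the latest event, which the text does not specify.

```python
def dedupe_key(event):
    """Build a key from the source, service, and check fields."""
    return (event["source"], event["service"], event["check"])


def add_event(open_alerts, event):
    """Add an event to a matching open alert, or create a new alert.

    open_alerts maps dedupe keys to alert dicts (hypothetical structure).
    """
    key = dedupe_key(event)
    alert = open_alerts.get(key)
    if alert is None:
        # No open alert matches: create a new one for this key
        alert = {"dedupe_key": key, "event_count": 0, "severity": None, "events": []}
        open_alerts[key] = alert
    alert["event_count"] += 1
    # Assumption: the alert takes the severity of the most recent event
    alert["severity"] = event["severity"]
    alert["events"].append(event)
    return alert


open_alerts = {}
events = [
    {"source": "server23", "service": "os", "check": "cpu", "severity": "warning"},
    {"source": "server23", "service": "os", "check": "cpu", "severity": "critical"},
    {"source": "server23", "service": "os", "check": "memory", "severity": "warning"},
]
for event in events:
    add_event(open_alerts, event)
```

After this run, the two CPU events share one alert with an event count of 2, and the memory event has its own alert.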

See also Event deduplication: how-to and best practices

Data enrichment

In some cases, your monitoring goals might require more information than the raw data contains. Data enrichment is the process of adding custom data to alert descriptions. Enrichment has the following benefits:

  • Customize Alert Correlation

    You might want to cluster alerts into incidents (described below) using criteria that are not included in the raw data you ingest into Express. For example, you might want to specify that a specific node corresponds to a specific app or service, or is managed by a specific operations team. You can specify an enrichment to map the app, service, or team to the specific node and then use this custom data in your correlation profiles.

  • Provide critical reporting data to make alerts and incidents more useful and understandable.

To set up an enrichment, you specify your enrichment data in a CSV file, upload it to Express, and map your custom fields to the existing alert fields in the web UI. Once you set up an enrichment, Express adds the custom data whenever it creates a new, matching alert.
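Conceptually, an enrichment behaves like a lookup table keyed on an existing alert field. The following sketch shows that idea only; in Express you configure the mapping in the web UI, not in code. The CSV columns (`node`, `service`, `team`), the function names, and the alert structure are all hypothetical.

```python
import csv
import io

# Hypothetical enrichment data: one row per node
CSV_TEXT = """node,service,team
webServB23,myApp,web-ops
dbServA1,billing,db-ops
"""


def load_enrichment(csv_text, key_column):
    """Index CSV rows by key_column; the remaining columns are the custom data."""
    reader = csv.DictReader(io.StringIO(csv_text))
    return {
        row[key_column]: {k: v for k, v in row.items() if k != key_column}
        for row in reader
    }


def enrich(alert, table, alert_field):
    """Return a copy of the alert with custom data added when a row matches."""
    extra = table.get(alert.get(alert_field), {})
    # Existing alert fields win over enrichment values with the same name
    return {**extra, **alert}


table = load_enrichment(CSV_TEXT, key_column="node")
new_alert = {"source": "webServB23", "check": "cpu"}
enriched = enrich(new_alert, table, alert_field="source")
```

Here the `source` field of the alert is matched against the `node` column, so the alert for webServB23 gains the service and team values that a correlation profile could then use.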

Correlation: alerts to incidents

Correlation is the process of clustering related alerts into actionable incidents. In real-world terms, an incident is a cluster of alerts that all relate to the same issue. For example, an incident might consist of the following alerts:

  1. webServB23, 09:00, spike in incoming requests to myApp

  2. webServB23, 09:01, spike in CPU load

  3. webServB23, 09:01, dip in available memory

  4. webServB23, 09:02, spike in response times for myApp

Express includes a Correlation Engine that clusters alerts based on correlated data fields in different alerts: source nodes, times, alert types, custom tags, and so on. You can define different profiles to specify the correlation and clustering behavior you want. Each profile specifies the following:

  • A filter that specifies the set of alerts to consider for correlation

  • The specific data fields to compare, and the degree of similarity between fields, to determine if two alerts are correlated

  • The correlation time period, that is, the maximum time window between two alerts for them to be considered correlated

  • A description based on the fields and other data of interest in the component alerts

A common practice is to create a profile that clusters alerts by source node, so each resulting incident relates to a specific node within a specific time window (within the maximum correlation time period). You can create more customized profiles to cluster alerts based on alert type, application, service, location, ops team, and so on. You can use any common data fields to cluster your alerts.
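The source-node profile described above can be sketched as follows. This is a simplified illustration, not the Correlation Engine itself: it compares a single field for exact equality, measures the time window between consecutive alerts from the same node, and uses hypothetical field names (`source`, `time` in seconds) and a hypothetical function name.

```python
def correlate_by_source(alerts, max_window_seconds):
    """Cluster alerts into incidents by source node within a time window."""
    incidents = []
    latest_incident = {}  # source node -> most recent incident for that node
    for alert in sorted(alerts, key=lambda a: a["time"]):
        incident = latest_incident.get(alert["source"])
        if (
            incident is not None
            and alert["time"] - incident["alerts"][-1]["time"] <= max_window_seconds
        ):
            # Same node and close enough in time: join the existing incident
            incident["alerts"].append(alert)
        else:
            # New node, or the window has elapsed: start a new incident
            incident = {"source": alert["source"], "alerts": [alert]}
            incidents.append(incident)
            latest_incident[alert["source"]] = incident
    return incidents


# Hypothetical alerts; time is seconds since 09:00
alerts = [
    {"source": "webServB23", "time": 0, "description": "spike in incoming requests to myApp"},
    {"source": "dbServA1", "time": 30, "description": "dip in available disk"},
    {"source": "webServB23", "time": 60, "description": "spike in CPU load"},
    {"source": "webServB23", "time": 60, "description": "dip in available memory"},
    {"source": "webServB23", "time": 120, "description": "spike in response times for myApp"},
    {"source": "webServB23", "time": 4000, "description": "spike in CPU load"},
]
incidents = correlate_by_source(alerts, max_window_seconds=900)
```

With a 900-second window, the first four webServB23 alerts form one incident, the dbServA1 alert forms a second, and the late webServB23 alert starts a third because it falls outside the window.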