Deduplicate events to reduce noise

Watch a use case walkthrough: Reducing Pager Fatigue in Incident Management

Deduplication is the process of identifying repeated events and combining them into alerts. This reduces operational noise by limiting the number of alerts in the system.

How it works: the deduplication key

When APEX AIOps Incident Management ingests a raw event, it compares the new event with all open alerts. Does an open alert already describe the same issue, on the same node and service?

  • If there is a matching alert, Incident Management flags the new event as a duplicate and updates the alert.

  • If there is no matching alert, Incident Management creates a new alert.

Each incoming event has a deduplication_key field. By default, Incident Management autogenerates this key based on the source, service, and check fields in the event itself. (If an event includes a class field, the key includes it as well.) This key defines the context shared by all events that belong to the same alert.
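
The default key derivation can be pictured with a short Python sketch. This is an illustration only, not the product's implementation: the function name build_dedup_key, the "|" separator, and the dictionary representation of an event are all assumptions, and the actual format of the autogenerated key is not described here.

```python
def build_dedup_key(event: dict) -> str:
    """Hypothetical sketch: derive a deduplication key from an event.

    Uses the source, service, and check fields, plus class when the
    event includes it. The "|" separator is an arbitrary choice.
    """
    parts = [str(event["source"]), str(event["service"]), str(event["check"])]
    # class contributes to the key only if the event carries it
    if event.get("class") is not None:
        parts.append(str(event["class"]))
    return "|".join(parts)


# Two events that differ only in severity and description share one key.
first = {"source": "server 23", "service": "db-query-svc",
         "check": "response-time", "severity": "minor",
         "description": "db-query-response-time > 400 ms"}
second = {**first, "severity": "major",
          "description": "db-query-response-time > 600 ms"}
same_key = build_dedup_key(first) == build_dedup_key(second)
```

Fields such as severity, description, and timestamps are deliberately left out of the key, which is why events that differ only in those fields map to the same alert.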

An example

Suppose a series of events all describe response times for the same microservice on the same host. Although their timestamps, descriptions, and severities differ, all these events have the same key: source = server 23, service = db-query-svc, check = response-time.

The following sequence illustrates how Incident Management deduplicates these events:

  1. The first event arrives with description = “db-query-response-time > 400 ms” and severity = minor.

  2. Incident Management compares this event against all open alerts. There is no open alert with the same key. Incident Management creates a new alert based on the new event.

  3. The second event arrives with description = “db-query-response-time > 600 ms” and severity = major.

  4. Incident Management compares the new event with the alert it just created. The event and alert have the same key. It updates the alert fields with the new event information:

    • Event count = 2 (was 1)

    • Last event time = new event time (was previous event time)

    • Severity = major (was previous event severity)

    • Description = “db-query-response-time > 600 ms” (was previous event description)

  5. The response time remains high and several more events arrive with varying levels of severity. With each event, Incident Management updates the alert as described previously.

  6. Finally, the response time falls to within acceptable levels. An event arrives with description = “db-query-response-time < 200 ms” and severity = clear.

  7. Incident Management updates the alert as described previously. Because the status of this alert is now clear, any new events with the same deduplication key get added to a new alert.
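
The sequence above can be approximated with a small Python simulation. It is a sketch under stated assumptions, not how Incident Management is implemented: the Alert class, the ingest function, the integer timestamps, and the in-memory dictionary of open alerts are all hypothetical names and structures chosen for illustration.

```python
from dataclasses import dataclass


@dataclass
class Alert:
    """Hypothetical alert record holding the fields updated in step 4."""
    key: tuple
    event_count: int
    last_event_time: int
    severity: str
    description: str


def dedup_key(event: dict) -> tuple:
    # Assumed key shape: source, service, check, and class if present.
    return (event["source"], event["service"], event["check"], event.get("class"))


def ingest(event: dict, open_alerts: dict, cleared_alerts: list) -> Alert:
    """Update the matching open alert, or create a new one if none matches."""
    key = dedup_key(event)
    alert = open_alerts.get(key)
    if alert is None:
        # No open alert has this key: create one from the event.
        alert = Alert(key, 1, event["time"], event["severity"], event["description"])
        open_alerts[key] = alert
    else:
        # Duplicate: fold the event into the existing alert.
        alert.event_count += 1
        alert.last_event_time = event["time"]
        alert.severity = event["severity"]
        alert.description = event["description"]
    if alert.severity == "clear":
        # A cleared alert stops matching, so later events start a new alert.
        cleared_alerts.append(open_alerts.pop(key))
    return alert


base = {"source": "server 23", "service": "db-query-svc", "check": "response-time"}
events = [
    {**base, "time": 1, "severity": "minor", "description": "db-query-response-time > 400 ms"},
    {**base, "time": 2, "severity": "major", "description": "db-query-response-time > 600 ms"},
    {**base, "time": 3, "severity": "clear", "description": "db-query-response-time < 200 ms"},
    {**base, "time": 4, "severity": "minor", "description": "db-query-response-time > 400 ms"},
]
open_alerts, cleared_alerts = {}, []
results = [ingest(e, open_alerts, cleared_alerts) for e in events]
```

In this run the first three events fold into a single alert, and the fourth event, which arrives after the clear, opens a second alert with an event count of 1.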