Example: Implementation planning
This topic steps through the implementation planning process with a sample use case.
What actionable incidents do you want?
Jane is a DevOps engineer in a small e-commerce company. She works with an app development team and an IT Operations team. Currently these teams are highly siloed. Jane wants to increase visibility between these teams, help them collaborate more effectively, and streamline the process for investigating and resolving incidents.
The two teams use a variety of tools to monitor a set of mission-critical apps and the infrastructure that supports them. Jane wants to produce incidents for specific services such as login, db-query, user-management, and so on. Each incident needs to highlight a specific service issue or an infrastructure element that affects, or is affected by, a service.
APEX AIOps Incident Management ingests raw events, enriches them, and then converts them to alerts. The Correlation Engine then groups related alerts into actionable incidents. An actionable incident:
Includes all data relevant to one specific issue or problem.
Excludes duplicate data and irrelevant noise.
Combines data from all relevant sources – such as the application, server, cloud, and container.
Enables your teams to collaborate, investigate, troubleshoot, and resolve issues efficiently.
Consider what kind of actionable incidents would be useful in the context of your environment, organization, and end users:
Who are your end users/teams? IT ops engineers, developers, security, cloud, support, QA?
How do teams collaborate to resolve issues that arise?
What are some common performance issues that your users have encountered in the past?
What kind of data do they use to monitor performance and resolve issues? Examples: application/service alerts, cloud metrics, server metrics, container metrics, and so on.
How can you cluster your data into the incidents you want?
Jane wants to group alerts based on the service running on specific nodes, so she decides to cluster her alerts around the service field. All alerts for the same service that arrive within the same time window are grouped into one incident.
To cluster alerts into actionable incidents, you need to identify the data fields you want to use to determine whether two alerts are related. Here are some fields you might want to consider:
source — Useful for clustering alerts by host, node, pod, and so on.
service — Useful for clustering alerts by a common app or service.
location — Useful for clustering alerts by physical or virtual location.
Custom tags — You can enrich your data with custom tags and use these tags to generate incidents. The only requirement is that the values are formatted consistently.
You can cluster based on multiple fields. For example, you might want to cluster Docker alerts by EC2 instance. In this case, you would cluster around the source and manager fields. See Best practices for defining correlations.
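To make the clustering behavior concrete, here is a minimal sketch of the idea: group alerts that share a field value (service in this case) and arrive within one time window. The field name, window length, and sample alerts are illustrative assumptions; this is not the Correlation Engine itself.

```python
from collections import defaultdict

# Hypothetical alerts; only the fields used for clustering matter here.
alerts = [
    {"service": "login",    "source": "node-01", "time": 0},
    {"service": "login",    "source": "node-02", "time": 300},
    {"service": "db-query", "source": "node-07", "time": 320},
    {"service": "login",    "source": "node-03", "time": 2400},  # outside the first window
]

WINDOW_SECONDS = 1800  # assumed 30-minute correlation window


def cluster_by_field(alerts, field="service", window=WINDOW_SECONDS):
    """Group alerts that share the same field value and arrive within `window`
    seconds of the first alert in the group. A toy stand-in for the behavior
    described above."""
    incidents = defaultdict(list)  # (field value, window start) -> alerts
    window_start = {}              # field value -> start time of the open window
    for alert in sorted(alerts, key=lambda a: a["time"]):
        key = alert[field]
        start = window_start.get(key)
        if start is None or alert["time"] - start > window:
            start = alert["time"]  # open a new incident window for this value
            window_start[key] = start
        incidents[(key, start)].append(alert)
    return incidents


for (service, start), grouped in cluster_by_field(alerts).items():
    print(f"Incident for service={service!r}: {len(grouped)} alert(s)")
```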
What data do you currently have?
Jane does an inventory of the monitoring data used in her environment. Her teams use Amazon CloudWatch to monitor their cloud infrastructure, an open-source tool to monitor their apps and services, and Incident Management Collectors to monitor their servers, network, containers, databases, and other infrastructure elements.
Jane finds the following issues with her data:
The Collector and CloudWatch do not include a service field in their data. She needs to include the service in every alert.
Different tools generate events with different formats. For example, she has events with dns and hostname fields rather than a source field.
Now that you know your goal, evaluate your current monitoring environment and the data it generates:
What monitoring tools does your organization use?
Make sure you include the APEX AIOps Incident Management Collector in your inventory of available data. The Collector makes it easy to collect time series metrics and identify performance anomalies throughout your environment.
What data do your teams use to investigate, troubleshoot, and resolve issues?
Identify the data streams, and the specific data fields in these streams, that are useful and relevant.
Examine the specific fields in the data you are currently collecting and compare them to the fields in the Incident Management event schema:
Which data fields are missing from your monitoring data?
Which data fields need to be mapped to Incident Management events?
What additional information do you need? Is your incoming data consistently formatted?
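One way to answer these questions is a quick audit of a sample of raw events against the fields you want every alert to carry. The target field list and sample events below are assumptions for illustration, not the actual Incident Management event schema.

```python
# Hypothetical audit: which of the fields we want are missing or inconsistently
# named across a sample of raw events from different tools?
required_fields = {"source", "service", "class", "description"}  # assumed target fields

sample_events = [
    {"hostname": "web-01", "description": "CPU high"},                         # server tool
    {"dns": "db-01.internal", "class": "database", "description": "slow query"},
    {"source": "pod-42", "service": "login", "description": "5xx rate"},
]

for i, event in enumerate(sample_events, start=1):
    missing = sorted(f for f in required_fields if f not in event)
    print(f"event {i}: missing fields -> {missing or 'none'}")
```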
To resolve the issues she found with her data, Jane does the following (see the sketch after this list):
Creates a custom integration that maps the non-Incident Management data fields in her raw data to their Incident Management equivalents.
Creates a CSV of all the monitored sources in her environment. Each row specifies one monitored source. The other columns in each row specify the service, the version number, the dev team that supports the service, and other useful information.
Creates a simple automated workflow that maps the relevant CSV data to each incoming event.
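Here is a rough sketch of that approach, using hypothetical field names, an in-memory stand-in for the CSV catalog, and plain Python in place of the integration and workflow features.

```python
import csv
import io

# Hypothetical stand-in for Jane's CSV catalog: one row per monitored source.
catalog_csv = """source,service,version,team
web-01,login,2.3,app-team
db-01,db-query,5.1,platform-team
"""
catalog = {row["source"]: row for row in csv.DictReader(io.StringIO(catalog_csv))}

# Hypothetical mapping from tool-specific field names to the target field names.
FIELD_MAP = {"hostname": "source", "dns": "source"}


def normalize(raw_event):
    """Rename tool-specific fields to the target names (the custom-integration step)."""
    return {FIELD_MAP.get(k, k): v for k, v in raw_event.items()}


def enrich(event):
    """Attach service, version, and team details from the catalog (the workflow step)."""
    extra = catalog.get(event.get("source"), {})
    return {**extra, **event}  # values already on the event win


raw = {"hostname": "web-01", "description": "CPU high"}
print(enrich(normalize(raw)))  # event now carries source, service, version, and team
```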
For best results, your incoming data should include all the relevant information to analyze, troubleshoot, and resolve your actionable incidents. The relevant data fields should be formatted consistently across all data streams.
Incident Management includes the following features for mapping, enriching, and formatting your raw data:
Create your own integration (CYOI)
The Incident Management event schema is highly flexible and generic. Custom integrations enable you to map your current data fields to their Incident Management equivalents. For example, you can map a hostname field to the equivalent source field.
Use workflows to enrich event data
You can create automated workflows to enrich and normalize your raw data after ingestion. Workflows make your incidents more actionable, targeted, and useful.
If your data is missing relevant information, you can enrich your sources with data from an external catalog.
For example, you can create a catalog with information about the services running on specific nodes. The Query Catalog action can map this information to each relevant event automatically.
If your data is formatted inconsistently, you can normalize specific fields so that all events use the same format.
For example, suppose different data streams define the class field differently. The Match and Update action can update a field to use the same format across all events.
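As a plain-Python illustration of that kind of normalization (not the workflow engine or the Match and Update action itself), a lookup table can map differing class values onto one consistent format. The alias values below are assumptions.

```python
# Hypothetical normalization of the class field so all streams use one format.
CLASS_ALIASES = {
    "db": "database",
    "DB": "database",
    "rdbms": "database",
    "vm": "virtual-machine",
}


def normalize_class(event):
    """Rewrite the class field to a single, consistent value."""
    value = event.get("class", "")
    event["class"] = CLASS_ALIASES.get(value, value).lower()
    return event


print(normalize_class({"source": "db-01", "class": "DB"}))        # class -> 'database'
print(normalize_class({"source": "web-01", "class": "Compute"}))  # class -> 'compute'
```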
Incident Management automatically identifies and eliminates duplicate events. Optionally, you can customize the default deduplication behavior.
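For illustration only, deduplication boils down to dropping events that repeat the same identifying fields. The key used in this sketch is an assumption, not the product's actual deduplication logic.

```python
def deduplicate(events, key_fields=("source", "description")):
    """Keep only the first event for each (assumed) identifying key."""
    seen = set()
    unique = []
    for event in events:
        key = tuple(event.get(f) for f in key_fields)
        if key not in seen:
            seen.add(key)
            unique.append(event)
    return unique


events = [
    {"source": "web-01", "description": "CPU high"},
    {"source": "web-01", "description": "CPU high"},   # duplicate, dropped
    {"source": "web-01", "description": "disk full"},
]
print(len(deduplicate(events)))  # 2
```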
Incident Management automatically detects performance anomalies and creates an event for each anomaly. Optionally, you can customize the default anomaly-detection behavior for individual metrics.
How should you define your correlations?
Jane sets up her custom integration and workflow. She reviews her alerts and sees that they include the data she expects. The only remaining step is to define the correlation behavior she wants: if two or more alerts have the same service field, cluster them into the same incident.
To define your correlations, go to the Incident Management UI and navigate to Correlate & Automate > Correlation Engine. Each definition specifies:
The scope of alerts to correlate on.
The fields to correlate on and the required degree of similarity between fields (100% requires an exact match).
The incident description.
The maximum correlation time window, from the arrival of the first alert in the incident.
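To tie these elements together, here is a conceptual sketch of one correlation definition and a similarity check between two alerts. The structure, field names, scope filter, and window length are assumptions for illustration, not the product's configuration format.

```python
from difflib import SequenceMatcher

# A conceptual representation of one correlation definition, mirroring the
# elements listed above.
definition = {
    "scope": lambda alert: alert.get("class") != "maintenance",  # assumed scope filter
    "fields": ["service"],        # correlate on the service field
    "similarity": 1.0,            # 1.0 (100%) requires an exact match
    "description": "Issue affecting service {service}",
    "window_seconds": 1800,       # assumed 30-minute window from the first alert
}


def fields_match(a, b, definition):
    """Return True if two in-scope alerts are similar enough to correlate."""
    if not (definition["scope"](a) and definition["scope"](b)):
        return False
    for field in definition["fields"]:
        ratio = SequenceMatcher(None, str(a.get(field, "")), str(b.get(field, ""))).ratio()
        if ratio < definition["similarity"]:
            return False
    return True


a = {"service": "login", "class": "application"}
b = {"service": "login", "class": "container"}
print(fields_match(a, b, definition))  # True: same service, both in scope
```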