Frequently asked questions

Q. What is the time granularity (data collection rate) for APEX AIOps Incident Management data?

A. For the integrations where APEX AIOps Incident Management actively collects data, the collection rates are:

APEX AIOps Incident Management Collector — Every 1 minute or every 5 minutes, depending on how you configure it.
Amazon CloudWatch integration — Every 1 minute for both metrics and alarms.
Datadog integration — The metric polling rate is 1 minute (rate limited) and the event polling rate is 30 seconds (not rate limited).

For integrations that push data to Incident Management, the data source can push data as frequently as needed. In most cases you can send data every minute. Sending data at higher rates, such as every second, is generally not recommended.

Q. Does the collector compress and encrypt the data that it sends to Incident Management?

A. Yes. The collector compresses data using gzip and sends all data over HTTPS.
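
To illustrate what gzip compression does to a typical batch of data points, here is a minimal Python sketch. It is not the collector's actual code: the payload field names are hypothetical, and the sketch only shows the compression round trip. A sender would normally transmit the compressed bytes over HTTPS with a Content-Encoding: gzip header.

```python
import gzip
import json


def compress_payload(records):
    """Serialize records to JSON and gzip the result."""
    raw = json.dumps(records).encode("utf-8")
    return gzip.compress(raw)


def decompress_payload(blob):
    """Reverse compress_payload; useful for verifying a round trip."""
    return json.loads(gzip.decompress(blob).decode("utf-8"))


# Hypothetical sample batch; the field names are illustrative only.
batch = [
    {"metric": "cpu_load", "value": 0.42, "time": 1700000000 + i}
    for i in range(100)
]
blob = compress_payload(batch)
print(len(json.dumps(batch)), "bytes raw,", len(blob), "bytes gzipped")
```

Repetitive time series batches such as this one tend to compress well, which is why compressing before sending reduces network usage.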

Q. Can I enrich my raw data with additional information after ingestion?

A. Yes. You can define event workflows that enrich and process your data immediately after ingestion. Enrichment is strongly recommended and has the following benefits:

You can fine-tune how Incident Management clusters your alerts into incidents.
You can make your alerts more informative and readable.
You can normalize events that come from different sources and have different formats.
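
The following Python sketch illustrates the idea behind normalization and enrichment. Event workflows are configured in Incident Management itself, not written in Python, so treat this only as a conceptual model. The field names (hostname, host, message, service) and both lookup tables are hypothetical.

```python
# Hypothetical lookup table mapping hosts to the services they run.
SERVICE_BY_HOST = {"web-01": "checkout", "db-01": "orders-db"}

# Hypothetical mapping from source-specific severity labels to one scale.
SEVERITY_MAP = {
    "crit": "critical",
    "CRITICAL": "critical",
    "warn": "warning",
    "WARNING": "warning",
}


def normalize_and_enrich(event):
    """Return a new event with consistent field names plus a service field."""
    result = {
        # Normalize: sources name the host and text fields differently.
        "source": event.get("hostname") or event.get("host") or "unknown",
        "severity": SEVERITY_MAP.get(event.get("severity", ""), "unknown"),
        "description": event.get("message") or event.get("description") or "",
    }
    # Enrich: add the service that runs on the host, if known.
    result["service"] = SERVICE_BY_HOST.get(result["source"], "unassigned")
    return result


# Two events from different hypothetical sources end up with the same shape.
print(normalize_and_enrich({"hostname": "web-01", "severity": "crit", "message": "disk full"}))
print(normalize_and_enrich({"host": "db-01", "severity": "WARNING", "description": "slow query"}))
```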

Q. How long does APEX AIOps retain customer data?

A. Incident Management data is retained according to the following policies:

Alert and incident data

Alert and incident data is stored in Elasticsearch (ES) in quarterly indexes. Each index name includes a quarter identifier, as in the following examples:

test-instance-22q1-incidents
test-instance-22q2-incidents
test-instance-22q3-incidents
test-instance-22q4-incidents
test-instance-23q1-incidents

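The current strategy for removing old data is to wait until the same quarter of the following year (23q1 in this example) is over before cleaning up the data in 22q1. This means that data is currently retained for up to 15 months: data written at the start of a quarter is kept for about 15 months, and data written at the end of a quarter for about 12 months.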
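
The quarterly arithmetic can be sketched in Python as follows. This is an illustration under two assumptions: that a record's index is chosen from its date, and that cleanup runs on the first day after the same quarter of the following year ends. The function names are hypothetical and the actual cleanup schedule may differ.

```python
from datetime import date


def quarter_index(instance, day):
    """Build an index name such as 'test-instance-22q1-incidents'."""
    quarter = (day.month - 1) // 3 + 1
    return f"{instance}-{day.year % 100:02d}q{quarter}-incidents"


def earliest_purge_date(day):
    """First day after the same quarter of the following year has ended."""
    quarter = (day.month - 1) // 3 + 1
    year = day.year + 1
    if quarter == 4:
        # Q4 of the following year ends on December 31.
        return date(year + 1, 1, 1)
    return date(year, quarter * 3 + 1, 1)


record_day = date(2022, 2, 14)
print(quarter_index("test-instance", record_day))  # test-instance-22q1-incidents
print(earliest_purge_date(record_day))             # 2023-04-01
```

Under these assumptions, a record written on January 1, 2022 is kept for 15 months, while one written on March 31, 2022 is kept for just over 12 months.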

Metrics data

Metrics data (the individual data points and the metadata about metrics themselves) is stored in Thanos. This data is aggregated over time and includes the following configuration:

retentionResolutionRaw: 30d
retentionResolution5m: 120d
retentionResolution1h: 1y

To better understand these retention configurations and how they work, review the Thanos documentation.
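
As a rough guide to what these settings mean in practice, the following Python sketch reports which resolutions should still be retained for data of a given age. It is a simplification: it treats 1y as 365 days and ignores Thanos details such as block boundaries and the delay before downsampled data is first produced. The function name is hypothetical.

```python
# Retention limits from the configuration above, expressed in days.
RETENTION_DAYS = {"raw": 30, "5m": 120, "1h": 365}


def retained_resolutions(age_days):
    """Return the resolutions whose retention window still covers age_days."""
    return [res for res, limit in RETENTION_DAYS.items() if age_days <= limit]


for age in (10, 45, 200, 400):
    print(age, "days old:", retained_resolutions(age))
```

For example, data that is 45 days old is no longer available at raw resolution but should still be available at 5-minute and 1-hour resolution.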

Configuration items

There is no automatic removal of configuration items. These are retained forever. Examples include workflows, webhook configurations, and correlation definitions.

Audit logs

Audit logs are stored for all configuration items and user-initiated actions on alerts and incidents. There is no automatic removal of these logs, which are accessed via their own API. Therefore, audit logs are not impacted by the purging of alert and incident records in line with the data retention rules.

Q. How long is data retained in the UI?

A. Incident data is accessible in the UI for 30 days. You can access incident data via the API for up to 15 months (see the previous answer).

Note that Auto-Close settings do not affect this limit.

Q. Does Incident Management update metrics and events after ingestion?

A. Incident Management ingests raw events and events generated from metric anomalies. You can enrich and normalize events at ingestion using event workflows. For example, you can enrich events with information such as the apps or services that generated specific events. You can also process events from different sources so that all your event fields are formatted consistently.

Once your workflows finish processing events, Incident Management deduplicates the events into alerts. When it adds a new event to an alert, Incident Management updates the alert with the latest information from the event. Thus you can think of events as hard-coded snapshots of an issue, while alerts get updated with each new event.
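
The relationship between events and alerts can be modeled with a short Python sketch. This is a conceptual illustration, not the product's deduplication logic: the choice of source and check as the deduplication key, and the event_count and first_event_time fields, are assumptions made for the example.

```python
def deduplicate(events, key_fields=("source", "check")):
    """Fold a stream of events into alerts keyed by key_fields."""
    alerts = {}
    for event in events:
        key = tuple(event.get(field) for field in key_fields)
        alert = alerts.get(key)
        if alert is None:
            # First event for this key: create the alert from a copy.
            alerts[key] = {**event, "event_count": 1, "first_event_time": event["time"]}
        else:
            # Later event: the alert takes on the latest event's values.
            alert.update(event)
            alert["event_count"] += 1
    return list(alerts.values())


events = [
    {"source": "web-01", "check": "cpu", "severity": "warning", "time": 100},
    {"source": "web-01", "check": "cpu", "severity": "critical", "time": 160},
    {"source": "db-01", "check": "disk", "severity": "warning", "time": 170},
]
alerts = deduplicate(events)
print(alerts)
```

Here the two web-01 events collapse into one alert whose severity reflects the latest event, while the original events remain unchanged.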

Q. Does Incident Management update alerts or incidents automatically?

A. Incident Management updates incidents and alerts according to settings defined by Auto-Close Policies. A default policy is included, and it automatically closes and resolves alerts and incidents as follows:

Changes the alert status to Closed from any state after 72 hours.
Changes the alert status to Closed 30 minutes after it is set to Resolved.
Changes the incident status to Closed 60 minutes after it is set to Resolved, or when all alerts in that incident are closed (effectively resolving it).
Changes the incident status from any state to Closed after 7 days.

You can modify these defaults or add new policies to meet your organizational needs.
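
The default rules above can be expressed as a small Python sketch. It assumes that the 72-hour and 7-day limits are measured from creation time, which the policy description does not specify, and the function and parameter names are hypothetical. It is meant only to make the rule interactions concrete.

```python
from datetime import datetime, timedelta

# Default policy values from the list above.
ALERT_MAX_AGE = timedelta(hours=72)
ALERT_RESOLVED_GRACE = timedelta(minutes=30)
INCIDENT_MAX_AGE = timedelta(days=7)
INCIDENT_RESOLVED_GRACE = timedelta(minutes=60)


def alert_should_close(created, resolved_at, now):
    """True if either default alert rule applies (age assumed from creation)."""
    if resolved_at is not None and now - resolved_at >= ALERT_RESOLVED_GRACE:
        return True
    return now - created >= ALERT_MAX_AGE


def incident_should_close(created, resolved_at, alert_statuses, now):
    """True if any default incident rule applies (age assumed from creation)."""
    if alert_statuses and all(status == "Closed" for status in alert_statuses):
        return True
    if resolved_at is not None and now - resolved_at >= INCIDENT_RESOLVED_GRACE:
        return True
    return now - created >= INCIDENT_MAX_AGE


created = datetime(2024, 1, 1)
print(alert_should_close(created, None, created + timedelta(hours=80)))
print(incident_should_close(created, None, ["Open", "Closed"], created + timedelta(days=2)))
```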

Q. Which browsers does Incident Management support?

A. Incident Management supports the latest versions of Chrome, Firefox, and Edge.

Q. What metric features does Incident Management have that make it different from other products?

A. Incident Management includes the following metric features:

Metrics API
You can send metrics from all your monitoring services to one endpoint. The metric schema is highly flexible, with a few required fields, several more optional fields, and a tags field for custom information.
Collectors
Collectors are lightweight, easy-to-install agents that collect time series metrics and events on Linux, macOS, and Windows servers and send the data to Incident Management.
Predefined anomaly detection
Incident Management detects anomalies by default and without configuration. You don't need to define thresholds or other parameters.
Customizable anomaly detection for individual metrics
You can customize how Incident Management detects anomalies for individual metrics with special characteristics. For example, you might want to fine-tune the anomaly-detection logic for metrics with very large or very small data ranges.