Classic Sigaliser Moolet
Warning
Support for Classic Sigaliser is deprecated as of the release of Moogsoft AIOps v7.3.0. If you want to use time-based clustering, see Tempus. For other clustering, see Cookbook.
The Sigaliser Moolet, also known as Classic Sigaliser, is the stage of event processing where alert streams from the Alert Builder or the Alert Rules Engine are converted into Situations.
The Sigaliser is self-contained and has no Moobot. It takes every occurrence of an event in an alert stream and uses matrix factorization algorithms to identify clusters of temporally correlated alerts, which indicate underlying service outages, or Situations. The Sigaliser then updates its own internal record of Situations and the Moogsoft AIOps database before publishing updates on the Message Bus.
You can configure and tune Classic Sigaliser by editing the parameters in the $MOOGSOFT_HOME/config/moolets/sigaliser.conf configuration file. Generally, the types of Situations created for a given set of alerts depend on the rate at which the alerts occur. You respond by adjusting the Sigaliser's resolution and window parameters to match the activity.
The algorithms work by spotting the signature scatter pattern of alerts within a time period. First, they determine the optimal number of clusters, which should correspond to the number of current, active, service-threatening outages in the window that the Sigaliser operates on. Second, they factorize the alerts into individual groups, which Moogsoft AIOps calls Situations. Once a Situation exists, a Situation Room is created in the Moogsoft AIOps database, and you are notified through the Situation View in the user interface.
The algorithm runs in semi real-time and is triggered by either:
- A fixed polled time period.
- A single time slice being filled up, the width of which is set by the resolution parameter in the configuration. For example, the first alert that arrives after the current slice has been filled triggers the Sigaliser to run its algorithms.
You can define the Sigaliser behavior in the Sigaliser section of the Moogfarmd configuration file. In general, the following parameters can be configured to produce either more Situations with fewer alerts or fewer Situations with more alerts. The consequence of having more Situations with fewer alerts is that the same underlying outage could be split across multiple Situations. Having fewer Situations with more alerts means that a single Situation can contain alerts from multiple service outages. The goal of tuning the Sigaliser parameters is an optimal configuration in which Situations sharply reflect the state of the managed systems. Moogsoft refers to Situations as being “sharp” and well “resolved” when the parameters give you the best fit of Situations to service outages.
Sigaliser contains a number of properties. The name, classname, and run_on_startup properties are shared with other Moolets.
{
    name              : "Sigaliser",
    classname         : "CSigaliser",
    run_on_startup    : false,
    process_output_of : "AlertBuilder"
}
The name is hardcoded and should never be changed from Sigaliser.
The classname, CSigaliser, is hardcoded and should never be changed.
By default, run_on_startup is set to false, so when Moogfarmd starts, it does not automatically create an instance of the Sigaliser. In this case you can start it using farmd_ctrl.
The process_output_of property directs which output the Moolet processes. It instructs the Moolet to process the output of the Alert Builder or the Alert Rules Engine. Usually the Sigaliser connects directly to the Alert Builder, and the Alert Rules Engine is only used if automations are desired prior to Situation resolution. The Sigaliser can have only one input.
The Sigaliser runs the matrix factorization algorithms, the properties for which are as follows:
# Algorithm
time_compression     : true,
alert_threshold      : 2,
membership_limit     : 3,
sig_similarity_limit : 0.7,
sig_alert_horizon    : 0.5,
scale_by_severity    : false,
entropy_threshold    : 0.0,
If time_compression is set to true, the algorithm ignores any empty time buckets in the Sigaliser calculation. If set to false, it includes the empty time buckets. We recommend that you set time_compression to true for low data rates and false for normal data rates.
You only require time_compression in scenarios where the data rate is very low compared to the values of window and resolution. In certain low data-rate scenarios it is possible for a window or resolution to contain no alerts. For example, if the data rate is two alerts per hour and the window is 15 minutes, then on average some of the time buckets in any Situation calculation will be empty. When time_compression is true, empty time buckets are removed from the calculation, but the total number of buckets used in the calculation remains the same.
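The effect of time_compression on bucket selection can be sketched as follows. This is a minimal illustration, not the Sigaliser's actual implementation: the function name select_buckets and the reading that compression reaches further back in time to keep the bucket count constant are assumptions made for this example.

```python
from collections import Counter

def select_buckets(alert_times, now, resolution, window, time_compression):
    """Return per-bucket alert counts, oldest first (hypothetical helper)."""
    # Assign each alert timestamp (in seconds) to a bucket of `resolution` seconds.
    counts = Counter(int(t // resolution) for t in alert_times if t <= now)
    current = int(now // resolution)
    if not time_compression:
        # Keep the most recent `window` buckets, whether or not they are empty.
        return [counts.get(b, 0) for b in range(current - window + 1, current + 1)]
    # Assumption: drop empty buckets and take the most recent `window`
    # non-empty ones, so the number of buckets used stays the same
    # whenever enough non-empty buckets exist.
    non_empty = sorted(counts)
    return [counts[b] for b in non_empty[-window:]]
```

With alerts at 10, 130, 900 and 1000 seconds, a resolution of 60 and a window of 3, the uncompressed selection at 1019 seconds contains one empty bucket, while the compressed selection contains three non-empty buckets.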
alert_threshold defines the minimum number of alerts that a Situation can contain, so increasing alert_threshold reduces the total number of Situations. We recommend an alert_threshold of 2.
You can use alert_threshold in conjunction with small values of membership_limit to produce a smaller number of Situations, each of which has more alerts.
The Situation creation process contains multiple steps, including a resolution step and a merging step. During the merging phase, the raw Situations from the factorization calculation are compared and merged with the currently active Situations. This step determines whether a detected Situation is novel or an evolution over time of an existing Situation.
The membership_limit property restricts the number of Situations in which an alert can appear. As Situations become merged with each other over time, it is possible for an alert to appear in more Situations than are defined by membership_limit.
Changing the value of membership_limit does not have a large impact on the total number of Situations but does change the distribution of the number of alerts in each Situation.
Decreasing membership_limit results in fewer Situations containing a large number of alerts and more Situations containing a small number of alerts, whereas increasing membership_limit results in more Situations containing a large number of alerts and fewer Situations containing a small number of alerts. The optimal value seems to be between one and five, with a recommended membership_limit of three.
sig_similarity_limit is the degree of similarity that two Situations must have before they are merged together. The value is the Jaccard Similarity Coefficient (JSC), defined as the ratio of the number of alerts shared by two Situations to the total number of unique alerts across both Situations.
For example, if Situation 1 and Situation 2 share two common alerts and each Situation has one unique alert:
JSC = 2 (common alerts) / [1 (unique to Situation 1) + 2 (common to both) + 1 (unique to Situation 2)] = 2/(1+2+1) = 2/4 = 0.5.
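The same calculation can be sketched in Python. The function names are hypothetical, and treating the limit as inclusive (merging when the JSC is greater than or equal to sig_similarity_limit) is an assumption, since the comparison used by the Sigaliser is not stated here.

```python
def jaccard_similarity(alerts_a, alerts_b):
    """Ratio of shared alerts to total unique alerts across two Situations."""
    a, b = set(alerts_a), set(alerts_b)
    union = a | b
    if not union:
        return 0.0  # two empty Situations share nothing
    return len(a & b) / len(union)

def should_merge(alerts_a, alerts_b, sig_similarity_limit=0.7):
    # Assumption: the limit is inclusive.
    return jaccard_similarity(alerts_a, alerts_b) >= sig_similarity_limit

situation_1 = {"alert_1", "alert_2", "alert_3"}  # alert_1 is unique to Situation 1
situation_2 = {"alert_2", "alert_3", "alert_4"}  # alert_4 is unique to Situation 2
print(jaccard_similarity(situation_1, situation_2))  # 0.5
```

At the recommended limit of 0.7, these two Situations would not be merged under this sketch; at a limit of 0.5 they would.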
Reducing the similarity limit reduces the total number of Situations. Smaller values increase the likelihood of Situations being merged together, because they need to share fewer alerts to be viewed as the same Situation. Conceptually, JSC values less than 0.5 are hard to justify as grounds for merging, so use them with care. We recommend a sig_similarity_limit of 0.7.
When the Sigaliser algorithm initially identifies a Situation, some of the alerts it contains are more representative of the Situation than others. The sig_alert_horizon parameter, which takes a value between 0.0 and 1.0, lets you set a cut-off for membership relative to the most significant alert in the cluster. If you set this value to 0.5, for example, only alerts whose “significance” for the Situation is more than half that of the most significant alert in the Situation are included. The default value is 0.5.
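A minimal sketch of this cut-off, assuming each alert already carries a numeric significance score for the Situation. The function name apply_alert_horizon and the dictionary representation are hypothetical.

```python
def apply_alert_horizon(significance, sig_alert_horizon=0.5):
    """Keep alerts whose significance exceeds a fraction of the maximum.

    `significance` maps an alert id to its significance for one Situation
    (a hypothetical representation chosen for this example).
    """
    if not significance:
        return {}
    cutoff = sig_alert_horizon * max(significance.values())
    # "More than" the cut-off: an alert exactly at the cut-off is excluded.
    return {alert: s for alert, s in significance.items() if s > cutoff}

scores = {"a": 1.0, "b": 0.6, "c": 0.5, "d": 0.2}
print(sorted(apply_alert_horizon(scores)))  # ['a', 'b']
```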
The value of entropy_threshold is the minimum entropy that an alert must possess to be included in the Sigaliser calculation. Any alert that arrives at the Sigaliser with entropy below this value is never included in a Situation. The value is between 0.0 and 1.0, and the default of 0.0 means that every alert is processed.
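As a rough illustration, the filter can be thought of as follows. Representing an alert as a dictionary with an "entropy" key is an assumption made for this sketch, not the Sigaliser's data model.

```python
def filter_by_entropy(alerts, entropy_threshold=0.0):
    """Drop alerts whose entropy is below the threshold (hypothetical helper)."""
    return [a for a in alerts if a["entropy"] >= entropy_threshold]

alerts = [
    {"id": 1, "entropy": 0.05},
    {"id": 2, "entropy": 0.40},
    {"id": 3, "entropy": 0.90},
]
# With the default of 0.0 every alert passes; raising the threshold removes
# low-entropy alerts before any Situation is calculated.
print([a["id"] for a in filter_by_entropy(alerts, 0.3)])  # [2, 3]
```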
scale_by_severity (scaleBySev) allows you to bias Moogsoft AIOps so that high severity alerts are treated as having higher entropy. If the same alert arrives once with a critical severity and once with a minor severity, the critical alert is given higher entropy than the minor alert. The scaling multiplies the entropy by the severity constant divided by the maximum severity (5). So a critical alert keeps all of its entropy, a minor alert keeps three fifths of its entropy, and a clear alert has an entropy value of 0.0.
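The scaling described above amounts to a single multiplication. In this sketch the severity constants (critical = 5, minor = 3, clear = 0) are inferred from the fractions given in the text, and the function name is hypothetical.

```python
MAX_SEVERITY = 5  # critical

def scaled_entropy(entropy, severity, scale_by_severity=False):
    """Scale an alert's entropy by severity / MAX_SEVERITY when enabled."""
    if not scale_by_severity:
        return entropy
    return entropy * severity / MAX_SEVERITY

# The same alert entropy of 0.5 at three severities:
for name, severity in (("critical", 5), ("minor", 3), ("clear", 0)):
    print(name, scaled_entropy(0.5, severity, scale_by_severity=True))
```

Whether entropy_threshold is applied before or after this scaling is not stated in this section, so the sketch leaves the two independent.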
The algorithm runs incrementally as events are ingested, so Situations are produced and updated in real time. There are two ways to trigger the algorithm: using a time interval or using the rate of the event stream.
# Triggers
sig_on_bucket : true,
sig_interval  : 100,
max_backlog   : 1000000,
# Time Buckets
resolution    : 120,
window        : 90
The optimal trigger for production is sig_on_bucket = true, provided this gives satisfactory Situation accuracy and Situations are updated regularly. sig_on_bucket can also simulate real-time behavior using historical data.
When Situations are not being updated regularly enough, set sig_on_bucket to false and set sig_interval to a value no more than half of the real-time size of the window.
In a production environment, set max_backlog to a high value to avoid triggering the Sigaliser between timed executions. This parameter causes the algorithms to run if the number of events that arrive before either a scheduled execution or a bucket being filled is above this value. Use it with care and only when you have an environment where the event rate is highly variable.
If sig_on_bucket is set to true, the Sigaliser runs whenever a new time bucket occurs. Depending upon the data rate, this has the effect of executing the Sigaliser every “resolution” seconds.
Setting sig_on_bucket to true deactivates both the sig_interval and max_backlog triggers.
sig_interval executes the Sigaliser algorithm every defined number of seconds. In the example above, this is every 100 seconds.
sig_interval and max_backlog do not override each other; consequently, it is possible for the Sigaliser to be executed more frequently than the sig_interval value.
max_backlog executes the Sigaliser once the defined number of alerts has been received since the last execution. In the example above, the Sigaliser is executed after 1,000,000 alerts are received.
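The interaction between the three triggers can be summarised in a short sketch. It is a simplified model rather than the Sigaliser's scheduler: the function name and arguments are hypothetical, it follows the statement that sig_on_bucket = true deactivates the other two triggers, and it assumes that the interval and backlog limits are inclusive.

```python
def should_run(sig_on_bucket, new_bucket_started,
               seconds_since_last_run, alerts_since_last_run,
               sig_interval=100, max_backlog=1000000):
    """Decide whether the algorithm should execute now (illustrative only)."""
    if sig_on_bucket:
        # Bucket mode: only the start of a new time bucket triggers a run.
        return new_bucket_started
    # Timed mode: sig_interval and max_backlog act independently, so a large
    # backlog can trigger a run before the interval has elapsed.
    return (seconds_since_last_run >= sig_interval
            or alerts_since_last_run >= max_backlog)

# 40 seconds after the last run, a burst of 60,000 alerts with
# max_backlog = 50000 triggers an early execution in timed mode.
print(should_run(False, False, 40, 60000, sig_interval=120, max_backlog=50000))  # True
```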
resolution is the duration, in seconds, of each bucket of time that the event stream is divided into. A high value for resolution results in Situations that are less “sharp” in time: the wider the bucket, the more likely it is that alerts from unrelated outages occur in the same bucket, and potentially in the same Situation.
window is the number of time buckets to include in the calculation. Choose the width of the window to match the average time period over which outages typically evolve. The total amount of time considered in any Sigaliser calculation is window multiplied by resolution.
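As a quick check of the arithmetic, the following sketch computes the time span covered by one calculation for the example values shown earlier (resolution of 120 seconds, window of 90 buckets). The function name is hypothetical.

```python
def calculation_span_seconds(resolution, window):
    """Total time considered in one calculation: window buckets of `resolution` seconds."""
    return resolution * window

span = calculation_span_seconds(resolution=120, window=90)
print(span, "seconds =", span / 3600, "hours")  # 10800 seconds = 3.0 hours
```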
In general, for a high data rate you would use a smaller resolution and window than for a low data rate. For a fixed data rate, a smaller resolution will generally result in more Situations.
The diagram below illustrates how a Sigaliser can be triggered every 180 seconds if 'sig_on_bucket' is set to 'true', the time bucket resolution is set to '60' and the window is '3':
The diagram below illustrates how a Sigaliser can be triggered if 'sig_interval' is set to 120 seconds and if 'max_backlog' is set to 50,000 events: