
Introduction

The Sigaliser Moolet, also known as Sigaliser Classic, is the stage of Event processing in which Alert streams from the Alert Builder or the Alert Rules Engine are converted into Situations.

The Sigaliser is self-contained and has no MooBot. It takes every occurrence of an Event in an Alert stream and uses matrix factorisation algorithms to identify clusters of temporally correlated Alerts, which indicate underlying service outages, or Situations. The Sigaliser then updates its own internal record of the Situations and the AIOps database before publishing the updates on the MooMS bus.

Basic concepts

There are a number of parameters in moog_farmd.conf that allow you to tune the type of Situations created by the Sigaliser. Generally, the types of Situations created for a given set of alerts depend on the rate at which alerts occur. You respond by adjusting the Sigaliser's resolution and window parameters to match that activity.

The algorithms work by spotting signature scatter patterns of Alerts within a time period. First, they determine how many optimal clusters there are, which should correspond to the number of current, active, service-threatening outages in the window that the Sigaliser operates on. Second, they factorise the data into individual groups, which Moogsoft calls Situations. Once a Situation exists, a Situation Room is created in the MOOG database, and you are notified through the Situation View in the User Interface.

The algorithm is run in semi real-time and is triggered by either:

  • A fixed polled time period
  • A single time slice being filled up, the width of which is set by the resolution parameter in the configuration. For example, the first alert that arrives after the current slice has been filled triggers the Sigaliser to run its algorithms.

Sigaliser configuration walk-through

The behaviour of the Sigaliser is defined in the moog_farmd configuration file in a section titled Sigaliser. In general, the parameters can be configured to produce either more Situations with fewer alerts, or fewer Situations with more alerts. The consequence of having more Situations with fewer alerts is that the same underlying outage could be split across multiple Situations. Having fewer Situations with more alerts means that a single Situation may contain alerts from multiple service outages. Tuning the Sigaliser parameters leads to an optimal configuration in which Situations sharply reflect the state of the managed systems. Moogsoft refers to Situations as being “sharp” and well “resolved” when the parameters give you the best fit of Situations to service outages.

The Sigaliser contains a number of parameters. name, classname, and run_on_startup are shared with other Moolets.

        {
            name : "Sigaliser",
            classname : "CSigaliser",
            run_on_startup : false,
            #process_output_of : "AlertRulesEngine",
            process_output_of : "AlertBuilder",

name

name is hardcoded and should never be changed from Sigaliser.

classname

The classname, CSigaliser, is hardcoded and should never be changed.

run_on_startup

By default, run_on_startup is set to false, so that when moog_farmd starts, it does not automatically create an instance of the Sigaliser. In this case you can start it using farmd_ctrl.

Undertaking the sigalising

The next parameter in the Moolet, shown in the configuration above with one alternative commented out, directs which output should be processed:

process_output_of

process_output_of informs the Moolet to process the output of the Alert Builder or Alert Rules Engine. Usually the Sigaliser connects directly to the AlertBuilder, the AlertRulesEngine only being used if automations are desired prior to Situation resolution. The Sigaliser can have only one input.

Algorithmics

The Sigaliser runs the matrix factorisation algorithms, the parameters for which are identified in the configuration below: 

            # Algorithm
            time_compression : true,
            alert_threshold : 2,
            membership_limit : 3,
            sig_similarity_limit : 0.7,
            sig_alert_horizon : 0.5,
            scale_by_severity : false,
            entropy_threshold : 0.0,

time_compression

If set to true, the algorithm ignores any empty time buckets in the Sigaliser calculation. If set to false, it includes the empty time buckets. Moogsoft recommends setting time_compression to true for low data rates and to false for normal data rates.

You only require time_compression in scenarios where the data rate is very low compared to the values of window and resolution. In certain low data-rate scenarios, it is possible for a window or resolution to contain no alerts. For example, if the data rate is two alerts per hour and the window is 15 minutes, then on average some of the time buckets in any Situation calculation will be empty. When time_compression is true, empty time buckets are removed from the calculation, but the total number of buckets used in the calculation remains the same.
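The effect of time_compression on bucket selection can be illustrated with a minimal Python sketch. This is only one reading of the description above, not the Sigaliser's actual implementation: the function name select_buckets and the representation of buckets as a list of alert counts are assumptions made for illustration.

```python
from typing import List


def select_buckets(bucket_counts: List[int], window: int,
                   time_compression: bool) -> List[int]:
    """Pick the buckets that feed one calculation (hypothetical helper).

    bucket_counts is ordered oldest to newest; each entry is the number
    of alerts that fell into that time bucket.
    """
    if window <= 0:
        raise ValueError("window must be a positive number of buckets")
    if time_compression:
        # Assumption: empty buckets are skipped, so older non-empty
        # buckets are pulled in until `window` buckets are available.
        candidates = [count for count in bucket_counts if count > 0]
    else:
        candidates = list(bucket_counts)
    # Keep the most recent `window` buckets of what remains.
    return candidates[-window:]


counts = [3, 0, 0, 2, 0, 1, 0, 4]
print(select_buckets(counts, window=3, time_compression=False))
print(select_buckets(counts, window=3, time_compression=True))
```

In both cases three buckets are used, but with compression enabled none of them is empty.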

alert_threshold

Defines the minimum number of alerts that a Situation can contain. So, increasing the alert_threshold will reduce the total number of Situations. Moogsoft recommends an alert_threshold of 2. 

alert_threshold can be used in conjunction with small values of membership_limit to produce a smaller number of Situations, each of which has more alerts.
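As a rough illustration of the threshold, the following Python sketch discards candidate Situations that contain too few alerts. The function name and the modelling of a Situation as a set of alert IDs are assumptions for the example, not part of the product.

```python
from typing import List, Set


def apply_alert_threshold(candidates: List[Set[str]],
                          alert_threshold: int = 2) -> List[Set[str]]:
    """Keep only candidate Situations with at least alert_threshold alerts."""
    return [alerts for alerts in candidates if len(alerts) >= alert_threshold]


candidates = [{"a1"}, {"a1", "a2"}, {"a3", "a4", "a5"}]
print(apply_alert_threshold(candidates))
```

Raising alert_threshold to 3 in this example would leave only the last candidate, which shows why a higher threshold reduces the total number of Situations.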

membership_limit

The Situation creation process contains multiple steps, including a resolution step and a merging step. During the merging phase, the raw Situations from the factorisation calculation are compared and merged with the currently active Situations. This step determines whether a detected Situation is novel or an evolution in time of an existing Situation.

membership_limit restricts the number of Situations that an alert can appear in. As Situations get merged with each other over time, it is possible for an alert to appear in more Situations than are defined by membership_limit. Changing the value of membership_limit does not have a large impact on the total number of Situations but does change the distribution of the number of alerts in each Situation.

Decreasing membership_limit results in fewer Situations with a large number of alerts and more Situations containing a small number of alerts. Increasing membership_limit results in more Situations with a large number of alerts and fewer Situations containing a small number of alerts. The optimal value seems to be between one and five, with a recommended membership_limit of three.

sig_similarity_limit (Jaccard Similarity Coefficient)

A measure of the similarity between two Situations before they are merged together. The value is the Jaccard Similarity Coefficient (JSC) defined as the ratio of shared Alerts between two Situations to total unique Alerts in both Situations.

For example, if Situation 1 and Situation 2 share two common Alerts and each Situation also has one unique Alert:
JSC = 2 (common Alerts) / [1 (unique to Situation 1) + 2 (common to both) + 1 (unique to Situation 2)] = 2/(1+2+1) = 2/4 = 0.5

Reducing the similarity index will reduce the total number of Situations. Smaller values increase the likelihood of Situations being merged together, as they have to share fewer Alerts in common to be viewed as the same Situation. Conceptually, JSC values less than 0.5 are hard to justify as grounds for merging, so should be used with care. Moogsoft recommends a sig_similarity_limit of 0.7.
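The JSC calculation and the merge decision can be sketched in Python as follows. The function names are illustrative, and the assumption that two Situations merge when the coefficient is greater than or equal to sig_similarity_limit (rather than strictly greater) is not confirmed by this page.

```python
from typing import Set


def jaccard(a: Set[str], b: Set[str]) -> float:
    """Ratio of shared alerts to total unique alerts across both Situations."""
    union = a | b
    if not union:
        return 0.0  # two empty Situations share nothing
    return len(a & b) / len(union)


def should_merge(a: Set[str], b: Set[str],
                 sig_similarity_limit: float = 0.7) -> bool:
    # Assumption: the limit is inclusive.
    return jaccard(a, b) >= sig_similarity_limit


situation_1 = {"a1", "a2", "a3"}  # a1 is unique to Situation 1
situation_2 = {"a2", "a3", "a4"}  # a4 is unique to Situation 2
print(jaccard(situation_1, situation_2))
print(should_merge(situation_1, situation_2))
```

The two example Situations reproduce the 0.5 coefficient worked out above, so they would not be merged at the recommended limit of 0.7.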

sig_alert_horizon

When the Sigaliser algorithm initially identifies a Situation, some of the alerts it contains are more representative of the Situation than others. This parameter, which takes a value between 0.0 and 1.0, allows you to provide a cut-off for membership based upon the most significant alert in the cluster. If you set this value to 0.5, for example, only alerts whose “significance” for the Situation is more than half that of the most significant alert in the Situation are included. The default value is 0.5.
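A minimal Python sketch of this cut-off is shown below. How the Sigaliser computes “significance” is not described here, so the scores are simply assumed to be given as a mapping from a hypothetical alert ID to a number.

```python
from typing import Dict


def apply_alert_horizon(significance: Dict[str, float],
                        sig_alert_horizon: float = 0.5) -> Dict[str, float]:
    """Keep alerts whose significance exceeds a fraction of the maximum."""
    if not significance:
        return {}
    cutoff = sig_alert_horizon * max(significance.values())
    # "More than" the cut-off, so the comparison is strict.
    return {alert: score for alert, score in significance.items()
            if score > cutoff}


scores = {"a1": 0.8, "a2": 0.5, "a3": 0.3}
print(apply_alert_horizon(scores))
```

With a most significant alert of 0.8 and a horizon of 0.5, the cut-off is 0.4, so a3 is dropped from the Situation.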

entropy_threshold

The value of this parameter is the minimum entropy that an alert must possess to be included in the Sigaliser calculation. Any alert that arrives at the Sigaliser with entropy below this value will never be included in a Situation. It has a value between 0.0 and 1.0 and has a default of 0.0 which means every alert will be processed.
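The filter described above amounts to a simple comparison, sketched here in Python. The function name is hypothetical; the entropy value is assumed to be already attached to the alert when it arrives.

```python
def passes_entropy_threshold(entropy: float,
                             entropy_threshold: float = 0.0) -> bool:
    """Return True if an alert is eligible for the Sigaliser calculation."""
    # Only alerts strictly below the threshold are excluded.
    return entropy >= entropy_threshold


print(passes_entropy_threshold(0.0))        # default threshold: everything passes
print(passes_entropy_threshold(0.2, 0.35))  # low-entropy alert is excluded
```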

scale_by_severity

scale_by_severity allows you to bias MOOG so that high severity alerts are treated as having higher entropy. If the same alert arrived once with critical severity and once with minor severity, the critical one would be given the higher entropy. The scaling factor is the numeric severity divided by the maximum severity (5). So a critical alert keeps all of its entropy, a minor alert keeps three fifths of it, and a clear alert gets an entropy value of 0.0.
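The scaling can be written out as a short Python sketch. The numeric severities used in the example (0 for clear, 3 for minor, 5 for critical) follow from the fractions quoted above; the function and constant names are illustrative only.

```python
MAX_SEVERITY = 5  # critical


def scaled_entropy(entropy: float, severity: int,
                   scale_by_severity: bool) -> float:
    """Scale an alert's entropy by severity / MAX_SEVERITY when enabled."""
    if not scale_by_severity:
        return entropy
    return entropy * severity / MAX_SEVERITY


for name, severity in [("critical", 5), ("minor", 3), ("clear", 0)]:
    print(name, scaled_entropy(0.8, severity, scale_by_severity=True))
```

Note that with scaling enabled a clear alert always ends up with an entropy of 0.0, whatever its original entropy was.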

Triggers and Time Buckets

The algorithm is run incrementally as Events are ingested, so Situations are produced and updated in real time. There are two ways to trigger the algorithm: using a time interval or using the rate of the Event stream.

            # Triggers
            sig_on_bucket : true,
            sig_interval : 100,
            max_backlog : 1000000,
            # Time Buckets
            resolution : 120,
            window : 90
        }

The optimal trigger for production should be sig_on_bucket=true, provided this ensures satisfactory Situation accuracy and that Situations are being regularly updated. sig_on_bucket can also simulate real-time behavior using historical data. 

When Situations are not being updated regularly enough, configure sig_on_bucket = false and set sig_interval to a value no more than half of the real-time size of the window. 
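The guideline for sig_interval can be checked with a few lines of Python. This sketch assumes that the “real-time size of the window” is the number of buckets multiplied by the bucket width in seconds; the function names are illustrative.

```python
def window_span_seconds(resolution: int, window: int) -> int:
    """Total time covered by one calculation: window buckets of resolution seconds."""
    return resolution * window


def sig_interval_within_guideline(sig_interval: int, resolution: int,
                                  window: int) -> bool:
    """True if sig_interval is no more than half of the window's real-time size."""
    return sig_interval <= window_span_seconds(resolution, window) / 2


# Values from the example configuration: resolution 120, window 90, sig_interval 100
print(window_span_seconds(120, 90))
print(sig_interval_within_guideline(100, 120, 90))
```

With the example configuration the window spans 10,800 seconds, so any sig_interval up to 5,400 seconds meets the guideline.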

In a production environment, set max_backlog to a high value to avoid triggering the Sigaliser between timed executions. This parameter causes the algorithms to run if the number of Events that arrive before either a scheduled execution or a bucket being filled exceeds this value. Use it with care, and only in an environment where the event rate is highly variable.

sig_on_bucket

If set to true, the Sigaliser will run whenever a new time bucket occurs. Depending upon the data rate, this has the effect of executing the Sigaliser after every defined number of “resolution” seconds. 

sig_on_bucket = true deactivates both the sig_interval and max_backlog triggers.

sig_interval

Executes the Sigaliser algorithm every defined number of seconds; in the example above, every 100 seconds.

sig_interval and max_backlog do not override each other; consequently, the Sigaliser can be executed more frequently than the sig_interval value alone would suggest.

max_backlog

Executes the Sigaliser once the defined number of Alerts has been received since the last execution; in the example above, the Sigaliser is executed after 1,000,000 Alerts are received.
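The interaction of the three triggers can be summarised in a Python sketch. It follows the statements above that sig_on_bucket = true deactivates the other two triggers and that sig_interval and max_backlog are evaluated independently. The TriggerConfig class, the should_run function and its arguments are hypothetical names, not part of moog_farmd.

```python
from dataclasses import dataclass


@dataclass
class TriggerConfig:
    sig_on_bucket: bool = True
    sig_interval: int = 100       # seconds
    max_backlog: int = 1_000_000  # alerts


def should_run(cfg: TriggerConfig, now: float, last_run: float,
               alerts_since_last_run: int, new_bucket_started: bool) -> bool:
    """Decide whether the algorithm should run now (illustrative only)."""
    if cfg.sig_on_bucket:
        # Bucket trigger only; interval and backlog triggers are inactive.
        return new_bucket_started
    if now - last_run >= cfg.sig_interval:
        return True  # timed execution is due
    # Backlog trigger can fire between timed executions.
    return alerts_since_last_run >= cfg.max_backlog


cfg = TriggerConfig(sig_on_bucket=False, sig_interval=100, max_backlog=1000)
print(should_run(cfg, now=150, last_run=100,
                 alerts_since_last_run=1000, new_bucket_started=False))
```

In the usage example only 50 seconds have passed since the last run, but the backlog has reached max_backlog, so the algorithm runs ahead of the next timed execution.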

resolution

The duration, in seconds, for each bucket of time that the event stream is divided into. A high value for the resolution will result in Situations that are less “sharp” in time, as the wider the bucket the more likely that alerts from disconnected outages will occur in the same bucket, and potentially in the same Situation.

window

The number of time buckets to include in the calculation. The width of the window should be chosen to match the average time period over which outages typically evolve. The total amount of time considered in any Sigaliser calculation is window multiplied by resolution.

In general, for a high data rate you would use a smaller resolution and window than for a low data rate. For a fixed data rate, a smaller resolution will generally result in more Situations.
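To make the relationship between resolution and window concrete, the following Python sketch assigns alert timestamps to buckets and returns the counts for the buckets inside the window. It assumes integer timestamps in seconds and buckets aligned to multiples of resolution; the function name and the alignment rule are assumptions for illustration, not a description of the Sigaliser's internals.

```python
from collections import Counter
from typing import List


def bucket_counts(timestamps: List[int], resolution: int, window: int,
                  now: int) -> List[int]:
    """Alert counts for the `window` most recent buckets, oldest first."""
    current = now // resolution   # index of the bucket containing `now`
    first = current - window + 1  # oldest bucket still inside the window
    counts = Counter(ts // resolution for ts in timestamps)
    return [counts.get(index, 0) for index in range(first, current + 1)]


# resolution of 60 seconds and a window of 3 buckets covers 180 seconds
print(bucket_counts([10, 70, 75, 130, 181], resolution=60, window=3, now=185))
```

The alert at timestamp 10 falls into a bucket that is older than the window, so it does not contribute to the calculation.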

Diagrams

The diagram below illustrates how the Sigaliser can be triggered every 180 seconds if 'sig_on_bucket' is set to 'true', the time bucket resolution is set to '60' and the window is '3':



The diagram below illustrates how the Sigaliser can be triggered if 'sig_interval' is set to 120 seconds and if 'max_backlog' is set to 50,000 Events:

