Introduction

Speedbird groups events related to an actionable outage into clusters of their related Alerts. These clusters are service impacting, with the group of ‘clustered’ Alerts providing operational value to someone using the system. 

Speedbird allows you to configure a set of Event parameters that drive the clustering in addition to time. For example, you may want to group Alerts and Events that coincide in time and also coincide in another value of the Event, such as the hostname. Speedbird can then create clusters of Alerts that have a similar hostname and occurred at a similar time.

Speedbird's Algorithm

The algorithmic technique used by Speedbird is based on K-means, a well-understood, traditional clustering algorithm and a form of unsupervised machine learning. For the Speedbird Moolet, Moogsoft AIOps combines the K-means algorithm with some of the same algorithmic tool chain that is used in the Sigaliser. For instance, AIOps still uses the same time-based determination of how many real clusters there are in the data at a given point in time. Non-negative matrix factorisation in the limit collapses into a K-means calculation, but is more computationally efficient.

Configuration

To produce optimal results when configuring Speedbird, read the following parameter descriptions in conjunction with the Tuning guidelines.

sig_resolution

In moog_farmd.conf, there is a general sig_resolution parameter grouping before the Moolet definitions with the following parameters:

            sig_resolution :
            {
                 alert_threshold : 1,
                 sig_similarity_limit : 0.7
            },
  • These parameters apply to all Sigalisers running in a given farmd, whether Speedbird or the traditional Sigaliser. The sig_resolution parameters are used to compare a candidate Situation with pre-existing Situations and determine whether it is an evolution of an existing Situation or a new Situation.

Moolet and Algorithm

The parameter groups Moolet and Algorithm function in the same way as those in the existing Sigaliser. 

            # Moolet
            name : "Speedbird",
            classname : "CSpeedbird",
            run_on_startup : false,
            process_output_of : "AlertBuilder",

            # Algorithm
            time_compression : true,
            scale_by_severity : true,
            entropy_threshold : 0.35,

For further information on these parameters, see the table below:

| Parameter | Input | Description | Example |
|---|---|---|---|
| name | - | The name of the Sigaliser. | Speedbird |
| classname | - | The classname of the Sigaliser. This is hardcoded and should never be changed. | CSpeedbird |
| run_on_startup | Boolean | If enabled, an instance of the Sigaliser is created when moog_farmd starts. This is disabled by default. | false |
| process_output_of | AlertBuilder or AlertRulesEngine | Sets whether the Sigaliser processes the output of the Alert Builder or the Alert Rules Engine. The latter can only be used if automations are desired prior to the Situation resolution. Please note: The Sigaliser can only have one input. | AlertBuilder |
| time_compression | Boolean | If enabled, the Sigaliser ignores empty time buckets. If disabled, it includes empty time buckets. Please note: For low data rates set this to 'true'; for normal data rates set this to 'false'. | true |
| scale_by_severity | Boolean | If enabled, high severity Alerts are treated as having higher entropy. The scaling factor is the severity constant number divided by the maximum severity (5). | true |
| entropy_threshold | Decimal | The minimum entropy that an Alert must have to be included in the Sigaliser calculation. Any Alert that arrives at the Sigaliser with a lower entropy than this value is not included in Situations. Please note: The default value of 0.0 means every Alert is processed by the Sigaliser. | 0.35 |
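
The entropy_threshold behaviour described in the table can be pictured as a simple filter. The sketch below is illustrative only: the Alert dictionaries and the `filter_by_entropy` helper are hypothetical names chosen for this example and are not part of any Moogsoft API.

```python
def filter_by_entropy(alerts, entropy_threshold=0.0):
    """Keep only Alerts whose entropy is at least entropy_threshold."""
    return [a for a in alerts if a["entropy"] >= entropy_threshold]


# Hypothetical Alerts; only the "entropy" field matters for this sketch.
alerts = [
    {"alert_id": 1, "entropy": 0.10},
    {"alert_id": 2, "entropy": 0.35},
    {"alert_id": 3, "entropy": 0.80},
]

kept = filter_by_entropy(alerts, entropy_threshold=0.35)
print([a["alert_id"] for a in kept])  # [2, 3]
```

With the default threshold of 0.0 the same helper keeps every Alert, which matches the note in the table.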


sig_alert_horizon

The sig_alert_horizon parameter allows you to prune clusters. The value allows you to control when you remove outlying Events from the cluster:

  • If the value is less than 0.0, no pruning is undertaken.
  • At 0.0, members that are further than one standard deviation from the centroid of the cluster are eliminated.
  • At values greater than 0.0, the standard deviation is multiplied by sig_alert_horizon, and members further than mean + sig_alert_horizon*std_dev from the centroid are eliminated.

            sig_alert_horizon : 0.0,

Every cluster has a centroid, which is the average point in the middle of a cluster.


In the diagram above there are three points in a defined cluster (X) and the centroid (C), which is not a real point in this phase space but represents the center of the cluster. Computing the distance of each point from the centroid of the cluster gives an average distance and a standard deviation, and the standard deviation indicates the spread of the cluster. A low standard deviation, e.g. 0, means all of the points are the same distance from the centroid, whereas a high standard deviation means their distances from the centroid vary widely, indicating a random cluster.
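
The centroid, distance and standard deviation calculation described above can be sketched in a few lines. This is a minimal illustration under stated assumptions, not Speedbird's implementation: points are assumed to be plain numeric tuples, the population standard deviation is assumed, and the hypothetical `prune` helper only applies the mean + sig_alert_horizon*std_dev formula quoted for positive values (how the product treats exactly 0.0 may differ from this sketch).

```python
import math
from statistics import mean, pstdev


def centroid(points):
    """Component-wise average of the points in a cluster."""
    return [mean(axis) for axis in zip(*points)]


def distances_from_centroid(points):
    """Euclidean distance of every point from the cluster centroid."""
    c = centroid(points)
    return [math.dist(p, c) for p in points]


def prune(points, sig_alert_horizon):
    """Drop members further than mean + horizon * std_dev from the centroid."""
    if sig_alert_horizon < 0.0:
        return list(points)  # negative horizon: no pruning
    d = distances_from_centroid(points)
    limit = mean(d) + sig_alert_horizon * pstdev(d)
    return [p for p, dist in zip(points, d) if dist <= limit]


# Three close points and one outlier (hypothetical coordinates).
cluster = [(0, 0), (2, 0), (0, 2), (10, 10)]
print(prune(cluster, -0.1))  # all four members kept
print(prune(cluster, 1.0))   # the outlier (10, 10) is removed
```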

components

You can choose which parameters of an Event are used by the clustering algorithm. In the following example, "source", "source_id", "description" are declared:

            components : [ "source","source_id","description" ]

Additionally, the system always takes into consideration the time that the Event arrives in the system (event_time, or last_occurred for an Alert). You can have as many components as you like, but each additional component introduces more numerical complexity into the system, and there is a chance you will get a smaller number of Alerts per cluster and less correlation.

Partitioning

There are two methods of partitioning the data into Situations. The first is 'partition_by' which splits the clusters according to the parameters specified. The second is 'pre_partition', which splits the incoming Event stream before clustering. 

Please note: Pre-partitioning is recommended as it does not interfere with the results of the clustering algorithms

partition_by

After clustering has taken place and before you enter merging and resolution, you can split clusters into sub-clusters based on a component of the Events. For example, you can use the manager parameter to ensure the Situations only contain Events from the same manager. In general, and by default, you should comment out the partition_by parameter.

pre_partition

An alternative way of partitioning is to use pre_partition, which allows you to specify a component field (from the list of specified components) around which the Event stream is partitioned before the K-means clustering occurs. The Alerts in each resulting Situation all share a single value for the chosen component field.

For example, suppose the Speedbird components option is set to:

            components : [ "source","manager","description" ],

In the metric below, the description component is weighted much more heavily than source and manager. Please note that the metric always contains one more value than the number of components specified, and that the first value always corresponds to time.

            default : [ 1,1,1,1000000],

This results in Situations containing Alerts with more similar description fields and a variety of source and manager fields.

Adding the following property ensures that Situations contain Alerts with very similar description fields, a variety of source fields but only a single distinct manager field.

            pre_partition : "manager"

pre_partition, like partition_by, defaults to false in moog_farmd.conf and so has no effect. If pre_partition is not required, there is no need to modify existing moog_farmd.conf files to include the property.

It is possible to configure pre_partition and partition_by at the same time, but the partition_by parameter will only have any effect if it is applied to a different component.

A note on time_compression and pre_partition

pre_partition splits the Events into separate streams based on the component you have specified, as opposed to partition_by, which allows the algorithms to work on the whole Event stream and then splits up the results.
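
The difference in ordering between the two options can be sketched as follows. This is a conceptual illustration only: `cluster` is a hypothetical placeholder for the real clustering step, and the event dictionaries are invented for the example.

```python
from collections import defaultdict


def split_by(events, field):
    """Group events by the value of one field, preserving arrival order."""
    groups = defaultdict(list)
    for e in events:
        groups[e[field]].append(e)
    return list(groups.values())


def cluster(events):
    """Placeholder for the clustering step: returns one cluster per call."""
    return [list(events)] if events else []


def with_pre_partition(events, field):
    # Split the stream first, then cluster each sub-stream on its own.
    clusters = []
    for stream in split_by(events, field):
        clusters.extend(cluster(stream))
    return clusters


def with_partition_by(events, field):
    # Cluster the whole stream, then split each cluster into sub-clusters.
    clusters = []
    for c in cluster(events):
        clusters.extend(split_by(c, field))
    return clusters


events = [
    {"manager": "Alan", "source": "h1"},
    {"manager": "Andrew", "source": "h2"},
    {"manager": "Alan", "source": "h3"},
]
print(len(with_pre_partition(events, "manager")))  # 2
print(len(with_partition_by(events, "manager")))   # 2
```

With the trivial placeholder both paths give the same grouping. The point of the sketch is what the clustering step sees: a reduced per-manager stream under pre_partition, and the whole stream under partition_by.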

Partitioning the Event stream using pre_partition can make time_compression less effective. Many of the tuning parameters and behaviours of the Sigalisers depend upon the event rate. If you have an event rate of X and you split the stream into many streams, each of those streams has an event rate of less than X. This can skew whether the tuning parameters you are using are appropriate, so you should be careful with or without time_compression. With time_compression you expect to avoid silent moments in the Event stream, but this may not be the case once pre_partition has split the stream.

For example, if you pre_partition on manager, set time_compression to true, and set window to 10 and resolution to 60, you will store up to 10 one-minute-wide buckets of Events for clustering.

The Events could arrive as follows:

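| Bucket | Minute | Manager |
|---|---|---|
| 1 | 1 | Andrew, Alan |
| 2 | 2 | Alan |
| 3 | 17 | Alan |
| 4 | 18 | Alan |
| 5 | 20 | Andrew |
| 6 | 35 | Alan |
| 7 | 37 | Alan |
| 8 | 38 | Alan |
| 9 | 57 | Alan |
| 10 | 59 | Alan |
| 11 | 60 | Alan |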

It should be noted that the minute 1 bucket will be dropped from the Sigaliser window because AIOps only keeps the last ten live buckets. Clustering for Events with Manager Alan will only use nine buckets, and clustering for Events with Manager Andrew will only use 1 bucket. 
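
The bucket counts in this example can be checked with a short simulation. It is a sketch under one assumption: with time_compression enabled, empty minutes do not occupy a bucket, so the window simply holds the last ten non-empty buckets. The variable names are invented for the example.

```python
WINDOW = 10  # number of live buckets kept

# (minute, managers seen in that one-minute bucket), as in the table above.
arrivals = [
    (1, {"Andrew", "Alan"}), (2, {"Alan"}), (17, {"Alan"}), (18, {"Alan"}),
    (20, {"Andrew"}), (35, {"Alan"}), (37, {"Alan"}), (38, {"Alan"}),
    (57, {"Alan"}), (59, {"Alan"}), (60, {"Alan"}),
]

# Only the most recent WINDOW buckets stay live; the minute 1 bucket drops out.
live = arrivals[-WINDOW:]

# Count how many live buckets each pre-partitioned stream can use.
buckets_per_manager = {}
for minute, managers in live:
    for m in managers:
        buckets_per_manager[m] = buckets_per_manager.get(m, 0) + 1

print(buckets_per_manager)  # {'Alan': 9, 'Andrew': 1}
```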

metric

            metric : {
                default : [ 1,1,1,1],
                categoryField: "agent",
                "DBMON" : [ 100,1,1000,1000000],
                "NETMON" : [ 1,100000000,1,0]
            },

The metric is a technical and detailed area of configuration that relates to how Moogsoft measures the distance between two events in the phase space used for clustering. Euclidean distance is easy to compute: calculate the square of the difference in each component, add them all up, and the result is the square of the distance (in two dimensions the distance is the hypotenuse of a right-angled triangle, in three dimensions it is the diagonal of a cuboid, and so on). This example is a simplification.

For instance, if you have x, y and z as the components of the vector separating two events, the square of the distance is:

            d^2 = x^2 + y^2 + z^2

You can put a number in front of each of these squares, and the values are more correctly known as the diagonal metric tensor values. Moogsoft assumes that you should only ever consider the diagonal metric tensor values; in general co-ordinate geometry, however, you can also contribute to the distance by adding in, for example, (y-z)^2. It is not considered useful to compare different attributes of an event for similarity.

This approach allows you to weight the distance between two events based upon their components. For example, suppose X represents time, Y represents source and Z represents manager, and you make a2 (the number in front of the Y term) much bigger than a1 (the number in front of the X term). Any distance in source then creates a lot more distance between the events than the same distance in time, which allows you to weight the importance of each component. This is why each metric in the example configuration has four values: one for time and one for each of the three components. The default is [1,1,1,1]. You can also select a categoryField, which is a parameter in the event, e.g., categoryField: "agent".

In the example configuration above, if one of the events has the value DBMON, the metric [100,1,1000,1000000] is used to weight the distance; if it has the value NETMON, the alternate metric [1,100000000,1,0] is used. If it has neither of these two values, the default metric is used. This allows configuration of different metric weightings for different sources of events.
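
The metric selection and weighted distance described above can be sketched as follows. This is an illustration under assumptions rather than the product's code: each event is assumed to already have a numeric position in phase space (time first, then one value per component), and `select_metric` and `weighted_distance` are hypothetical helper names. How Speedbird resolves two events with different categories is not stated here, so the sketch simply uses the first match.

```python
import math

# Mirrors the example configuration above.
metric = {
    "default": [1, 1, 1, 1],
    "categoryField": "agent",
    "DBMON": [100, 1, 1000, 1000000],
    "NETMON": [1, 100000000, 1, 0],
}


def select_metric(event_a, event_b, metric):
    """Use a category-specific metric if either event carries that category."""
    field = metric["categoryField"]
    for event in (event_a, event_b):
        weights = metric.get(event.get(field))
        if isinstance(weights, list):
            return weights
    return metric["default"]


def weighted_distance(pos_a, pos_b, weights):
    """Square root of the weighted sum of squared component differences."""
    return math.sqrt(sum(w * (a - b) ** 2
                         for w, a, b in zip(weights, pos_a, pos_b)))


# Hypothetical events that differ by one unit in time only.
a = {"agent": "DBMON", "pos": [0.0, 1.0, 2.0, 3.0]}
b = {"agent": "OTHER", "pos": [1.0, 1.0, 2.0, 3.0]}

weights = select_metric(a, b, metric)
print(weights)                                       # [100, 1, 1000, 1000000]
print(weighted_distance(a["pos"], b["pos"], weights))  # 10.0
```

The time weight of 100 turns a one-unit time difference into a distance of 10, which is the square-root relationship noted in the tuning guidelines.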

string_len_cutoff

This determines the maximum number of characters in a component to use in the distance calculation described in the previous section. This cutoff will apply to all string components being used.

For example, if there are occasionally very long descriptions, you can specify a 64-character cutoff which will avoid excessive computation. See example below:

            string_len_cutoff : 64

spread_cutoff

Whereas the sig_alert_horizon is used to take events out of clusters, spread_cutoff determines whether or not to consider a cluster to be worth processing.

            spread_cutoff : 5.0
  • 0 means all clusters have to be one hundred percent tight, so every member is the same distance from the center with no variation; otherwise, the cluster is discarded. A higher number allows for looser clusters, i.e. more variation within the cluster.

The spread cutoff uses the cluster standard deviation, calculated after any outliers have been pruned in accordance with the sig_alert_horizon parameter, to determine which clusters should be rejected. It is worth noting that the metrics chosen for weighting the components can have a direct impact on the standard deviation of the clusters generated, and it may be necessary to increase the spread_cutoff value to reflect this.
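
A minimal sketch of this filter is shown below. It assumes that the spread is the population standard deviation of member distances from the centroid, that clusters are lists of numeric tuples already pruned by sig_alert_horizon, and that a cluster is kept when its spread does not exceed the cutoff. The helper names are hypothetical.

```python
import math
from statistics import pstdev


def cluster_spread(points):
    """Standard deviation of member distances from the cluster centroid."""
    centroid = [sum(axis) / len(points) for axis in zip(*points)]
    return pstdev([math.dist(p, centroid) for p in points])


def keep_clusters(clusters, spread_cutoff):
    """Discard clusters whose spread exceeds spread_cutoff."""
    return [c for c in clusters if cluster_spread(c) <= spread_cutoff]


tight = [(0, 0), (2, 0), (0, 2), (2, 2)]    # every member equidistant
loose = [(0, 0), (1, 0), (0, 1), (50, 50)]  # one member far from the rest

print(keep_clusters([tight, loose], 5.0) == [tight])
print(keep_clusters([tight, loose], 100.0) == [tight, loose])
```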

ignore_case

Determines whether the translation of strings into numbers in ‘phase’ space is case sensitive when strings are compared. In general, case should be ignored. See below:

            ignore_case : true,
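
The effect of ignore_case, together with string_len_cutoff from the earlier section, can be pictured as a normalisation step applied to each string component before it is compared. How Speedbird actually translates strings into phase-space numbers is not described here, so the `normalise_component` helper below is purely a hypothetical sketch of the preprocessing these two options imply.

```python
def normalise_component(value, ignore_case=True, string_len_cutoff=None):
    """Truncate and optionally lower-case a string component."""
    text = str(value)
    if string_len_cutoff is not None:
        text = text[:string_len_cutoff]  # only the leading characters are used
    if ignore_case:
        text = text.lower()
    return text


print(normalise_component("DB-Server01"))                     # db-server01
print(normalise_component("DB-Server01", ignore_case=False))  # DB-Server01
print(len(normalise_component("x" * 200, string_len_cutoff=64)))  # 64
```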

iterations

Unless “Entropy” seeding is specified, the initial seeds for K-means clustering include a random element that leads to different solutions on different iterations. If more than one iteration is chosen, Speedbird selects the best of the returned solutions for Situation processing. With higher numbers of iterations, K-means clustering tends to converge on an optimal solution, which in turn leads to lower variance from one Speedbird run to another. Iterations, however, take both time and CPU resources, so a sensible compromise between speed and the optimal solution is needed.

            iterations : 5,
  • Moogsoft recommends a value of 5
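
The idea of running several randomly seeded iterations and keeping the best one can be sketched with a tiny K-means written from scratch. Two assumptions are made that this page does not confirm: "best" is taken to mean the lowest total squared distance of members to their nearest centroid, and plain random seeding is used. `kmeans_once` and `best_of` are hypothetical names.

```python
import math
import random


def kmeans_once(points, k, rng, max_steps=20):
    """One K-means run from random seeds; returns (inertia, centroids)."""
    centroids = rng.sample(points, k)
    for _ in range(max_steps):
        groups = [[] for _ in range(k)]
        for p in points:
            nearest = min(range(k), key=lambda i: math.dist(p, centroids[i]))
            groups[nearest].append(p)
        # Recompute each centroid; keep the old one if its group is empty.
        centroids = [
            tuple(sum(axis) / len(g) for axis in zip(*g)) if g else c
            for g, c in zip(groups, centroids)
        ]
    inertia = sum(min(math.dist(p, c) ** 2 for c in centroids) for p in points)
    return inertia, centroids


def best_of(points, k, iterations, seed=0):
    """Run K-means several times and keep the lowest-inertia solution."""
    rng = random.Random(seed)
    runs = [kmeans_once(points, k, rng) for _ in range(iterations)]
    return min(runs, key=lambda run: run[0])


points = [(0, 0), (0, 1), (1, 0), (10, 10), (10, 11), (11, 10),
          (20, 0), (21, 0), (20, 1)]
inertia, centroids = best_of(points, k=3, iterations=5)
print(round(inertia, 3), centroids)
```

More iterations can only match or improve on the first run's solution, at the cost of repeating the whole clustering calculation each time.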

seeding

Seeding can be set to 'Kmpp', 'Lloyd', or 'Entropy'. Both 'Kmpp' (recommended) and 'Lloyd' use random elements to select seeds to initialise the clustering process, and therefore have the advantage of finding different cluster solutions over multiple iterations.

Alternatively, 'Entropy' selects the highest entropy Events to seed clusters, and as such, returns the same results on each occasion. It should be noted that this is not necessarily an optimal result.

            seeding : "Kmpp", 
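
The contrast between deterministic and randomised seeding can be sketched as follows. The sketch assumes that 'Kmpp' refers to the well-known k-means++ seeding scheme, which this page does not state explicitly, and the event and point structures are invented for the example.

```python
import math
import random


def seed_entropy(events, k):
    """Deterministic: the k highest-entropy events become the seeds."""
    return sorted(events, key=lambda e: e["entropy"], reverse=True)[:k]


def seed_kmpp(points, k, rng):
    """k-means++ style seeding: later seeds favour points far from earlier ones."""
    seeds = [rng.choice(points)]
    while len(seeds) < k:
        # Weight each point by its squared distance to the nearest seed so far.
        weights = [min(math.dist(p, s) ** 2 for s in seeds) for p in points]
        if sum(weights) == 0:  # every point coincides with an existing seed
            seeds.append(rng.choice(points))
        else:
            seeds.append(rng.choices(points, weights=weights, k=1)[0])
    return seeds


events = [
    {"id": "a", "entropy": 0.2},
    {"id": "b", "entropy": 0.9},
    {"id": "c", "entropy": 0.5},
]
print([e["id"] for e in seed_entropy(events, 2)])  # ['b', 'c']

points = [(0, 0), (0, 1), (10, 10), (10, 11), (20, 0)]
print(seed_kmpp(points, 3, random.Random(1)))
```

Entropy seeding returns the same seeds on every call, whereas the k-means++ style seeds depend on the random generator, which is why repeated iterations can explore different solutions.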

force_causal

Setting force_causal to true ensures that Events which are part of causal Alerts are preserved. They are never discarded during the K-means clustering process, and are always returned as a member of a cluster.

The entropy range for causal alerts is defined in the moogdb.significance table.

            force_causal : true

generate_stats

generate_stats provides detailed logging useful for tuning purposes. Detailed logging is written at log level WARN to the moogfarmd.log file. The logging contains detailed information around event clustering, and also includes information about partitioning.

            generate_stats : "true"
  • If generate_stats is not required, there is no need to modify existing moog_farmd.conf files to include the property.

Tuning guidelines

To ensure you produce useful results, it is recommended that you read the following guidelines in conjunction with the descriptions of the configuration parameters:

  1. Start by disabling the parameters that remove Alerts from Situations and that discard poorly correlated Situations. Set sig_alert_horizon to -0.1 (to prevent any outliers from being pruned) and spread_cutoff to a high value (to prevent any clusters from being discarded). Subsequently modify these parameters to reduce Situation size and numbers.

  2. When tuning the system, consider using 'Entropy' seeding and only switching to 'Kmpp' when you are happy with the results. Unlike 'Kmpp' or 'Lloyd', 'Entropy' seeding always produces the same Situations, though often not the most appropriate ones. This means you can normally run a dataset once to see whether the parameters you have used have given you the desired effect. 'Kmpp' seeding usually produces the best Situations with a moderate number of iterations.

  3. The K in K-means is the number of seeds that AIOps clusters around and the number of Situations that are produced. It is calculated using a technique that analyses the dataset to establish the number of independent clusters of events. The calculation depends on the number of time slices (window) and the effective event rate (after entropy thresholds etc.), which determines the number of unique signatures received in resolution*window seconds.

    Please note: Moogsoft advises that you start by asking how many tickets/Situations are expected in a day, and then adjust the resolution/window parameters to achieve the same number of Situations in a day’s worth of data. The value of K is never greater than the window and the number of unique alerts in the total window, and is often about 80% of this value.

  4. If you are using time and one other component, be prepared to significantly reduce the time metric as well as increase the value of the metric for the other component. For example, assume that you have the following configuration and that you are interested in a series of events that occur over 10 minutes:

           components : ["source"] 
           metric : { 
                      default : [1,1000000] 
                    }

    The time spread of the cluster you are interested in is 600 seconds. If you have increased the metric a lot on the source component, your cluster may contain a single value for source (or very closely related values). Therefore, the cluster spread value will be generated largely or entirely by the event time component. K-means solutions that split this set of events into more than one cluster are preferred over those that keep them in a single cluster. If you use the default metrics [1,1], the clustering will be driven primarily by time.

  5. The metrics that you use may affect the spread_cutoff. If you increase a metric it may be necessary to increase the spread cut-off by quite a large amount (up to the square root of the increase of the metric). 

  6. It is the square root of the metric that is applied to a component. If you increase a metric for a component from 1 to 100, you emphasise the effect of that component on the resulting clusters by a factor of 10. 

  7. Do not vary more than one configuration parameter at a time.

  8. Start with small data sets and a limited number of components (i.e., time plus one other) before increasing the size and/or complexity of your solution.
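
The square-root relationship in guidelines 5 and 6 above can be worked through with a short calculation. The `emphasis_factor` helper is a hypothetical name used only for this illustration.

```python
import math


def emphasis_factor(old_metric, new_metric):
    """Change in a component's contribution to distance when its metric changes."""
    return math.sqrt(new_metric / old_metric)


print(emphasis_factor(1, 100))      # 10.0
print(emphasis_factor(1, 1000000))  # 1000.0
```

Raising a metric from 1 to 1000000 therefore emphasises that component by a factor of 1000, and by guideline 5 the spread_cutoff may need to rise by up to the same factor.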

