## Introduction

Speedbird groups events related to an actionable outage into clusters of their related Alerts. These clusters are service impacting, with the group of ‘clustered’ Alerts providing operational value to someone using the system.

Speedbird allows you to configure a set of Event parameters that drive the clustering in addition to time. For example, you may want to group Alerts and Events that coincide in time and that also share another Event value, such as the hostname. Speedbird allows you to create clusters of Alerts with a similar hostname that have also occurred at a similar time.

## Speedbird's Algorithm

The algorithmic technique used by Speedbird is based on K-means, a well-understood, traditional clustering algorithm that is a form of unsupervised machine learning. For the Speedbird Moolet, Moogsoft AIOps uses some of the same algorithmic tool chain that is used in the Sigaliser, along with the K-means algorithm. For instance, AIOps still uses the same time-based determination of how many real clusters there are in the data at a given point in time. Non-negative matrix factorisation in the limit collapses into a K-means calculation, but is more computationally efficient.

## Configuration

To configure Speedbird for optimal results, read the following parameter descriptions in conjunction with the Tuning guidelines.

#### sig_resolution

In **moog_farmd.conf**, there is a general `sig_resolution` parameter grouping before the Moolet definitions, with the following parameters:

    sig_resolution : { alert_threshold : 1, sig_similarity_limit : 0.7 },

- These parameters apply to all Sigalisers running in a given farmd, whether Speedbird or the traditional Sigaliser. The `sig_resolution` parameters control how a Situation is compared with pre-existing Situations to determine whether it is an evolution of an existing Situation or a new Situation.

#### Moolet and Algorithm

The parameter groups Moolet and Algorithm function in the same way as those in the existing Sigaliser.

    # Moolet
    name              : "Speedbird",
    classname         : "CSpeedbird",
    run_on_startup    : false,
    process_output_of : "AlertBuilder",

    # Algorithm
    time_compression  : true,
    scale_by_severity : true,
    entropy_threshold : 0.35,

For further information on these parameters, see the table below:

Parameter | Input | Description | Example |
---|---|---|---|
name | - | The name of the Sigaliser. | Speedbird |
classname | - | The classname of the Sigaliser. This is hardcoded and should never be changed. | CSpeedbird |
run_on_startup | Boolean | If enabled, an instance of the Sigaliser is created when moog_farmd starts. This is disabled by default. | false |
process_output_of | AlertBuilder | Sets whether the Sigaliser processes the output of the Alert Builder or the Alert Rules Engine. The latter can only be used if automations are desired prior to the Situation resolution. | AlertBuilder |
time_compression | Boolean | If enabled, the Sigaliser ignores empty time buckets. If disabled, it includes empty time buckets. | true |
scale_by_severity | Boolean | If enabled, high severity Alerts are treated as having higher entropy. The scaling factor is the Alert's severity number divided by the maximum severity (5). | true |
entropy_threshold | Decimal | The minimum entropy that an Alert must have to be included in the Sigaliser calculation. Any Alert that arrives at the Sigaliser with a lower entropy than this value is not included in Situations. | 0.35 |

#### sig_alert_horizon

The `sig_alert_horizon` parameter allows you to prune clusters. Its value controls when outlying Events are removed from a cluster:

- If the value is less than 0.0, no pruning is undertaken.
- At 0.0, members that are further than one standard deviation from the centroid of the cluster are eliminated.
- At more than 0.0, the standard deviation is multiplied by `sig_alert_horizon`, and members further than `mean + sig_alert_horizon*std_dev` from the centroid are eliminated.

For example:

    sig_alert_horizon : 0.0,

Every cluster has a centroid, which is the average point in the middle of the cluster.

In the diagram above there are three points in a defined cluster (X), and the centroid (C), which is not a real point in this phase space but represents the center of the cluster. You compute the distance of each point from the centroid, which gives an average distance and a standard deviation. The standard deviation indicates the spread of the cluster. A low standard deviation, e.g., 0, means all of the points are the same distance from the centroid, whereas a high standard deviation means the points are at highly variable distances from the centroid, indicating a random cluster.
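The centroid, spread and pruning calculation described in this section can be sketched in Python. This is a minimal illustration and not Speedbird's implementation: the function names `cluster_spread` and `prune_cluster`, the use of plain numeric coordinates, and the population standard deviation are all assumptions made for the example. The pruning limit follows the `mean + sig_alert_horizon*std_dev` rule quoted for values above 0.0.

```python
import math
import statistics

def cluster_spread(points):
    """Return (centroid, distances, mean distance, std deviation) for a cluster."""
    dims = len(points[0])
    centroid = tuple(sum(p[d] for p in points) / len(points) for d in range(dims))
    distances = [math.dist(p, centroid) for p in points]
    return centroid, distances, statistics.mean(distances), statistics.pstdev(distances)

def prune_cluster(points, horizon):
    """Drop outlying members of a cluster (hypothetical helper).

    A negative horizon disables pruning. Otherwise members further than
    mean + horizon * std_dev from the centroid are removed. Applying the
    same formula at exactly 0.0 is an assumption of this sketch.
    """
    if horizon < 0.0:
        return list(points)
    _, distances, mean, std_dev = cluster_spread(points)
    limit = mean + horizon * std_dev
    return [p for p, d in zip(points, distances) if d <= limit]

cluster = [(0, 0), (2, 0), (0, 2), (2, 2), (10, 10)]
print(prune_cluster(cluster, -0.1))  # nothing is pruned
print(prune_cluster(cluster, 1.0))   # the distant point (10, 10) is removed
```

Starting from a negative horizon and raising it gradually makes it easy to see which members each setting would remove.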

#### components

You can choose which parameters of an Event are used by the clustering algorithm. In the following example, "source", "source_id" and "description" are declared:

    components : [ "source","source_id","description" ]

Additionally, the system always takes into consideration the time that the Event arrives in the system (`event_time`, or `last_occurred` for an Alert). You can have as many components as you like, but the more components you select, the greater the numerical complexity introduced into the system, and the greater the chance that you will get fewer Alerts per cluster and less correlation.

#### Partitioning

There are two methods of partitioning the data into Situations. The first is `partition_by`, which splits the clusters according to the parameters specified. The second is `pre_partition`, which splits the incoming Event stream before clustering.

**Please note**: Pre-partitioning is recommended as it does not interfere with the results of the clustering algorithms.

**partition_by**

After clustering has taken place, and before merging and resolution, you can split clusters into sub-clusters based on a component of the Events. For example, you can use the `manager` parameter to ensure the Situations only contain Events from the same manager. In general, and by default, you should comment out the `partition_by` parameter.

**pre_partition**

An alternative way of partitioning is to use `pre_partition`, which allows you to specify a component field (from the list of specified components) around which the Event stream is partitioned before the K-means clustering occurs. The Alerts in each resulting Situation all contain a single value for the chosen component field.

For example, suppose the Speedbird `components` option is set to:

    components : [ "source","manager","description" ],

In the `metric` below, the description component is weighted far more heavily than source and manager. Please note that the metric always contains one more value than the number of components specified, and that the first value always corresponds to time.

    default : [ 1,1,1,1000000],

This results in Situations containing Alerts with more similar `description` fields and a variety of `source` and `manager` fields.

Adding the following property ensures that Situations contain Alerts with very similar `description` fields and a variety of `source` fields, but only a single distinct `manager` field.

    pre_partition : "manager"

`pre_partition`, like `partition_by`, defaults to false in **moog_farmd.conf**, so it has no effect. If `pre_partition` is not required, there is no need to modify the existing **moog_farmd.conf** files to include the property.

It is possible to configure `pre_partition` and `partition_by` at the same time, but the `partition_by` parameter only has an effect if it is applied to a different component.

##### A note on time_compression and pre_partition

`pre_partition` splits the Events into separate streams based on the component you have specified, as opposed to `partition_by`, which allows the algorithms to work on the whole Event stream and then splits up the results.
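The difference in ordering between the two options can be sketched as follows. This is an illustration only: the function names are hypothetical, Events are modelled as plain dictionaries, and `one_cluster` is a stand-in for the real clustering step, which is not reproduced here.

```python
from collections import defaultdict

def split_by_field(events, field):
    """Group events (dicts) by the value of one field, preserving arrival order."""
    groups = defaultdict(list)
    for event in events:
        groups[event[field]].append(event)
    return dict(groups)

def pre_partition_then_cluster(events, field, cluster_fn):
    """pre_partition style: split the stream first, then cluster each stream."""
    clusters = []
    for stream in split_by_field(events, field).values():
        clusters.extend(cluster_fn(stream))
    return clusters

def cluster_then_partition_by(events, field, cluster_fn):
    """partition_by style: cluster the whole stream, then split each cluster."""
    clusters = []
    for cluster in cluster_fn(events):
        clusters.extend(split_by_field(cluster, field).values())
    return clusters

def one_cluster(events):
    """Stand-in clusterer (assumption): puts everything it is given into one cluster."""
    return [list(events)]

events = [
    {"manager": "Alan", "source": "host1"},
    {"manager": "Andrew", "source": "host1"},
    {"manager": "Alan", "source": "host2"},
]
for cluster in pre_partition_then_cluster(events, "manager", one_cluster):
    print([e["manager"] for e in cluster])
```

With the trivial stand-in clusterer both orderings give the same groups. With a real clusterer they can differ, because in the first case the algorithm only ever sees one partition's Events at a time.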

Partitioning the Event stream using `pre_partition` can make `time_compression` less effective. Many of the tuning parameters and behaviours of the Sigalisers depend upon the event rate. If you have an event rate of X and you split the stream into many streams, each of those streams has an event rate of less than X. This can change whether the tuning parameters you are using are appropriate, so you should be careful with or without `time_compression`. With `time_compression`, you expect to avoid silent moments in the Event stream, but this may not be the case because `pre_partition` splits the stream.

For example, if you `pre_partition` on `manager`, set `time_compression` to `true`, and set `window` to `10` and `resolution` to `60`, you will store up to 10 one-minute-wide buckets of Events for clustering.

The Events could arrive as follows:

Bucket | Minute | Manager |
---|---|---|
1 | 1 | Andrew, Alan |
2 | 2 | Alan |
3 | 17 | Alan |
4 | 18 | Alan |
5 | 20 | Andrew |
6 | 35 | Alan |
7 | 37 | Alan |
8 | 38 | Alan |
9 | 57 | Alan |
10 | 59 | Alan |
11 | 60 | Alan |

Note that the minute 1 bucket is dropped from the Sigaliser window because AIOps only keeps the last ten live buckets. Clustering for Events with manager Alan uses only nine buckets, and clustering for Events with manager Andrew uses only one bucket.
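The bucket counts in this example can be reproduced with a short Python sketch. The helper names `live_buckets` and `buckets_per_manager` are hypothetical, and the sketch only models the `time_compression : true` case, in which empty buckets do not count towards the window.

```python
def live_buckets(arrivals, window):
    """Return the minutes of the buckets kept in the window, oldest first.

    arrivals maps a bucket's minute to the set of managers seen in it.
    Only non-empty buckets count towards the window (time_compression on).
    """
    non_empty = sorted(minute for minute, managers in arrivals.items() if managers)
    return non_empty[-window:]

def buckets_per_manager(arrivals, window):
    """Count how many of the kept buckets contain Events for each manager."""
    counts = {}
    for minute in live_buckets(arrivals, window):
        for manager in arrivals[minute]:
            counts[manager] = counts.get(manager, 0) + 1
    return counts

# Arrival pattern taken from the table above.
arrivals = {
    1: {"Andrew", "Alan"}, 2: {"Alan"}, 17: {"Alan"}, 18: {"Alan"},
    20: {"Andrew"}, 35: {"Alan"}, 37: {"Alan"}, 38: {"Alan"},
    57: {"Alan"}, 59: {"Alan"}, 60: {"Alan"},
}
print(live_buckets(arrivals, window=10))
print(buckets_per_manager(arrivals, window=10))
```

Eleven buckets are non-empty, so the oldest one (minute 1) falls outside a window of 10. That leaves nine buckets for Alan and one for Andrew.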

**metric**

    metric : {
        default       : [ 1,1,1,1],
        categoryField : "agent",
        "DBMON"       : [ 100,1,1000,1000000],
        "NETMON"      : [ 1,100000000,1,0]
    },

The metric is a technical and detailed area of configuration that relates to how Moogsoft measures the distance between two events in the phase space used for clustering. Euclidean distance is easy to compute: you calculate the square of the difference in each component and add them all up, which gives the square of the distance. (In two dimensions the distance is the hypotenuse of a right-angled triangle, in three dimensions it is the diagonal of a cuboid, and so on.) This example is a simplification.

For instance, if x, y and z are the components of the vector separating two events, the square of the distance is:

$$d^2 = x^2 + y^2 + z^2$$

You can put a number in front of each of these squared terms:

$$d^2 = a_1 x^2 + a_2 y^2 + a_3 z^2$$

The values a1, a2 and a3 are more correctly known as the diagonal metric tensor values. Moogsoft assumes that you should only ever consider the diagonal metric tensor values. In general co-ordinate geometry you can also contribute to the distance by adding in, for example, *(y-z)²*, but it is not considered useful to compare different attributes of an event for similarity.

This approach allows you to weight the distance between two events based upon their components. For example, suppose X represents time, Y represents source and Z represents manager, and you make a2 (the weight on Y) much bigger than a1 (the weight on X). Any distance in source then creates a lot more distance between the events than the same distance in time, which allows you to weight the importance of each component. This is why there are four values in each of the metrics in the example configuration: one for time plus one for each component. The default is [1,1,1,1]. You can also select a `categoryField`, which is a parameter in the event, e.g., `categoryField: "agent"`.

In the example configuration above, if one of the events has a value DBMON, then you use the metric [100,1,1000,1000000] to weight the distance; otherwise, if NETMON, you use the alternate metric [1,100000000,1,0]. If you have neither of these two values, you use default. This allows configuration of different metric weightings for different sources of events.
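The weighting and category selection can be sketched in Python. This is an illustration under several assumptions: the dictionary layout of `METRIC`, the helper names `select_weights` and `weighted_distance`, and the idea that each Event has already been translated into numeric phase-space coordinates are all choices made for the example. How Speedbird maps strings to numbers, and how it picks a metric when two events belong to different categories, is not covered here.

```python
import math

# Mirrors the example configuration above; this layout is an assumption.
METRIC = {
    "default": [1, 1, 1, 1],
    "categoryField": "agent",
    "DBMON": [100, 1, 1000, 1000000],
    "NETMON": [1, 100000000, 1, 0],
}

def select_weights(event, metric=METRIC):
    """Pick the weights for an event from the value of its category field."""
    category = event.get(metric["categoryField"])
    # Only real categories are eligible; the two control keys are excluded.
    per_category = {
        key: weights for key, weights in metric.items()
        if key not in ("default", "categoryField")
    }
    return per_category.get(category, metric["default"])

def weighted_distance(a, b, weights):
    """Diagonal-metric distance: sqrt(sum(w_i * (a_i - b_i)^2))."""
    return math.sqrt(sum(w * (x - y) ** 2 for w, x, y in zip(weights, a, b)))

# Hypothetical coordinates in the order (time, component 1, component 2, component 3).
a = (0.0, 0.0, 0.0, 0.0)
b = (3.0, 4.0, 0.0, 0.0)
print(weighted_distance(a, b, select_weights({"agent": "OTHER"})))
print(weighted_distance(a, b, select_weights({"agent": "DBMON"})))
```

With the default weights the two points are 5 units apart. With the DBMON weights the same difference in time counts for far more, because the time weight is 100.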

#### string_len_cutoff

This determines the maximum number of characters in a component to use in the distance calculation described in the previous section. This cutoff will apply to all string components being used.

For example, if there are occasionally very long descriptions, you can specify a 64-character cutoff which will avoid excessive computation. See example below:

    string_len_cutoff : 64

#### spread_cutoff

Whereas `sig_alert_horizon` is used to take events out of clusters, `spread_cutoff` determines whether or not a cluster is considered worth processing.

    spread_cutoff : 5.0

- 0 means all clusters have to be one hundred percent tight, i.e., every member is the same distance from the center with no variation; otherwise, the cluster is discarded. A higher number allows for looser clusters, i.e., more variation within the cluster.

The spread cutoff uses the cluster standard deviation, calculated after any outliers have been pruned in accordance with the `sig_alert_horizon` parameter, to determine which clusters should be rejected. 0.0 means that all clusters have to be one hundred percent tight, i.e., with all members matching the cluster centroid. A higher number allows for more loosely correlated clusters. It is worth noting that the metrics chosen for weighting the components can have a direct impact on the standard deviation of the clusters generated, and it may be necessary to increase the `spread_cutoff` value to reflect this.
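As a rough sketch of the rejection test, assuming the spread is the population standard deviation of member distances from the centroid and that a cluster is kept when its spread does not exceed the cutoff (the helper names and the exact comparison are assumptions, not confirmed product behaviour):

```python
import math
import statistics

def cluster_std_dev(points):
    """Standard deviation of the members' distances from the cluster centroid."""
    dims = len(points[0])
    centroid = [sum(p[d] for p in points) / len(points) for d in range(dims)]
    return statistics.pstdev([math.dist(p, centroid) for p in points])

def keep_cluster(points, spread_cutoff):
    """Keep a cluster only if its spread does not exceed spread_cutoff."""
    return cluster_std_dev(points) <= spread_cutoff

tight = [(0, 1), (1, 0), (0, -1), (-1, 0)]          # every member is 1 unit from the centroid
loose = [(0, 0), (2, 0), (0, 2), (2, 2), (10, 10)]  # one member is far from the rest

print(keep_cluster(tight, 0.0))  # survives even the strictest cutoff
print(keep_cluster(loose, 1.0))  # too spread out for a cutoff of 1.0
print(keep_cluster(loose, 5.0))  # accepted once the cutoff is raised
```

Because the distances depend on the metric weights, the same cluster can pass or fail the cutoff after a metric change, which is why the two settings are best tuned together.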

#### ignore_case

Determines whether the translation of strings into numbers in phase space is case sensitive when strings are compared. In general, case should be ignored. See below:

    ignore_case : true,

#### iterations

Unless 'Entropy' seeding is specified, the initial seeds for K-means clustering include a random element that leads to different solutions on different iterations. If more than one iteration is chosen, Speedbird selects the best of the solutions returned for Situation processing. With higher numbers of iterations, K-means clustering tends to converge on an optimal solution, which in turn leads to lower variance from one Speedbird run to another. Iterations, however, take both time and CPU resources, so a sensible compromise between speed and the optimal solution is needed.

    iterations : 5,

- Moogsoft recommends a value of 5.
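The idea of running several randomly seeded iterations and keeping the best one can be sketched as below. This is a generic K-means illustration, not Speedbird's code: the function names are hypothetical, and "best" is assumed here to mean the lowest total squared distance from each point to its nearest centroid, which the text above does not specify.

```python
import math
import random

def kmeans_once(points, k, rng, max_steps=50):
    """One K-means run from randomly chosen seeds; returns (inertia, centroids)."""
    centroids = rng.sample(points, k)
    for _ in range(max_steps):
        # Assign every point to its nearest centroid.
        groups = [[] for _ in range(k)]
        for p in points:
            nearest = min(range(k), key=lambda i: math.dist(p, centroids[i]))
            groups[nearest].append(p)
        # Move each centroid to the mean of its group (unchanged if the group is empty).
        updated = [
            tuple(sum(c) / len(g) for c in zip(*g)) if g else centroids[i]
            for i, g in enumerate(groups)
        ]
        if updated == centroids:
            break
        centroids = updated
    inertia = sum(min(math.dist(p, c) ** 2 for c in centroids) for p in points)
    return inertia, centroids

def best_of(points, k, iterations, seed=0):
    """Run K-means `iterations` times and keep the run with the lowest inertia."""
    rng = random.Random(seed)
    runs = (kmeans_once(points, k, rng) for _ in range(iterations))
    return min(runs, key=lambda run: run[0])

points = [(0, 0), (0, 1), (1, 0), (10, 10), (10, 11), (11, 10)]
inertia, centroids = best_of(points, k=2, iterations=5)
print(inertia, sorted(centroids))
```

Each extra iteration costs one more full clustering pass, which is the time and CPU trade-off described above.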

#### seeding

Seeding can be set to 'Kmpp', 'Lloyd', or 'Entropy'. Both 'Kmpp' (recommended) and 'Lloyd' use random elements to select seeds to initialise the clustering process, and therefore have the advantage of finding different cluster solutions over multiple iterations.

Alternatively, 'Entropy' selects the highest entropy Events to seed clusters and, as such, returns the same results on each occasion. It should be noted that this is not necessarily an optimal result.

    seeding : "Kmpp",
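The contrast between deterministic and randomised seeding can be sketched as follows. Two assumptions are made loudly here: that 'Kmpp' refers to k-means++ style seeding, and that each Event carries a numeric `entropy` value. Neither the function names nor the data layout come from the product.

```python
import math
import random

def entropy_seeds(events, k):
    """Deterministic seeding: the k events with the highest entropy."""
    return sorted(events, key=lambda e: e["entropy"], reverse=True)[:k]

def kmpp_seeds(points, k, rng):
    """k-means++ style seeding: later seeds favour points far from earlier seeds."""
    seeds = [rng.choice(points)]
    while len(seeds) < k:
        # Squared distance from each point to its nearest existing seed.
        weights = [min(math.dist(p, s) ** 2 for s in seeds) for p in points]
        if sum(weights) == 0:
            break  # every point coincides with an existing seed
        seeds.append(rng.choices(points, weights=weights, k=1)[0])
    return seeds

events = [
    {"id": "a", "entropy": 0.4},
    {"id": "b", "entropy": 0.9},
    {"id": "c", "entropy": 0.7},
]
print([e["id"] for e in entropy_seeds(events, 2)])  # same answer on every run

points = [(0, 0), (0, 1), (10, 10), (10, 11)]
print(kmpp_seeds(points, 2, random.Random()))       # may differ from run to run
```

The deterministic variant is convenient while tuning because repeated runs are directly comparable. The randomised variant explores different starting points, which is what makes multiple iterations worthwhile.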

#### force_causal

Setting `force_causal` to true ensures that Events which are part of causal Alerts are preserved. They are never discarded during the K-means clustering process, but are always returned as a member of a cluster.

The entropy range for causal alerts is defined in the `moogdb.significance` table.

    force_causal : true

#### generate_stats

`generate_stats` provides detailed logging useful for tuning purposes. Detailed logging is written at log level WARN to the **moogfarmd.log** file. The logging contains detailed information about event clustering, and also includes information about partitioning.

    generate_stats : "true"

- If `generate_stats` is not required, there is no need to modify existing **moog_farmd.conf** files to include the property.

## Tuning guidelines

To ensure you produce useful results, it is recommended that you read the following in conjunction with the description of the configuration parameters:

- Disable parameters which remove Alerts from Situations and discard Situations which are poorly correlated. Start with `sig_alert_horizon` set to -0.1 (to prevent any outliers from being pruned) and `spread_cutoff` set to a high value (to prevent any clusters from being discarded). Subsequently modify these parameters to reduce Situation size and numbers.
- When tuning the system, consider using 'Entropy' seeding and only switching to 'Kmpp' when you are happy with the results. 'Entropy' seeding always produces the same Situations, unlike 'Kmpp' or 'Lloyd', but often not the most appropriate ones. Because 'Entropy' seeding is repeatable, you can normally run a dataset once to see whether the parameters you have used have given you the desired effect. 'Kmpp' seeding usually produces the best Situations with a moderate number of iterations.
- The K in K-means indicates the number of seeds AIOps clusters around and the number of Situations which are produced. It is calculated using a technique that analyses the dataset to establish the number of independent clusters of events. The calculation is dependent on the number of time slices (`window`) and the effective event rate (after entropy thresholds etc.), which determines the number of unique signatures received in `resolution*window` seconds.

  **Please note**: Moogsoft advises that you start by asking how many tickets/Situations are expected in a day, and adjust the `resolution`/`window` parameters to achieve the same number of Situations in a day's worth of data. The value of K is never greater than the `window` and the number of unique alerts in the total window, and is often about 80% of this value.
- If you are using time and one other component, be prepared to significantly reduce the time metric as well as increase the value of the metric for the other component. For example, assume that you have the following configuration and that you are interested in a series of events that occur over 10 minutes:

      components : ["source"]
      metric : { default : [1,1000000] }

The time spread of the cluster you are interested in is 600 seconds. If you have increased the metric a lot on the source component, your cluster may contain a single value for source (or very closely related values). Therefore, the cluster spread value will be generated largely or entirely by the event time component. K-means solutions that split this set of events into more than one cluster are preferred over those that keep them in a single cluster. If you use the default metrics [1,1], the clustering will be driven primarily by time.

- The metrics that you use may affect the `spread_cutoff`. If you increase a metric, it may be necessary to increase the spread cutoff by quite a large amount (up to the square root of the increase of the metric).
- It is the square root of the metric that is applied to a component. If you increase a metric for a component from 1 to 100, you emphasise the effect of that component on the resulting clusters by a factor of 10.
- Do not vary more than one configuration parameter at a time.
- Start with small data sets and a limited number of components (i.e., time plus one other) before increasing the size and/or complexity of your solution.
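The factor of 10 mentioned above follows from how a weight enters the distance. As a worked example, assume a single component with a difference of $\Delta$ between two events and a metric weight of $a$ (symbols chosen here for illustration):

$$d = \sqrt{a\,\Delta^2} = \sqrt{a}\,|\Delta|, \qquad \frac{d_{a=100}}{d_{a=1}} = \frac{\sqrt{100}\,|\Delta|}{\sqrt{1}\,|\Delta|} = 10$$

Raising a weight by a factor of 100 therefore stretches distances along that component by a factor of 10. The same reasoning explains why `spread_cutoff` may need to grow by up to the square root of the increase.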