# Events Analyser

• Stream and Partition-based Analysis

• Natural Language Processing Analysis

• Tokenization of Text

• Token Type Identification

• Token Masking

• Language Processing Techniques

• Priority Words

• Token Variation Threshold

The Events Analyser utility is a standalone process. It uses Natural Language Processing (NLP) techniques to analyze inbound event data. The Events Analyser divides text fields within the events into tokens. Based on the frequency of these tokens appearing in other events, it assigns an entropy value to the tokens and to the alerts in Moogsoft Enterprise. See the Entropy Overview for more information on how Moogsoft Enterprise evaluates entropy and uses entropy thresholds to reduce the level of 'noise' from incoming event data.

## Stream and partition-based analysis

You can configure Moogsoft Enterprise so that the Events Analyser calculates entropy values separately for events from different streams, since those streams have no relationship with each other.

You can also configure the Events Analyser so that it calculates the entropy values for events in different partitions. For example, you may want to run separate entropy calculations for different regions. In this case, you should specify the alert field that identifies the region in the partition_by field in the Events Analyser configuration file. In this type of configuration, the same token can be given multiple entropy values within the same Moogfarmd deployment, based on its frequency in the events within each partition. You can set up different configuration options for the different partitions. For example, IP addresses may be masked in one partition while that is unnecessary in another. In general, if a deployment uses the "pre-partition" method in Moogfarmd, that deployment benefits from partition-based entropy calculations.
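As a hedged sketch of the idea only: the exact file layout and option names vary by release, and the field name region below is purely illustrative. A partition-by-region setup points partition_by at the alert field that identifies the region:

```
{
    // Hypothetical fragment of the Events Analyser configuration file.
    // "region" is an example alert field name, not a product default.
    partition_by: "region"
}
```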

See Multiple Streams and Partitions for more information on running the Events Analyser with different streams and partitions. See Configure Events Analyser for further information on non-partitioned and partitioned configurations.

## Natural language processing analysis

The Events Analyser utility performs a number of linguistic analyses on events. It then uses this linguistic analysis to calculate an entropy value for each token and then for every alert. See the Entropy Overview for more information.
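As an informal illustration of the idea, and not Moogsoft's actual formula, token entropy can be modelled as normalised surprisal: the rarer a token is across the observed events, the higher its entropy, and an alert's entropy can then be summarised from the entropies of its tokens:

```python
import math
from collections import Counter

def token_entropies(events):
    """Illustrative entropy per token: rarer tokens carry more information.

    A generic information-theoretic sketch, not the product's exact formula.
    """
    counts = Counter(tok for event in events for tok in event.split())
    total = sum(counts.values())
    # Normalised surprisal: -log(p) scaled into the range 0..1.
    max_surprisal = math.log(total) if total > 1 else 1.0
    return {tok: min(1.0, -math.log(n / total) / max_surprisal)
            for tok, n in counts.items()}

def alert_entropy(event, entropies):
    """Average token entropy for one event's text (sketch only)."""
    toks = event.split()
    return sum(entropies[t] for t in toks) / len(toks)
```

In this sketch a token that appears once gets an entropy of 1, while tokens repeated across many events score progressively lower.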

## Tokenization of text

The Events Analyser splits a text string into blocks at word boundaries, such as spaces or punctuation marks. Each block of text is known as a token. For example, the following description has five tokens:

Link down on port 2/32
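A minimal sketch of the whitespace part of this split. The real analyser also handles punctuation boundaries, with the exceptions described in the next section; here '2/32' survives as a single token simply because it contains no spaces:

```python
import re

def tokenize(text):
    # Split at runs of whitespace; punctuation that is internal to a
    # token, such as the '/' in '2/32', is left intact at this stage.
    return [tok for tok in re.split(r"\s+", text.strip()) if tok]
```

Applied to the description above, tokenize returns the five tokens 'Link', 'down', 'on', 'port' and '2/32'.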

## Token type identification

Commonly used word boundaries are often integral to the meaning of a token, for example, dots in IPV4 addresses. The Events Analyser identifies complete tokens of the following types within the structure of an event:

• IP addresses:

  • v4

  • v6

• MAC addresses

• OIDs

• Dates: Most standard formats.

• Numbers:

  • Integers

  • Real numbers

  • With and without unit suffixes, for example, 99%, 12kb, 345ms.

• File paths:

  • Forward slashes

  • Backward slashes

• GUIDs

• Hexadecimal numbers: With the 0x prefix.

• URLs

• Email addresses: Most standard formats.

Identifying token types in arbitrary text is not an exact science, so the algorithms may occasionally assign a token a type that seems incorrect to a human reader.

After the Events Analyser has identified the token types, it can use them for masking and to identify tokens with high variation in a given alert.
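One way to picture this step is as an ordered set of pattern checks where the first match wins. The patterns below are deliberately simplified stand-ins for illustration, not the analyser's real rules, and they cover only a few of the types listed above:

```python
import re

# Ordered (type, pattern) pairs: first match wins. These patterns are
# simplified illustrations, far looser than a production rule set.
TOKEN_TYPES = [
    ("ipv4",   re.compile(r"^(?:\d{1,3}\.){3}\d{1,3}$")),
    ("mac",    re.compile(r"^(?:[0-9A-Fa-f]{2}:){5}[0-9A-Fa-f]{2}$")),
    ("hex",    re.compile(r"^0x[0-9A-Fa-f]+$")),
    ("path",   re.compile(r"^(/[^/\s]+)+/?$")),
    ("number", re.compile(r"^\d+(?:\.\d+)?(?:%|kb|ms|s)?$", re.IGNORECASE)),
    ("word",   re.compile(r"^\w+$")),
]

def token_type(token):
    """Return the first matching type name, or 'other'."""
    for name, pattern in TOKEN_TYPES:
        if pattern.match(token):
            return name
    return "other"
```

Ordering matters: a dotted quad must be tested as an IP address before any looser number rule gets a chance to claim it.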

## Token masking

Tokens that change between events for the same alert can cause that alert to be assigned an incorrectly high entropy value. The most obvious example involves dates and times. If the description of an event is to be analyzed but each event contains a different timestamp, that timestamp will have a high entropy and skew the entropy for that alert as a whole. For other token types that change frequently, such as URLs or IP addresses, it may be desirable to retain the higher entropy associated with that token type because the changing value is significant.

You can configure the Events Analyser to include or exclude specific token types in the entropy analysis for each event partition.

You should consider masking dates, times and numbers from the entropy calculation.
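A sketch of the effect, using hypothetical patterns for three maskable types: masking replaces the varying token with a fixed placeholder, so events that differ only in their timestamps produce identical token streams for the entropy calculation.

```python
import re

# Token types to mask out of the entropy calculation. These patterns are
# illustrative assumptions, not the product's actual masking rules.
MASKED_PATTERNS = {
    "date":   re.compile(r"^\d{4}-\d{2}-\d{2}$"),
    "time":   re.compile(r"^\d{2}:\d{2}(?::\d{2})?$"),
    "number": re.compile(r"^\d+(?:\.\d+)?$"),
}

def mask_tokens(tokens):
    """Replace maskable token types with a placeholder so every event in
    the alert yields the same token stream for the entropy calculation."""
    out = []
    for tok in tokens:
        for name, pattern in MASKED_PATTERNS.items():
            if pattern.match(tok):
                out.append(f"<{name}>")
                break
        else:
            out.append(tok)
    return out
```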

## Language processing techniques

The Events Analyser uses many standard techniques in language processing:

• Case folding

  • Tokens that differ only by case, for example, 'WORD', 'Word' or 'word', are converted to the same case and considered equal.

  • Case folding is applied to all token types.

• Stop words

  • You can add common or meaningless words, such as 'a', 'be', 'not', to a stop words file so that they are removed from the entropy calculation.

  • You can define a universal 'length' parameter so that any word at or below a certain length is treated as a stop word. For example, if set to '2', any words of one or two characters are ignored.

  • Stop words are applied to all token types.

• Stemming

  • A technique used to reduce a word to its root, removing plurals and different verb tenses. Words with the same root are considered equal.

  • Note that some words look unusual when stemmed. For example, 'priority', 'priorities' and 'prioritize' are all stemmed to 'priorit'.

  • If stemming is enabled, the stemmed form is stored in the reference database.

  • Stemming is only applied to tokens of type 'word'; it is not applied to numbers, GUIDs, IP addresses, and so on.
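A toy pipeline showing the three techniques applied in order. The crude_stem function is a deliberately simplistic suffix stripper standing in for a real stemmer such as Porter's, chosen so that it reproduces the 'priorit' example above; the stop word list and length parameter are illustrative:

```python
STOP_WORDS = {"a", "be", "not", "on"}   # illustrative stop words file
MIN_LENGTH = 2                          # 'length' parameter: <= 2 chars ignored

def crude_stem(word):
    # Toy suffix stripper standing in for a real stemmer (e.g. Porter's).
    for suffix in ("ies", "ize", "ing", "ed", "es", "s", "y"):
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word

def normalise(tokens):
    out = []
    for tok in tokens:
        tok = tok.lower()                            # case folding: all types
        if tok in STOP_WORDS or len(tok) <= MIN_LENGTH:
            continue                                 # stop words: removed
        if tok.isalpha():
            tok = crude_stem(tok)                    # stemming: 'word' tokens only
        out.append(tok)
    return out
```

With this sketch, 'priority', 'priorities' and 'prioritize' all reduce to 'priorit', and 'WORD' and 'word' normalise to the same token.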

## Priority words

Priority words are similar in concept to stop words, but rather than being removed from the analysis as stop words are, a priority word is assigned an entropy value of 1. For example, if 'reboot' is defined as a priority word, any token containing the word 'reboot' is given an entropy value of 1, regardless of how frequently the word appears in events.

### Note

• Priority words are analyzed after stop words. If a token satisfies the criteria of a stop word, it is removed from the analysis and so cannot subsequently be considered as a priority word.

• The reference database contains the calculated entropies for all tokens regardless of whether they are classed as priority words.
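The ordering rule in the note above can be sketched as a simple lookup chain, where both word lists are illustrative:

```python
PRIORITY_WORDS = {"reboot"}          # illustrative priority word list
STOP_WORDS = {"a", "be", "not"}      # illustrative stop word list

def entropy_for(token, computed_entropy):
    """Stop words are checked first: a word in both lists is removed
    before the priority-word override can ever apply."""
    token = token.lower()
    if token in STOP_WORDS:
        return None                              # dropped from the analysis
    if any(p in token for p in PRIORITY_WORDS):  # token *contains* the word
        return 1.0                               # fixed entropy of 1
    return computed_entropy                      # normal calculated value
```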

## Token variation threshold

Token variation threshold analysis considers the different forms of each field and how the tokens in those forms vary between events in the same alert. This is most easily explained by an example. Assume that all token masking is off and that an alert consists of the following five events:

QDepth beyond 90% threshold on host = 22222

QDepth beyond 90% threshold on host = 44444

QDepth beyond 90% threshold on host = 44444

QDepth beyond 90% threshold on host = 11111

QDepth beyond 90% threshold on host = 44444

The value for the host changes between events: there are three occurrences of 44444 and one occurrence of each of the other values. Values that appear infrequently can skew the entropy value for the alert. To prevent this skewing, you can apply a threshold. The threshold is a ratio between 0 and 1: a value of 0 means that a token contributes to the entropy calculation even if it appears only once, while a value of 1 means that the value must be the same in every event before it is considered. If the threshold is set to 0.5, the value 44444 would contribute to the entropy because it appears in three of the five events (a ratio of 0.6), but the values 11111 and 22222, which each appear only once (a ratio of 0.2), would not.

The Events Analyser performs this analysis for each form of each field within each event of every alert.
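The ratio test can be sketched like this, using the five host values above:

```python
from collections import Counter

def contributing_values(values, threshold):
    """Keep only values whose frequency across the alert's events,
    as a ratio of the event count, meets the threshold (a sketch of
    the idea, not the product code)."""
    counts = Counter(values)
    total = len(values)
    return {value for value, n in counts.items() if n / total >= threshold}
```

With the five host values and a threshold of 0.5, only 44444 survives (3/5 = 0.6); with a threshold of 0, every value contributes.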

This configuration option has no effect unless the Events Analyser uses the EntropyClassic algorithm. The EntropyV2 algorithm is more robust to small variations in wording and in metadata such as IP addresses and timestamps, so it needs no manual tuning parameter for this.