Monitor and Troubleshoot Moogsoft AIOps

This document details the available health and performance indicators included with the Moogsoft AIOps system. It also provides some guidance on how to monitor your system and how to troubleshoot performance problems.

Processing Metrics

Navigate to System Settings > Self Monitoring > Processing Metrics to see a breakdown of the current state of the system based on the metrics received from the running components.

  • The Moogfarmd process and all LAMs publish detailed performance information.

  • A bullet chart at the top of the page shows the key performance metric for the system: Current Maximum Event Processing Time. The defined performance ranges are color coded: good (green), marginal (yellow) and bad (red). As the metric changes the bullet chart updates to reflect good, marginal or bad performance.

  • The system calculates Current Maximum Event Processing Time as the approximate 95th percentile of the current maximum time, in seconds, that an event takes to pass through the system, from its arrival at a LAM to its final processing by a Moolet in Moogfarmd.

  • By default, AlertBuilder, AlertRulesEngine and All Sigalisers are used to calculate the Current Maximum Event Processing Time metric.

  • You can configure the metric_path_moolet property in moog_farmd.conf to specify the Moolets to use to calculate Current Maximum Event Processing Time.

  • By default, the good, marginal and bad ranges of the bullet chart are set to 0-10 seconds, 10-15 seconds and 15-20 seconds respectively. You can change the configuration in the eventProcessingTimes section in the portal block of $MOOGSOFT_HOME/ui/html/web.conf (see the example below).
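
To review the thresholds currently in force without editing anything, you can grep the setting out of web.conf; a simple check (adjust the -A context lines to match your file layout):

grep -A 5 "eventProcessingTimes" $MOOGSOFT_HOME/ui/html/web.conf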

Good performance means LAMs are consuming and publishing events without problems, as indicated by:

  • Message Queue Size is 0.

  • Socket Backlog (if relevant) is not increasing.

Additionally, Moogfarmd is consuming and processing events successfully as indicated by all of:

    • Total Abandoned Messages is 0 for the majority of the time.

    • Asynchronous Task Queue Size is 0 for the majority of the time.

    • Cookbook Resolution Queue is 0 for the majority of the time.

    • The message backlog for every Moolet is 0 for the majority of the time.

    • The Messages Processed count should be the same for all running Moolets, i.e. no Moolet is falling behind (unless custom configuration routes events through different Moolets).

The above should lead to a stable, low Current Maximum Event Processing Time; the exact value depends on the complexity of the system.

Marginal or Bad performance means LAMs are not consuming and publishing events at the rate at which they receive them, as indicated by:

    • Message Queue Size is > 0 and likely increasing.

    • Socket Backlog is increasing.

Additionally, Moogfarmd is not consuming and processing events in a timely fashion as indicated by some or all of:

    • Total Abandoned Messages is constantly > 0 and likely increasing.

    • Asynchronous Task Queue Size is > 0 and likely increasing.

    • Cookbook Resolution Queue is constantly > 0 and likely increasing.

    • The message backlog for one or more Moolets is constantly > 0 and likely increasing.

    • The Messages Processed count is not the same for all running Moolets, indicating that some Moolets are falling behind. This does not apply where custom configuration routes events through different Moolets.

The above will likely lead to an unstable, high Current Maximum Event Processing Time; the exact value depends on the complexity of the system.

See Self Monitoring for more detail.

Graze getSystemStatus Endpoint

The getSystemStatus endpoint returns useful information about running processes within the system. For example:

curl -u graze:graze -k "https://localhost/graze/v1/getSystemStatus"
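
To make the JSON response easier to read, you can pretty-print it; for example, using the json.tool module that ships with Python:

curl -s -u graze:graze -k "https://localhost/graze/v1/getSystemStatus" | python -m json.tool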

See Graze API for more detail.

Moogfarmd Health Logging

Moogfarmd writes detailed health information in JSON format to its log file once a minute. The information falls into six logical blocks:

  • totals: running totals since Moogfarmd was started.

  • interval_totals: running totals for the last 60-second interval.

  • current_state: a snapshot of the important queues in Moogfarmd.

  • garbage_collection: JVM garbage collection data.

  • JVM_memory: JVM memory usage data.

  • message_queues: queue usage and capacity.

Example output:

WARN : [HLog   ][20180510 20:39:55.538 +0100] [CFarmdHealth.java]:533 +|{"garbage_collection":{"total_collections_time":12827,"last_minute_collections":0,"last_minute_collections_time":0,"total_collections":1244},"current_state":{"pending_changed_situations":0,"total_in_memory_situations":4764,"situations_for_resolution":0,"event_processing_metric":0.047474747474747475,"message_queues":{"AlertBuilder":0,"TeamsMgr":0,"Housekeeper":0,"Indexer":0,"bus_thread_pool":0,"Cookbook3":0,"Cookbook1":0,"SituationMgr":0,"SituationRootCause":0,"Cookbook2":0},"in_memory_entropies":452283,"cookbook_resolution_queue":0,"total_in_memory_priority_situations":0,"active_async_tasks_count":0},"interval_totals":{"created_events":1782,"created_priority_situations":0,"created_external_situations":0,"created_situations":10,"messages_processed":{"TeamsMgr":182,"Housekeeper":0,"AlertBuilder":1782,"Indexer":2082,"Cookbook3":1782,"SituationRootCause":172,"Cookbook1":1782,"SituationMgr":172,"Cookbook2":1782},"alerts_added_to_priority_situations":0,"alerts_added_to_situations":111,"situation_db_update_failure":0},"JVM_memory":{"heap_used":1843627096,"heap_committed":3007840256,"heap_init":2113929216,"nonheap_committed":66912256,"heap_max":28631367680,"nonheap_init":2555904,"nonheap_used":64159032,"nonheap_max":-1},"totals":{"created_events":453252,"created_priority_situations":0,"created_external_situations":0,"created_situations":4764,"alerts_added_to_priority_situations":0,"alerts_added_to_situations":36020,"situation_db_update_failure":0}}|+

The message_queues block reports each queue as a "current size/limit" string, where "-" represents an unlimited queue. An example message_queues block is as follows:

"message_queues":{"AlertBuilder":"0/-","Cookbook":"0/-","Housekeeper":"0/-","Indexer":"0/-","bus_thread_pool":"0/-","SituationMgr":"0/-"}

In a healthy system that is processing data:

  • The count of created events and created Situations should increase.

  • The messages_processed should show that Moolets are processing messages.

  • The current_state.message_queues should not be accumulating (there may be spikes).

  • The total_in_memory_situations count should increase over time but will reduce periodically due to the retention_period.

  • The situation_db_update_failure should be zero.
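
Because the health line wraps the JSON in the +|...|+ delimiters shown above, you can extract and pretty-print the most recent block with a one-liner along these lines (assuming the default log location):

grep CFarmdHealth /var/log/moogsoft/moogfarmd.log | tail -1 | sed 's/.*+|//; s/|+.*//' | python -m json.tool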

Tomcat Servlet Logging

Tomcat writes counter information from each of the main servlets to its catalina.out once a minute.

Example output:

WARN : [Thread-][20180510 20:57:05.501 +0100] [CReporterThread.java]:136 +|MoogPoller read [16722] MooMs messages in the last [60] seconds.|+
WARN : [Thread-][20180510 20:57:07.169 +0100] [CReporterThread.java]:136 +|Graze handled [55] requests in the last [60] seconds.|+
WARN : [Thread-][20180510 20:57:10.181 +0100] [CReporterThread.java]:136 +|MoogSvr handled [86] requests in the last [60] seconds.|+
WARN : [Thread-][20180510 20:58:03.197 +0100] [CReporterThread.java]:136 +|Situations similarity component calculated similarities for [264] situations in the last [60] seconds.|+

The counters are:

  • Number of MoogSvr requests in the last minute (i.e. number of standard UI requests made).

  • Number of Moogpoller MooMs messages in the last minute (i.e. number of messages read from the bus).

  • Number of Graze requests in the last minute.

  • Number of similar Situations calculated in the last minute.

In a healthy system that is processing data:

  • The Moogpoller count should always be non-zero.

  • The MoogSvr and Graze counters may be zero, but should reflect the amount of UI and Graze activity.

  • The similar Situations counter may be zero but should reflect the number of similar Situations that are occurring in the system.
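
To spot-check these counters from the command line, grep the counter messages out of catalina.out; for example:

grep "in the last \[60\] seconds" /usr/share/apache-tomcat/logs/catalina.out | tail -4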

Database Pool Diagnostics

Moogsoft AIOps can print the current state of the database connection pool (DBPool) in both Moogfarmd and Tomcat. This can be very useful for diagnosing slow event processing or slow UI response.

To trigger logging, run the ha_cntl utility and pass the cluster name using the -i argument. For example:

ha_cntl -i MOO 
This will perform task "diagnostics" all groups within the [MOO] cluster.
Diagnostics results will be in the target process log file.
Are you sure you want to continue? (y/N)

The utility triggers logging to /var/log/moogsoft/moogfarmd.log. For example, in a well-performing system:

WARN : [pool-1-][20180511 10:06:07.690 +0100] [CDbPool.java]:792 +|[farmd] DATABASE POOL DIAGNOSTICS:|+
WARN : [pool-1-][20180511 10:06:07.690 +0100] [CDbPool.java]:793 +|[farmd] Pool created at [20180510 17:54:48.911 +0100].|+
WARN : [pool-1-][20180511 10:06:07.690 +0100] [CDbPool.java]:797 +|[farmd] [2] invalid connections have been removed during the lifetime of the pool.|+
WARN : [pool-1-][20180511 10:06:07.690 +0100] [CDbPool.java]:833 +|[farmd] Pool size is [10] with [10] available connections and [0] busy.|+

It also triggers logging to /usr/share/apache-tomcat/logs/catalina.out. For example:

WARN : [0:MooMS][20180511 10:06:07.690 +0100] [CDbPool.java]:792 +|[SituationSimilarity] DATABASE POOL DIAGNOSTICS:|+
WARN : [0:MooMS][20180511 10:06:07.690 +0100] [CDbPool.java]:793 +|[SituationSimilarity] Pool created at [20180510 17:55:04.262 +0100].|+
WARN : [3:MooMS][20180511 10:06:07.690 +0100] [CDbPool.java]:792 +|[MoogPoller] DATABASE POOL DIAGNOSTICS:|+
WARN : [3:MooMS][20180511 10:06:07.690 +0100] [CDbPool.java]:793 +|[MoogPoller] Pool created at [20180510 17:55:01.990 +0100].|+
WARN : [0:MooMS][20180511 10:06:07.690 +0100] [CDbPool.java]:833 +|[SituationSimilarity] Pool size is [5] with [5] available connections and [0] busy.|+
WARN : [3:MooMS][20180511 10:06:07.691 +0100] [CDbPool.java]:833 +|[MoogPoller] Pool size is [10] with [10] available connections and [0] busy.|+
WARN : [1:MooMS][20180511 10:06:07.693 +0100] [CDbPool.java]:792 +|[ToolRunner] DATABASE POOL DIAGNOSTICS:|+
WARN : [1:MooMS][20180511 10:06:07.694 +0100] [CDbPool.java]:793 +|[ToolRunner] Pool created at [20180510 17:55:00.183 +0100].|+
WARN : [1:MooMS][20180511 10:06:07.694 +0100] [CDbPool.java]:792 +|[MoogSvr : priority] DATABASE POOL DIAGNOSTICS:|+
WARN : [1:MooMS][20180511 10:06:07.694 +0100] [CDbPool.java]:833 +|[ToolRunner] Pool size is [5] with [5] available connections and [0] busy.|+
WARN : [1:MooMS][20180511 10:06:07.694 +0100] [CDbPool.java]:793 +|[MoogSvr : priority] Pool created at [20180510 17:54:56.800 +0100].|+
WARN : [1:MooMS][20180511 10:06:07.695 +0100] [CDbPool.java]:797 +|[MoogSvr : priority] [5] invalid connections have been removed during the lifetime of the pool.|+
WARN : [1:MooMS][20180511 10:06:07.695 +0100] [CDbPool.java]:833 +|[MoogSvr : priority] Pool size is [25] with [25] available connections and [0] busy.|+
WARN : [1:MooMS][20180511 10:06:07.695 +0100] [CDbPool.java]:792 +|[MoogSvr : normal priority] DATABASE POOL DIAGNOSTICS:|+
WARN : [1:MooMS][20180511 10:06:07.695 +0100] [CDbPool.java]:793 +|[MoogSvr : normal priority] Pool created at [20180510 17:54:56.877 +0100].|+
WARN : [1:MooMS][20180511 10:06:07.695 +0100] [CDbPool.java]:833 +|[MoogSvr : normal priority] Pool size is [50] with [50] available connections and [0] busy.|+

In both of these examples, the connections are "available" and none show as busy. However, in a busy system with flagging performance, the Moogfarmd log will show different results. In the example below, all connections are busy and have been held for a long time. This type of critical issue causes Moogfarmd to stop processing:

WARN : [pool-1-][20180309 16:49:30.031 +0000] [CDbPool.java]:827 +|[farmd] Pool size is [10] with [0] available connections and [10] busy.|+
WARN : [pool-1-][20180309 16:49:30.031 +0000] [CDbPool.java]:831 +|The busy connections are as follows:
1: Held by 5:SituationMgrLOGFILECOOKBOOK for 173603 milliseconds. Checked out at [CArchiveConfig.java]:283.
2: Held by 7:SituationMgrSYSLOGCOOKBOOK for 173574 milliseconds. Checked out at [CArchiveConfig.java]:283.
3: Held by 8:SituationMgrSYSLOGCOOKBOOK for 173658 milliseconds. Checked out at [CArchiveConfig.java]:283.
4: Held by 9:SituationMgrSYSLOGCOOKBOOK for 173477 milliseconds. Checked out at [CArchiveConfig.java]:283.
5: Held by 8:TeamsMgr for 173614 milliseconds. Checked out at [CArchiveConfig.java]:283.
6: Held by 4:SituationMgrSYSLOGCOOKBOOK for 173514 milliseconds. Checked out at [CArchiveConfig.java]:283.
7: Held by 5:PRC Request Assign - SituationRootCause for 173485 milliseconds. Checked out at [CArchiveConfig.java]:283.
8: Held by 2:SituationMgrSYSLOGCOOKBOOK for 173661 milliseconds. Checked out at [CArchiveConfig.java]:283.
9: Held by 6:SituationMgrSYSLOGCOOKBOOK for 173631 milliseconds. Checked out at [CArchiveConfig.java]:283.
10: Held by 6:TeamsMgr for 172661 milliseconds. Checked out at [CArchiveConfig.java]:283.|+

It is normal for some connections to be busy occasionally, but as long as they are not held for long periods of time the system is functioning normally.

You can use the following bash script to automatically gather DBPool diagnostics:

#!/bin/bash

#Get the cluster name
CLUSTER=$($MOOGSOFT_HOME/bin/utils/moog_config_reader -k ha.cluster)

#Get the current line numbers of latest log lines
FARMLINES=$(wc -l /var/log/moogsoft/moogfarmd.log|awk '{print $1}')
TOMLINES=$(wc -l /usr/share/apache-tomcat/logs/catalina.out|awk '{print $1}')

#Run ha_cntl -i <cluster>
ha_cntl -i $CLUSTER -y > /dev/null

sleep 5

#Print the new log lines produced since the trigger
echo "moog_farmd:"
tail -n +$((FARMLINES+1)) /var/log/moogsoft/moogfarmd.log|egrep "CDbPool|Held by"

echo "tomcat:"
tail -n +$((TOMLINES+1)) /usr/share/apache-tomcat/logs/catalina.out|egrep "CDbPool|Held by"

To run the script, execute the following command:

./get_dbpool_diag.sh

MySQL Slow Query Logging

Slow query logging captures long running queries that are impacting the database. You can enable the feature as follows:

  1. Check the current settings in MySQL for the feature:

    mysql> show variables like '%slow_query_log%';
    +---------------------+-------------------------+
    | Variable_name       | Value                   |
    +---------------------+-------------------------+
    | slow_query_log      | OFF                     |
    | slow_query_log_file | /var/log/mysql-slow.log |
    +---------------------+-------------------------+
    2 rows in set (0.00 sec)
    
    mysql> show variables like 'long_query%';
    +-----------------+-----------+
    | Variable_name   | Value     |
    +-----------------+-----------+
    | long_query_time | 10.000000 |
    +-----------------+-----------+
    1 row in set (0.00 sec)
  2. Ensure that the file specified in the slow_query_log_file setting exists and has the correct ownership and permissions. If not, create it and set the ownership:

    touch /var/log/mysql-slow.log
    chown mysql:mysql /var/log/mysql-slow.log
  3. Enable the slow query log:

    mysql> set global slow_query_log=on;

    After that, queries that take longer than the long_query_time threshold (10 seconds by default) appear in the log file. For example:

    /usr/sbin/mysqld, Version: 5.7.19 (MySQL Community Server (GPL)). started with:
    Tcp port: 3306 Unix socket: /var/lib/mysql/mysql.sock
    Time Id Command Argument
    # Time: 2018-02-02T19:12:20.822756Z
    # User@Host: ermintrude[ermintrude] @ localhost [127.0.0.1] Id: 98
    # Query_time: 55.161516 Lock_time: 0.000025 Rows_sent: 1 Rows_examined: 25878591
    use moogdb;
    SET timestamp=1517598740;
    SELECT COALESCE(MIN(GREATEST(last_state_change,last_event_time)), UNIX_TIMESTAMP(SYSDATE())) as oldest FROM alerts WHERE alerts.alert_id NOT IN (SELECT sig_alerts.alert_id FROM sig_alerts);
    # Time: 2018-02-02T19:13:19.255277Z
    # User@Host: ermintrude[ermintrude] @ localhost [127.0.0.1] Id: 98
    # Query_time: 56.131108 Lock_time: 0.000028 Rows_sent: 515 Rows_examined: 25878591
    SET timestamp=1517598799;
    SELECT alerts.alert_id FROM alerts WHERE alerts.alert_id NOT IN (SELECT sig_alerts.alert_id FROM sig_alerts) AND GREATEST(last_state_change,last_event_time) BETWEEN 1417547486 AND 1417633885;

You can also adjust the long_query_time setting up or down as needed, as shown below.
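
For example, to lower the threshold to 5 seconds for the current server session (existing client connections keep the previous value):

mysql> set global long_query_time=5;

To make the slow query settings persist across a MySQL restart, add them to the [mysqld] section of /etc/my.cnf (standard MySQL options; adjust the threshold to suit your environment):

[mysqld]
slow_query_log=1
slow_query_log_file=/var/log/mysql-slow.log
long_query_time=5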

RabbitMQ Admin UI

RabbitMQ includes an admin UI that gives performance information about the message bus. By default this is accessible via http://<hostname>:15672 with credentials moogsoft/m00gs0ft. Check for the following scenarios:

  • Early warning of any system resource issues in the "Nodes" section on the Overview page. For example, file/socket descriptors, Erlang processes, memory and disk space.

  • Build-up of "Ready" messages in a queue - this indicates a message queue is forming. This means that the associated Moogsoft AIOps process is not consuming messages from this queue. It could also point to an orphaned queue that no longer has an associated consumer. This could happen if "message_persistence" has been enabled in system.conf and Moogfarmd and or Tomcat has been reconfigured with a different HA process group name.

See the RabbitMQ docs for information on how to use the admin UI.
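
If you prefer the command line, the same queue depth information is available from rabbitmqctl on the RabbitMQ host; a minimal example:

rabbitmqctl list_queues name messages_ready messages_unacknowledged consumers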

Other Utilities

MySQLTuner provides useful diagnostics and recommendations on MySQL settings. See MySQLTuner for more information.

To monitor the CPU and memory usage of the running Moogsoft AIOps components, you can use the following script, which provides simple CPU and memory monitoring of the RabbitMQ, Socket LAM, Moogfarmd, Tomcat and MySQL processes:

#!/bin/bash

#Polling interval in seconds, passed as the first argument
SLEEPTIME=$1

f_return_metrics() {

        #Return instantaneous CPU (top), average CPU since process start (ps -o pcpu) and RSS memory in KB (ps -o rss) for the given PID
        PROCPID=$1
        TOPOUTPUT=`top -p $PROCPID -n1 | tail -2 | head -1 |sed 's/[^ ]\+\s\(.*\)/\1/g'`
        #On some versions of top the 8th field is the process state, in which case the CPU value is in the 9th field
        PROCICPU=`echo $TOPOUTPUT| awk '{print $8}'`
        if [ "$PROCICPU" == "S" ]; then PROCICPU=`echo $TOPOUTPUT| awk '{print $9}'`;fi
        PROCPCPU=`ps -p $PROCPID -o pcpu|tail -1|awk '{print $1}'`
        PROCMEM=`ps -p $PROCPID -o rss|tail -1|awk '{print $1}'`
        echo $PROCICPU,$PROCPCPU,$PROCMEM

}

#Capture PIDs
RABBITPID=`ps -ef|grep beam|grep -v grep|awk '{print $2}'`
LAMPID=`ps -ef|grep socket_lam|grep java|grep -v grep|awk '{print $2}'`
MYSQLPID=`ps -ef|grep mysqld|grep -v mysqld_safe|grep -v grep|awk '{print $2}'`
TOMCATPID=`ps -ef|grep tomcat|grep java|grep -v grep|awk '{print $2}'`
FARMDPID=`ps -ef|grep moog_farmd|grep java|grep -v grep|awk '{print $2}'`

echo "DATE,TIME,RABBITICPU(%),RABBITPCPU(%),RABBITRSS(Kb),LAMICPU(%),LAMPCPU(%),LAMRSS(Kb),FARMDICPU(%),FARMDPCPU(%),FARMDRSS(Kb),TOMCATICPU(%),TOMCATPCPU(%),TOMCATRSS(Kb),MYSQLICPU(%),MYSQLPCPU(%),MYSQLRSS(Kb)"

while [ true ]; do

  DATENOW=`date +"%m-%d-%y"`
  TIMENOW=`date +"%T"`

  RABBITMEAS=$(f_return_metrics $RABBITPID)
  LAMMEAS=$(f_return_metrics $LAMPID)
  FARMDMEAS=$(f_return_metrics $FARMDPID)
  TOMCATMEAS=$(f_return_metrics $TOMCATPID)
  MYSQLMEAS=$(f_return_metrics $MYSQLPID)

  echo "$DATENOW,$TIMENOW,$RABBITMEAS,$LAMMEAS,$FARMDMEAS,$TOMCATMEAS,$MYSQLMEAS"

  sleep $SLEEPTIME

done

Example usage and output:

[root@ldev04 640]# ./perfmon.sh 5
DATE,TIME,RABBITICPU(%),RABBITPCPU(%),RABBITRSS(Kb),LAMICPU(%),LAMPCPU(%),LAMRSS(Kb),FARMDICPU(%),FARMDPCPU(%),FARMDRSS(Kb),TOMCATICPU(%),TOMCATPCPU(%),TOMCATRSS(Kb),MYSQLICPU(%),MYSQLPCPU(%),MYSQLRSS(Kb)
05-10-18,22:44:26,28.0,8.5,203068,2.0,1.0,557092,20.0,13.5,2853408,4.0,2.1,5680584,28.0,17.4,9657152
05-10-18,22:44:34,14.0,8.5,183492,4.0,1.0,557092,16.0,13.5,2850484,0.0,2.1,5680584,33.9,17.4,9657152
05-10-18,22:44:43,0.0,8.5,181072,0.0,1.0,557092,0.0,13.5,2850484,0.0,2.1,5680584,4.0,17.4,9658312
05-10-18,22:44:51,12.0,8.5,181040,0.0,1.0,557092,0.0,13.5,2850484,0.0,2.1,5680584,4.0,17.4,9658312
05-10-18,22:44:59,0.0,8.5,181040,0.0,1.0,557092,0.0,13.4,2850484,0.0,2.1,5680584,0.0,17.4,9658312

Notes:

  • The script only writes to the console, so redirect its output to a file to log the results (see the usage example after this list).

  • Output is in csv format.

  • ICPU = "Instantaneous CPU Usage (%)"

  • PCPU = "Percentage of CPU usage since process startup (%)"

  • RSS = "Resident Set Size i.e. Memory Usage in Kb"

  • For CPU measurements, 100% represents all of one processor, so results > 100% are possible for multi-threaded processes.
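
For example, to sample every 60 seconds and log the results to a CSV file while leaving the script running in the background (the file name is just an example):

nohup ./perfmon.sh 60 > perfmon_$(date +%F).csv 2>&1 &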

Troubleshooting Performance Problems

If the system is showing signs of latency in alert or Situation creation then the problem is likely with Moogfarmd and/or the database. The following diagnostic steps will help you track down the cause:

  1. Check the Moogfarmd log for any obvious errors or warnings.

     Possible cause and resolution: The cause may be evident from any warnings or errors in the log.

  2. Check the Self Monitoring > Processing Metrics page.

     Possible cause and resolution: If the event_process_metric is large and/or increasing, something is backing up. Also check the Moogfarmd health logging for signs of message_queue build-up in any of the Moolets.

  3. Check the CPU/memory usage of the server itself.

     Possible cause and resolution: If the server as a whole is running close to its CPU or memory limit and no other issues can be found (e.g. rogue processes or memory leaks in the Moogsoft AIOps components), consider adding more resources to the server or distributing the Moogsoft AIOps components.

  4. Check whether the Moogfarmd java process is showing constant high CPU/memory usage.

     Possible cause and resolution: Moogfarmd may be processing an event or Situation storm. Also check the Moogfarmd health logging for signs of message_queue build-up in any of the Moolets. The backlog should clear once the storm subsides.

  5. Has the memory of the Moogfarmd java process reached a plateau?

     Possible cause and resolution: Moogfarmd may have reached its java heap limit. Check the -Xmx settings of Moogfarmd; if no value is specified, has Moogfarmd reached approximately a quarter of the RAM on the server? Increase the -Xmx settings as appropriate and restart the Moogfarmd service. The sketch below shows one way to check the current setting.
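
     A quick way to check whether an explicit heap limit is set on the running process is to pull the -Xmx flag out of its command line; a simple sketch (no output means no explicit -Xmx is set, so the JVM default applies):

     ps -ef | grep moog_farmd | grep java | grep -v grep | tr ' ' '\n' | grep '^-Xmx'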

  6. Is the database tuned?

     Possible cause and resolution: Check the innodb_buffer_pool_size and innodb_buffer_pool_instances settings in /etc/my.cnf as per the Tuning section above. Ensure they are set appropriately and restart MySQL if you make changes.

  7. Check the server for any other high CPU or memory processes, or anything else that might be impacting the database.

     Possible cause and resolution: Something may be hogging CPU/memory on the server and starving Moogfarmd of resources. The events_analyser utility may be running, or a sudden burst of UI or Graze activity may be putting pressure on the database and affecting Moogfarmd.

  8. Run the DBPool diagnostics (see the previous section) several times to assess the current state of the Moogfarmd database connections.

     Possible cause and resolution: Moogfarmd database connections may be maxed out with long running connections; this may indicate a processing deadlock. Perform a kill -3 <pid> on the Moogfarmd java process to generate a thread dump (written to the Moogfarmd log) and send it to Moogsoft Support, as in the sketch below. Alternatively, Moogfarmd may be very busy with lots of short but frequent connections to the database. Consider increasing the number of DBPool connections for Moogfarmd by increasing the top-level "threads" property in the Moogfarmd configuration file and restarting the Moogfarmd service.
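
     The thread dump can be requested with a signal; a minimal sketch reusing the PID lookup pattern from the monitoring script above (the dump is written to the Moogfarmd log):

     kill -3 $(ps -ef | grep moog_farmd | grep java | grep -v grep | awk '{print $2}')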

  9. Turn on MySQL slow query logging (see the earlier section on how to do this).

     Possible cause and resolution: Slow queries from a Moobot in Moogfarmd may be causing problems; review them for efficiency. Alternatively, slow queries from other parts of the system may be causing problems (e.g. inefficient UI filters). Slow queries may also be down to the sheer amount of data in the system; consider enabling Database Split to move old data and/or using the Archiver to remove old data.

  10. Check the Moogfarmd Situation resolution logging using:

      grep "Resolve has been running for" /var/log/moogsoft/moogfarmd.log

      Possible cause and resolution: If this logging shows a non-zero, upward trend in "Resolve" time, Moogfarmd is struggling with the number of "in memory" Situations used for its calculations. Check the Moogfarmd health logging for the current count of "in memory" Situations and consider reducing the retention_period setting in the Moogfarmd configuration (this requires a Moogfarmd restart) and/or closing more old Situations.

  11. Is Moogfarmd memory constantly growing over time, suggesting a memory leak?

      Possible cause and resolution: Note that Moogfarmd memory typically increases for periods of time and is then trimmed back by Java garbage collection and the Sigaliser memory purge (via the retention_period property). Take periodic heap dumps from the Moogfarmd java process and send them to Moogsoft Support so they can analyse the growth. Use the following commands:

      DUMPFILE=/tmp/farmd-heapdump-$(date +%s).bin
      sudo -u moogsoft jmap -dump:format=b,file=$DUMPFILE $(ps -ef|grep java|grep moog_farmd|awk '{print $2}')
      bzip2 $DUMPFILE

Notes:

  • jmap requires the Java JDK to be installed. "yum install jdk" should suffice to install this.

  • Generating a heap dump is likely to make the target process very busy for a period of time and also triggers a garbage collection, so the memory usage of the process may well reduce.

  • Heap dump files may be very large.

If the system is showing signs of slow UI performance, such as long login times or spinning summary counters, then the problem is likely with Tomcat and/or the database. The following diagnostic steps will help you track down the cause:

  1. Check catalina.out for any obvious errors or warnings.

     Possible cause and resolution: The cause may be evident from any warnings or errors in the log.

  2. Check the browser console for any errors or requests that are timing out.

     Possible cause and resolution: Possibly a bug, but more likely the database query associated with the request is taking longer than 30 seconds (the default browser timeout). Investigate the root cause.

  3. Check the network latency between the browser client machine and the server using ping.

     Possible cause and resolution: Latency of 100ms or more can make login noticeably slower.

  4. Check the CPU/memory usage of the server itself.

     Possible cause and resolution: If the server as a whole is running close to its CPU or memory limit and no other issues can be found (e.g. rogue processes or memory leaks in the Moogsoft AIOps components), consider adding more resources to the server or distributing the Moogsoft AIOps components.

  5. Check the MoogSvr/Moogpoller/Graze counter logging in catalina.out.

     Possible cause and resolution: Tomcat may be processing a high number of requests or bus updates. If the Moogpoller count is zero, something may be wrong with the Tomcat > RabbitMQ connection; check the RabbitMQ admin UI for signs of message queue build-up.

  6. Check whether the Tomcat java process is showing constant high CPU/memory usage.

     Possible cause and resolution: Tomcat may be processing the updates from an event or Situation storm. The backlog should clear once the storm subsides.

  7. Has the memory of the Tomcat java process reached a plateau?

     Possible cause and resolution: Tomcat may have reached its java heap limit. Check the -Xmx setting in /etc/init.d/apache-tomcat (see the check below). Increase the -Xmx settings as appropriate and restart the apache-tomcat service.
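
     To confirm the value currently configured in the init script, a simple check:

     grep Xmx /etc/init.d/apache-tomcat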

  8. Is the database tuned?

     Possible cause and resolution: Check the innodb_buffer_pool_size and innodb_buffer_pool_instances settings in /etc/my.cnf as per the Tuning section above. Ensure they are set appropriately and restart MySQL if you make changes.

  9. Check the server for any other high CPU or memory processes, or anything else that might be impacting the database.

     Possible cause and resolution: Something may be hogging CPU/memory on the server and starving Tomcat of resources. The Events Analyser utility may be running, or a sudden burst of Moogfarmd or Graze activity may be putting pressure on the database and affecting the UI.

  10. Run the DBPool diagnostics (see the previous section) several times to assess the current state of the Tomcat database connections.

      Possible cause and resolution: Tomcat database connections may be maxed out with long running connections; this may indicate a processing deadlock. Perform a kill -3 <pid> on the Tomcat java process to generate a thread dump (written to catalina.out) and send it to Moogsoft Support. Alternatively, Tomcat may be very busy with lots of short but frequent connections to the database. A Graze request bombardment is another possibility (Graze does not currently have a separate DBPool). Consider increasing the number of DBPool connections for Tomcat by increasing the related properties in servlets.conf and restarting the apache-tomcat service.

  11. Turn on MySQL slow query logging (see the earlier section on how to do this).

      Possible cause and resolution: Slow queries from inefficient filters in the UI may be causing problems; review them for efficiency. Alternatively, slow queries from other parts of the system may be causing problems (e.g. inefficient Moobot code). Slow queries may also be down to the sheer amount of data in the system; consider enabling Database Split to move old data and/or using the Archiver to remove old data.

  12. Is Tomcat memory constantly growing over time, suggesting a memory leak?

      Possible cause and resolution: Note that Tomcat memory typically increases for periods of time and is then trimmed back by Java garbage collection. Take periodic heap dumps from the Tomcat java process and send them to Moogsoft Support so they can analyse the growth. Use the following commands:

      DUMPFILE=/tmp/tomcat-heapdump-$(date +%s).bin
      sudo -u tomcat jmap -dump:format=b,file=$DUMPFILE $(ps -ef|grep java|grep tomcat|awk '{print $2}')
      bzip2 $DUMPFILE

Notes:

  • jmap requires the Java JDK to be installed. "yum install jdk" should suffice to install this.

  • Generating a heap dump is likely to make the target process very busy for a period of time and also triggers a garbage collection, so the memory usage of the process may well reduce.

  • Heap dump files may be very large.