High Availability Overview

Moogsoft AIOps supports high availability (HA) architectures to make the system more fault tolerant. Each component supports a multi-node architecture that enables redundancy, failover, or both, to minimize the risk of data loss, for example in the case of a hardware failure.

This topic covers the architectures you can use to achieve HA with Moogsoft AIOps. For an example of how to set up a single site HA system, see Single Site High Availability. See HA Reference Architecture for a detailed diagram of the components in a single site HA configuration.

HA Architecture Basics

The following diagram illustrates a single site scenario for HA using two full stacks of Moogsoft AIOps, a primary cluster and a secondary cluster:

ha.png

In its simplest form, each cluster runs on a designated host and the two hosts constitute an HA pair. The third-party components Elasticsearch and RabbitMQ require an additional node to achieve HA. The database currently runs as a two-node master > standby pair. The gray database node in the diagram is a placeholder for planned performance and resilience enhancements to the Moogsoft AIOps architecture.

Distributed HA Architectures

Moogsoft AIOps supports high availability in distributed architectures where each machine hosts a subset of the stack. The dotted lines in the diagram illustrate one way to divide clusters into a distributed architecture. For example, you can separate the UI stack (A) from the database (B) and the database from the remaining layers (C). Alternatively, you can run LAMs for data ingestion on separate hosts.

When choosing how to distribute the components of your HA deployment onto multiple hosts, you must run the following components on the same host:

  • The UI stack (Nginx and Apache Tomcat).

  • Data processing (Moogfarmd), the Message Bus (RabbitMQ), and search (Elasticsearch).

The database and LAMs can run on hosts with no other components. There should be no more than 5 ms transaction latency between Moogfarmd and the database.
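The placement rules above can be checked mechanically before you commit to a layout. The following is a minimal sketch in Python, not a Moogsoft tool: the component names, the `check_layout` function, and the layout dictionary are hypothetical and exist only for illustration. It models one cluster as a mapping of host names to the components they run, and it assumes you have measured the Moogfarmd-to-database latency yourself.

```python
# Hypothetical component names; the groups mirror the co-location rules above.
COLOCATED_GROUPS = [
    {"nginx", "tomcat"},                         # UI stack
    {"moogfarmd", "rabbitmq", "elasticsearch"},  # data processing, Message Bus, search
]
MAX_DB_LATENCY_MS = 5.0  # limit between Moogfarmd and the database


def check_layout(layout, farmd_db_latency_ms):
    """Return a list of rule violations for a {host: set_of_components} layout."""
    problems = []
    for group in COLOCATED_GROUPS:
        # Collect every host that runs at least one member of the group.
        hosts = {host for host, comps in layout.items() if comps & group}
        if len(hosts) > 1:
            problems.append(
                f"{sorted(group)} must share a host but span {sorted(hosts)}"
            )
    if farmd_db_latency_ms > MAX_DB_LATENCY_MS:
        problems.append(
            f"Moogfarmd-to-database latency {farmd_db_latency_ms} ms "
            f"exceeds {MAX_DB_LATENCY_MS} ms"
        )
    return problems


# Example: a distributed layout that satisfies both rules.
layout = {
    "ui-host": {"nginx", "tomcat"},
    "core-host": {"moogfarmd", "rabbitmq", "elasticsearch"},
    "db-host": {"database"},
    "lam-host": {"lam"},
}
print(check_layout(layout, farmd_db_latency_ms=2.0))
```

An empty list means the layout passes both checks. The database and LAM entries are unconstrained in this sketch because the rules above allow them to run on their own hosts.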

To increase capacity within the HA architecture, you can:

  • Scale Moogfarmd vertically. For example, add memory or CPU to hosts running Moogfarmd to alleviate resource contention.

  • Scale the LAM and UI servers horizontally. For example, add low-cost servers to increase LAM processing capacity. When there is a backlog of incoming events because a single instance of a LAM is unable to process the incoming event load, you can increase the thread count for the LAM or provision another instance of the LAM and send a subset of the event stream to it.
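As an illustration of sending a subset of the event stream to each LAM instance, the sketch below partitions events by a stable hash of a source identifier. This is an assumption about how you might split the stream upstream of the LAMs, not a description of a Moogsoft feature. The function name, the instance names, and the choice of source identifier as the partition key are all hypothetical.

```python
import hashlib
from typing import Sequence


def pick_lam_instance(source_id: str, instances: Sequence[str]) -> str:
    """Map a source identifier to one LAM instance, always the same one."""
    # A cryptographic digest gives a hash that is stable across processes,
    # unlike Python's built-in hash() for strings.
    digest = hashlib.sha256(source_id.encode("utf-8")).digest()
    index = int.from_bytes(digest[:8], "big") % len(instances)
    return instances[index]


# Hypothetical instance names for two copies of the same LAM.
lam_instances = ["lam-a", "lam-b"]
for source in ["router-01", "router-02", "db-17"]:
    print(source, "->", pick_lam_instance(source, lam_instances))
```

Keying on the source keeps all events from one source on the same instance, which preserves their relative order within that instance. Adding an instance changes the mapping for many sources, so plan the change for a quiet period.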

You should contact your Moogsoft technical representative to discuss scaling your deployment.

See Sizing Recommendations for more information on hardware sizes and capacity.

After you decide on the best HA architecture for your environment, you can plan your implementation.

Resilience and Failover

Moogsoft AIOps supports automatic failover between the two nodes within an HA pair, for example from one instance of Moogfarmd to another, or from one instance of a LAM to another. However, there is no automatic failover between multiple HA pairs. For example, there is no failover from a primary site to a second site, such as a disaster recovery replica.

Moogsoft AIOps does not support automated fail-back for any architecture. For example, consider an HA pair of Moogfarmd instances. When the instance of Moogfarmd in cluster 1 becomes unavailable, the instance in cluster 2 enters an active state. When the instance from cluster 1 recovers and becomes available, the instance in cluster 2 remains active.
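The failover behavior described above can be modeled as a small state machine. The following Python sketch is a toy model for reasoning about the behavior, not Moogsoft code: the `HaPair` class and its method names are hypothetical. It also assumes that when both instances have been down, the first one to recover becomes active, which the text above does not specify.

```python
class HaPair:
    """Toy model of one HA pair: automatic failover, no automatic fail-back."""

    def __init__(self, first, second):
        self.available = {first: True, second: True}
        self.active = first

    def mark_down(self, instance):
        self.available[instance] = False
        if instance != self.active:
            return  # a passive instance went down; nothing changes
        # Fail over to the peer if it is available.
        self.active = None
        for peer, is_up in self.available.items():
            if is_up:
                self.active = peer
                break

    def mark_up(self, instance):
        self.available[instance] = True
        # No fail-back: a recovered instance stays passive while its peer
        # is active. Assumption: it takes over only if nothing is active.
        if self.active is None:
            self.active = instance


pair = HaPair("cluster1", "cluster2")
pair.mark_down("cluster1")  # cluster 2 becomes active
pair.mark_up("cluster1")    # cluster 1 recovers but stays passive
print(pair.active)
```

In this model, returning to the original instance requires an explicit step, which mirrors the absence of automated fail-back.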