Disaster Recovery

For every SaaS deployment of Moogsoft AIOps, Moogsoft takes every measure to prevent an outage. This topic details our SaaS disaster prevention measures along with our disaster recovery plan in the unlikely case of an outage.

Preventative Measures

All SaaS deployments in the cloud have a High Availability (HA) configuration. We place the two instances of each HA pair in different availability zones. This allows us to manage a single-site outage in any of the cloud provider data centers, and to keep the data and system processing configuration intact during the outage.

Moogsoft AIOps and the database use automatic failover. If one instance of either process fails, the system handles failover on its own, without human intervention. If services in one availability zone go down, the services continue to run in the second availability zone. When the failed availability zone comes back online, services resume full HA capacity. Because of the Moogsoft AIOps vertical scaling model, you don't experience a degradation of service when a single availability zone is offline.
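The failover behavior described above can be illustrated with a minimal sketch. It is not Moogsoft's implementation: the Instance and HAPair classes, the zone names, and the healthy flag are hypothetical and exist only to show how a pair split across two availability zones keeps serving when one zone is down.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Instance:
    """One member of an HA pair (hypothetical model, for illustration only)."""
    zone: str
    healthy: bool = True


class HAPair:
    """Two instances that must live in different availability zones."""

    def __init__(self, first: Instance, second: Instance) -> None:
        if first.zone == second.zone:
            raise ValueError("HA pair instances must be in different availability zones")
        self.instances = [first, second]

    def active(self) -> Optional[Instance]:
        """Return the first healthy instance, or None if both zones are down."""
        for instance in self.instances:
            if instance.healthy:
                return instance
        return None

    def at_full_ha(self) -> bool:
        """True only when both instances are healthy."""
        return all(instance.healthy for instance in self.instances)
```

In this sketch, marking one instance as unhealthy makes active() return the instance in the other zone with no manual step, and at_full_ha() returns True again once the failed instance is marked healthy.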

Backups

We store all configurations and configuration changes in GitHub. All configurations remain intact even in the case of loss of service.

We use infrastructure as code to build out the infrastructure for deployments. This allows us to restore service quickly in the event of infrastructure loss. We store our infrastructure definitions in Git and update them during maintenance changes.

Database Restoration

If the database goes offline, we restore service from the latest five daily snapshots. The time it takes to restore a database depends on the database size and the cloud provider, but we estimate that the process takes up to two hours. During a database outage, you cannot access the Moogsoft AIOps UI or ingest events. We work to restore the database as quickly as possible within the constraints of the snapshots available from the cloud provider.
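As a rough illustration of restoring from a small window of daily snapshots, the sketch below keeps the five most recent snapshots and picks the newest usable one. The Snapshot class, the usable flag, and the function names are assumptions made for this example. The text does not say how a snapshot is chosen, and the sketch does not represent the cloud provider's snapshot API.

```python
from dataclasses import dataclass
from datetime import date
from typing import Iterable, List, Optional

RETAINED_SNAPSHOTS = 5  # from the text: the latest five daily snapshots


@dataclass(frozen=True)
class Snapshot:
    """A daily database snapshot (hypothetical model, for illustration only)."""
    taken_on: date
    usable: bool = True  # assumed flag: whether the snapshot can be restored


def retained(snapshots: Iterable[Snapshot]) -> List[Snapshot]:
    """Return the newest RETAINED_SNAPSHOTS snapshots, newest first."""
    ordered = sorted(snapshots, key=lambda s: s.taken_on, reverse=True)
    return ordered[:RETAINED_SNAPSHOTS]


def pick_restore_point(snapshots: Iterable[Snapshot]) -> Optional[Snapshot]:
    """Return the newest usable retained snapshot, or None if there is none."""
    for snapshot in retained(snapshots):
        if snapshot.usable:
            return snapshot
    return None
```

Under these assumptions, the worst-case data loss is bounded by the age of the newest usable snapshot in the retained window.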

Cluster/Infrastructure Restoration

We should only need to rebuild infrastructure in the case of a cloud provider outage or a security breach. We use Terraform to rebuild. Generally, the process takes about an hour. We pull all configurations from the most recent serviceable checkpoint in Git to prevent configuration loss.

A full restore from our Terraform repo generally takes longer than an infrastructure migration. To minimize outage time and data loss in such cases, we work with the cloud provider to restore the original infrastructure before attempting a full restore.