How Can We Failover SlapOS Master?

It depends on the kind of SLA you expect.

First, keep in mind that if SlapOS Master dies, everything else keeps running. A SlapOS Master outage is therefore not a big problem.

Then there are two topics:

DR - disaster recovery (most important)
HA - high availability (less important)

We implement disaster recovery using one of two procedures, depending on the database size:

if DB < 1 TB, we use the "resiliency" stack of SlapOS to keep copies of the master in 2 or 3 places in the world
if DB > 1 TB, we use NEO technology to have one production cluster on Site A and two clone clusters on Site B/Site C. The clusters are close to "in sync" (a few seconds of difference)
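
The size-based choice above can be sketched as a small decision function. This is only an illustration of the rule, not part of SlapOS: the names `DRPlan` and `choose_dr_plan` are hypothetical, a terabyte is assumed to be 10**12 bytes, and a database of exactly 1 TB (a case the rule does not cover) is assumed to fall on the NEO side.

```python
from dataclasses import dataclass

TB = 10**12  # assumption: decimal terabyte


@dataclass(frozen=True)
class DRPlan:
    """Hypothetical container describing the chosen DR procedure."""
    technology: str
    layout: str


def choose_dr_plan(db_size_bytes: int) -> DRPlan:
    """Pick a disaster recovery procedure from the database size."""
    if db_size_bytes < 0:
        raise ValueError("database size cannot be negative")
    if db_size_bytes < TB:
        # Small database: SlapOS "resiliency" stack.
        return DRPlan("resiliency", "copies of the master in 2 or 3 places")
    # Large database (assumption: exactly 1 TB is treated as large): NEO.
    return DRPlan("NEO", "production cluster on Site A, clone clusters on Site B/Site C")


if __name__ == "__main__":
    for size in (200 * 10**9, 3 * TB):
        print(size, choose_dr_plan(size))
```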

To implement high availability:

either we use LINBIT's DRBD for small datasets (< 1 TB)
or we use NEO, a distributed, redundant, transactional object database. NEO can be deployed on a cluster to achieve HA.

We also use MariaDB as a database. To achieve DR and scalability, we use "Repman" by Signal18, a tool which automates the configuration of MariaDB clusters, including replicas for DR and for reporting.

Regarding MariaDB HA, we are not sure what really works without losing transactions. Both DRBD and MariaDB Galera are too slow. This is why we prefer having a "recovery script" that brings NEO and MariaDB back in sync in case of inconsistency.
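
The recovery script itself is not described here. As a rough idea of what such a consistency check could look like, the sketch below compares two snapshots that map an object identifier to the last transaction id seen by each database. Everything in it is an assumption made for illustration: the function name `find_inconsistencies`, the snapshot format, and the idea that NEO is the reference while MariaDB is repaired. It does not use any NEO or MariaDB API.

```python
def find_inconsistencies(
    neo_state: dict[str, int],
    mariadb_state: dict[str, int],
) -> tuple[list[str], list[str]]:
    """Compare two hypothetical snapshots (object id -> last transaction id).

    Returns (stale, orphaned):
    - stale: objects present in NEO whose MariaDB entry is missing or
      carries a different transaction id
    - orphaned: objects present in MariaDB but absent from NEO
    """
    stale = sorted(
        oid for oid, tid in neo_state.items()
        if mariadb_state.get(oid) != tid
    )
    orphaned = sorted(oid for oid in mariadb_state if oid not in neo_state)
    return stale, orphaned


if __name__ == "__main__":
    neo = {"a": 5, "b": 7, "c": 9}
    mariadb = {"a": 5, "b": 6, "d": 1}
    stale, orphaned = find_inconsistencies(neo, mariadb)
    print("to refresh in MariaDB:", stale)
    print("to remove from MariaDB:", orphaned)
```

A real script would then act on the two lists, for example by rewriting the stale entries and deleting the orphaned ones; how that is done depends on the actual schema and is left out here.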