Setting up High-Availability active-passive with Keepalived

Note that these instructions apply only to relay installations that do not use Docker.

Overview

A High-Availability (HA) active-passive relay deployment requires two identical relays: a primary (master) and a secondary (backup) relay.

The secondary relay acts as a backup and takes over event processing if the primary relay fails.

Switching between relays is done through a virtual IP, which points to whichever of the two relays is currently active.

Failover use cases

Total server outage: the server/machine goes down
Keepalived offline
The relay process goes down
The relay process is not listening on port 13000
The relay cannot connect with the configured collector

Prerequisites

To deploy the HA configuration, you will need:

Two relays in the same network. These must be different relays installed on different machines with different names, but both must have the same configuration and rules. You can use the Clone configuration from option to keep both relays' configurations in sync. Learn more about this option in Working on the list of relays.
Three IP addresses: one for each relay and a virtual IP associated with the active relay.
The Keepalived routing software. It monitors the activity of the relays and triggers failover when needed.

Step-by-step instructions

This guide is written for machines running Ubuntu or CentOS.

Identify the primary and secondary relays

You are starting this process with two identical relays already set up. Later steps in this process require you to distinguish between the primary relay and the secondary (failover) relay. So, to begin, be sure it is clear which is which.

Install the Keepalived package on both relays

Use the corresponding commands according to your OS:
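As a minimal sketch, assuming the stock distribution packages and that `sudo` is available, the helper below prints the install command for a given distribution. The function name `install_cmd` is illustrative and not part of Keepalived or the relay; the distribution IDs are the `ID` values found in /etc/os-release.

```shell
#!/bin/sh
# Print the package-manager command that installs the given packages.
# $1:   distribution ID (only the two covered by this guide are handled)
# $2..: package names
install_cmd() {
    distro="$1"
    shift
    case "$distro" in
        ubuntu) echo "sudo apt-get install -y $*" ;;
        centos) echo "sudo yum install -y $*" ;;
        *)
            echo "unsupported distribution: $distro" >&2
            return 1
            ;;
    esac
}
```

For example, `install_cmd ubuntu keepalived` prints `sudo apt-get install -y keepalived`. Run the printed command on both relays.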

Configure Keepalived

Create Keepalived config file on both relays

Create a new /etc/keepalived/keepalived.conf file on each node with the following content:

Configuration example for the primary node

global_defs {
    enable_script_security
    script_user devo
}

vrrp_script chk_relay {
    script "/etc/keepalived/chk_relay.sh"
    interval 5
    weight 2
    fall 1
}

vrrp_instance ng-relay {
    state MASTER
    interface ens33
    virtual_router_id 1
    unicast_src_ip 172.27.232.211 #edit this
    unicast_peer {
        172.27.232.212 #edit this
    }

    priority 100
    advert_int 1
    authentication {
        auth_type PASS #edit if desired
        auth_pass this_is_the_password_for_ha1_ha2_comms #edit this
    }
    track_script {
        chk_relay
    }
    virtual_ipaddress {
        172.27.232.210 brd 172.27.232.255 dev ens33 #edit this
    }
}

Configuration example for the secondary node

global_defs {
    enable_script_security
    script_user devo
}

vrrp_script chk_relay {
    script "/etc/keepalived/chk_relay.sh"
    interval 5
    weight 2
    fall 1
}

vrrp_instance ng-relay {
    state BACKUP
    interface ens33
    virtual_router_id 1
    unicast_src_ip 172.27.232.212 #edit this
    unicast_peer {
        172.27.232.211 #edit this
    }

    priority 99
    advert_int 1
    authentication {
        auth_type PASS #edit if desired
        auth_pass this_is_the_password_for_ha1_ha2_comms #edit this
    }
    track_script {
        chk_relay
    }
    virtual_ipaddress {
        172.27.232.210 brd 172.27.232.255 dev ens33 #edit this
    }
}

Then edit the lines marked #edit this, along with state and interface where needed, using the following information:

Parameter	Description
`state`	Enter `MASTER` for the primary node and `BACKUP` for the secondary.
`interface`	The name of the network interface used by the relay. For example, `eth0`
`unicast_src_ip`	The static IP address of one node. For example, `10.0.2.15`
`unicast_peer`	The static IP address of the other node. For example, `10.0.2.16`
`virtual_ipaddress`	The virtual IP address that devices will send to and the IP address Devo will see when data is sent. This must be the same on both relays. It takes the format `VIRTUAL_IP brd IP_BROADCAST dev INTERFACE`, for example `10.0.2.252 brd 10.0.2.255 dev eth0`
`auth_pass`	The password must be the same on both relays. We recommend changing the default password. Keepalived uses only the first eight characters of this value.
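The interface name, node address and broadcast address needed for the table above can usually be read from the output of `ip -o -4 addr show`. The sketch below assumes the one-line iproute2 output format, with the interface in the second field and the address in the fourth. The helper name `list_ifaces` is illustrative only.

```shell
#!/bin/sh
# Read `ip -o -4 addr show` output on stdin and print, for every address
# that has a broadcast address: INTERFACE ADDRESS BROADCAST
list_ifaces() {
    awk '{
        brd = ""
        # The broadcast address is the field that follows the "brd" keyword.
        for (i = 1; i <= NF; i++) if ($i == "brd") brd = $(i + 1)
        if (brd != "") {
            split($4, addr, "/")   # drop the /prefix length
            print $2, addr[1], brd
        }
    }'
}
```

On a relay, run it as `ip -o -4 addr show | list_ifaces`. Loopback entries are skipped because they have no broadcast address.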

Configure chk_relay.sh on both relays

Download the chk_relay.sh script and copy it to /etc/keepalived/chk_relay.sh on each relay machine.

Then grant execute permissions for this script to all users, for example with `sudo chmod a+x /etc/keepalived/chk_relay.sh`.

The chk_relay.sh script depends on the gawk and lsof packages. Install them if they are not already present, using the package manager for your OS.
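To see whether anything needs installing, a small check like the following can help. It is a sketch only; the function name `missing_deps` is hypothetical and the check relies on the POSIX `command -v` builtin.

```shell
#!/bin/sh
# Print the name of every command in the argument list that is not on PATH.
missing_deps() {
    for cmd in "$@"; do
        command -v "$cmd" >/dev/null 2>&1 || echo "$cmd"
    done
}
```

Running `missing_deps gawk lsof` prints nothing when both tools are present. Any name it prints can be installed with the package manager, for instance `sudo apt-get install -y gawk lsof` on Ubuntu or `sudo yum install -y gawk lsof` on CentOS.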

Restart Keepalived with the new configuration

With the new keepalived.conf files and the scripts set up on both relays, restart the Keepalived service on each relay to activate the new configuration. On systemd-based hosts this is typically `sudo systemctl restart keepalived`.

It is also important to enable the service so that it starts automatically when the machine reboots, typically with `sudo systemctl enable keepalived`.

Confirm that both relays have registered the virtual IP

This simply confirms that the keepalived.conf file is correct and is being read properly.

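On the primary relay, list the addresses assigned to the Keepalived interface, for example with `ip addr show dev ens33` (substitute your own interface name).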

You should see the primary relay IP and the virtual IP in the command response. If you don't, review the .conf file to be sure the changes were saved.
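The same check can be scripted. The sketch below assumes the one-line output of `ip -o -4 addr show`, where the address is the fourth field; `holds_ip` is an illustrative name, not an existing tool.

```shell
#!/bin/sh
# Succeed (exit status 0) if the IPv4 address given as $1 appears in
# `ip -o -4 addr show` output read from stdin; fail otherwise.
holds_ip() {
    awk -v ip="$1" '
        { split($4, addr, "/"); if (addr[1] == ip) found = 1 }
        END { exit found ? 0 : 1 }
    '
}
```

With the example configuration, `ip -o -4 addr show dev ens33 | holds_ip 172.27.232.210 && echo "virtual IP is on this node"` prints the message only on the node that currently owns the virtual IP.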

Now stop the relay on the primary node.

This turns the primary into the backup relay, making the secondary relay the "master" relay.

Check that the virtual IP now belongs to the secondary node.

Now start the relay again on the primary node.

When you do that, the primary relay will once again act as the master relay.
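This failover and failback behaviour follows from how Keepalived combines `priority` with the `weight` of a tracked script. With a positive weight, the weight is added to the node's priority while the script succeeds. With a negative weight, it is subtracted while the script fails. A minimal sketch of that arithmetic, using the values from the example configuration (the function name `effective_priority` is illustrative):

```shell
#!/bin/sh
# Compute a node's effective VRRP priority.
# $1: configured priority, $2: vrrp_script weight, $3: check result (ok|fail)
effective_priority() {
    base="$1"
    weight="$2"
    status="$3"
    if [ "$weight" -gt 0 ] && [ "$status" = "ok" ]; then
        echo $((base + weight))    # positive weight: bonus while the check passes
    elif [ "$weight" -lt 0 ] && [ "$status" = "fail" ]; then
        echo $((base + weight))    # negative weight: penalty while the check fails
    else
        echo "$base"
    fi
}
```

With both relays healthy, the primary advertises 102 (`effective_priority 100 2 ok`) and the secondary 101 (`effective_priority 99 2 ok`). When chk_relay.sh fails on the primary, its priority falls back to 100, below the secondary's 101, so the virtual IP moves. Once the check passes again the primary returns to 102 and, since preemption is enabled by default, reclaims the virtual IP.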

Send test data to the virtual IP

Use the netcat utility to send 100 events to the virtual IP to confirm that it is enabled for the relays.
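One way to produce the test events is sketched below. The syslog-style line format and the default tag `test.keepalived.ha` are assumptions made for illustration; use whatever event format and tag your relay rules expect. The function name `gen_events` is hypothetical.

```shell
#!/bin/sh
# Print N syslog-style test events, one per line.
# $1: number of events, $2: tag (hypothetical default, replace with your own)
gen_events() {
    count="$1"
    tag="${2:-test.keepalived.ha}"
    host="$(uname -n)"
    i=1
    while [ "$i" -le "$count" ]; do
        printf '<14>%s %s %s: HA test event %d\n' \
            "$(date '+%b %d %H:%M:%S')" "$host" "$tag" "$i"
        i=$((i + 1))
    done
}
```

To send the events, pipe them into netcat, for example `gen_events 100 | nc 172.27.232.210 13000`. Here 172.27.232.210 is the virtual IP from the example configuration and 13000 is the relay port mentioned under the failover use cases. Some netcat variants need an extra option to close the connection at the end of input.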

Testing High-Availability

Now, perform the following tests to make sure the High-Availability configuration is functioning correctly.

Host outage / Keepalived failure

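On the primary node, stop the Keepalived service, typically with `sudo systemctl stop keepalived`.

On the secondary node, check that the secondary relay has become the active one, for example by confirming that the virtual IP now appears in the output of `ip addr show`.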

You will be able to see that your events are still being ingested into the platform.

The relay is down

On the primary node, stop the relay process.

This turns the primary into the backup node, making the secondary relay the "master" node. You can check that the virtual IP now belongs to the other node.