Why 3-node Redundancy is the “New Normal”

Bruno R Neves
3 min read · Feb 20, 2021

For many years, a pair of redundant systems was considered enough to guarantee high availability. This approach still works well for many implementations today, particularly where the members of a cluster can work independently of each other.

In some situations, however, it may not be the best solution, especially if the members of the cluster regularly exchange information with each other to keep the system running as it is "supposed to". Take IBM API Connect as an example: the members of the cluster must know the status of many objects residing on each of the other members. Among other things, they need to stay in touch with each other to ensure that rate limiting thresholds are honored.

Now imagine a cluster with only two members. These members need to communicate with each other to ensure that neither of them allows more API requests to be processed than the consumption Plan definition permits. If, for whatever reason, the members lose connectivity to each other, but not to the network serving the incoming requests, both will keep processing requests without knowing the status of the other, eventually resulting in overconsumption by the API consumers. This may in turn cause downtime on downstream servers that are not prepared to serve the increased number of requests. This loss of synchronization between cluster members is known in the IT industry as “split-brain”. Once connectivity between the two members is reestablished, each will hold information that conflicts with the other’s.
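The over-consumption effect can be illustrated with a small simulation. This is only a sketch under simplifying assumptions: `GatewayNode`, `simulate`, and the way counts are exchanged are hypothetical names and mechanics chosen for illustration, not how IBM API Connect works internally.

```python
from dataclasses import dataclass


@dataclass
class GatewayNode:
    """Hypothetical cluster member that enforces a shared rate limit."""
    name: str
    limit: int            # requests allowed per window by the Plan
    local_count: int = 0  # requests this member accepted itself
    peer_count: int = 0   # last count learned from the other member

    def handle_request(self) -> bool:
        """Accept a request only if the known cluster-wide total is under the limit."""
        if self.local_count + self.peer_count >= self.limit:
            return False
        self.local_count += 1
        return True


def simulate(requests_per_node: int, limit: int, connected: bool) -> int:
    """Return the total number of requests accepted by a two-member cluster."""
    a = GatewayNode("a", limit)
    b = GatewayNode("b", limit)
    for _ in range(requests_per_node):
        for node, peer in ((a, b), (b, a)):
            node.handle_request()
            if connected:
                # Healthy link: the peer learns the updated count immediately.
                peer.peer_count = node.local_count
    return a.local_count + b.local_count


if __name__ == "__main__":
    print("connected:  ", simulate(100, limit=100, connected=True))
    print("split-brain:", simulate(100, limit=100, connected=False))
```

With the link up, the two members together stop at the limit. With the link down, each member believes it is the only one serving traffic and accepts the full limit on its own, so the cluster as a whole lets through twice what the Plan allows.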

To solve the split-brain situation, the “quorum” design was introduced in the industry. In this design, a cluster should have an odd number of members, starting at a minimum of 3. The members of the cluster constantly vote to select the primary member, which holds the information that the rest of the members can trust. The voting process happens several times per second. If, for whatever reason, communication with the primary member is lost, the majority of the members decide during the next voting round which member becomes the new primary. In parallel, the minority (depending on the component role) can be “shut down” and stop serving new requests as a way to preserve data consistency. An even number of members does not fit this design well: a network failure can split the cluster into two equal halves, leaving neither side with a majority.
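The majority rule itself is simple arithmetic, and a few lines are enough to show why odd cluster sizes are preferred. The function names below are illustrative and not taken from any particular product.

```python
def majority(cluster_size: int) -> int:
    """Smallest number of members that forms a strict majority."""
    return cluster_size // 2 + 1


def has_quorum(reachable: int, cluster_size: int) -> bool:
    """True if a group of `reachable` members (counting itself) holds a majority."""
    return reachable >= majority(cluster_size)


def tolerated_failures(cluster_size: int) -> int:
    """How many members can be lost while the rest still hold a majority."""
    return cluster_size - majority(cluster_size)


if __name__ == "__main__":
    for size in (2, 3, 4, 5):
        print(f"{size} members: majority={majority(size)}, "
              f"tolerated failures={tolerated_failures(size)}")
    # A 3-member cluster split 2/1: only the group of two keeps serving.
    print(has_quorum(2, 3), has_quorum(1, 3))
    # A 4-member cluster split 2/2: neither half has a majority.
    print(has_quorum(2, 4))
```

Two members tolerate no failures at all under this rule, and four members tolerate only one, the same as three. Adding a member to reach an even count raises cost without raising fault tolerance.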

As soon as connectivity among all members of the cluster is reestablished, the members in the majority feed data to the members of the minority group. Once they are all back in sync, every member is part of a single group again and no conflicting information remains.
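The resynchronization step can be sketched as the minority discarding its own view and copying the majority’s. This is a deliberately simplified illustration: the `Member` class, the `in_majority` flag, and the full-copy strategy are assumptions made for the example, and real systems typically replay a log or transfer a snapshot instead.

```python
from dataclasses import dataclass, field


@dataclass
class Member:
    """Hypothetical cluster member holding some replicated state."""
    name: str
    in_majority: bool
    state: dict = field(default_factory=dict)


def resync(members: list) -> None:
    """Overwrite each minority member's state with a copy of the majority's state."""
    source = next(m for m in members if m.in_majority)
    for member in members:
        if not member.in_majority:
            member.state = dict(source.state)  # minority data is discarded
            member.in_majority = True          # member rejoins the single group


if __name__ == "__main__":
    cluster = [
        Member("a", True, {"plan-gold": 80}),
        Member("b", True, {"plan-gold": 80}),
        Member("c", False, {"plan-gold": 35}),  # stale count from the partition
    ]
    resync(cluster)
    print([m.state for m in cluster])
```

Because the minority stopped serving requests during the partition, nothing of value is lost when its state is overwritten, which is what makes this one-way copy safe.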

Many modern technologies now follow this approach, and many databases have been doing so for years. The quorum design improves on earlier ways of achieving high availability and can be expected to become the standard for cluster design in the next few years.

Bruno R Neves

Integration Specialist focused on Healthcare and Life Sciences. Certified by CNCF, IBM, and OpenGroup. Views are my own.