
FoundationDB is an outstanding database, and it is a joy to use as a developer. However, when it comes to running FDB in production, it is not always clear how many instances are needed and which classes should be assigned to processes. As of June 2019, FoundationDB doesn't (yet) have a "smart" tool that can automatically set the layout of an FDB cluster, so it is important to set the layout manually to achieve decent performance.

In this post, I'll write about the best practices for configuring an FDB cluster that I discovered while experimenting with the database and reading the FoundationDB forums.

Roles, Classes, and Processes in FoundationDB

FoundationDB server processes are identical binaries, but it is possible to assign a process to a specific class or role to prioritize a particular workload. The difference is that assigning a class expresses a preference for recruiting one of the corresponding roles (based on how well each role fits that class), while assigning a role pins the process to a single workload. The relationships between classes and roles are defined in the Locality.cpp file.

Out of clutter, find simplicity

While nothing prevents you from assigning a single role to a process, it is more flexible to assign the storage, transaction, and stateless classes to processes. Let's look at them more closely:

Storage class: best fit for the storage role, worst fit for the transaction log role. Other roles won't be recruited in a process with this class.

Transaction class: good fit for the transaction log role, okay fit for the proxy, resolver, log router, cluster controller roles, worst fit for the storage role.

Stateless class: good fit for the proxy, master, resolver, log router, cluster controller, data distributor, and ratekeeper roles. Storage and transaction log roles won't be assigned to a stateless process.

When FDB tries to figure out which role to recruit in a process, the fitness priority is used: best fit > good fit > okay fit > unset fit > worst fit > never assign. For example, if there are two processes, one with the transaction class and one with the stateless class, the proxy role will be assigned to the stateless one since it fits better, even though the transaction class is also suitable for this role.

Please note that assigning a role to a process is treated as a class with the best fit for that role.
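
Classes are typically set per process in foundationdb.conf. Here is a minimal sketch of one machine running three processes with different classes (the ports are arbitrary):

    [fdbserver.4500]
    # prefer the storage role on this process
    class = storage

    [fdbserver.4501]
    # prefer the transaction log role on this process
    class = transaction

    [fdbserver.4502]
    # prefer proxy/master/resolver/etc. on this process
    class = stateless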

Every individual has a role to play for the betterment of our cluster

So, if a process is recruited for a particular role, what will it do? Let's look at the roles one by one:

  • A process with the storage role is responsible for storing key/value B-trees and serving reads to clients. It provides a consistent database snapshot for the last 5 seconds. A typical FDB cluster has ~80-90% of all processes recruited as storage.
  • The transaction log process keeps an append-only log of mutations for ~7 seconds before the changes are pulled by the storage processes, and then deletes them. If a storage process is not able to pull the changes in time, the disk space used by the log will grow.
  • The proxy sits between the clients and the transactional authority and coordinates actions along the write transaction path. It also gives out recent read versions to clients.
  • The master keeps track of the commit version and assigns commit versions to transactions to provide global ordering.
  • The resolver checks transactions and aborts them if they're conflicting.
  • The data distributor is responsible for distributing data across shards.
  • The ratekeeper limits the transaction rate to prevent cluster overload.
  • The cluster controller knows the cluster configuration, provides it to new clients and reconfigures the cluster in the event of network disruption.
  • The log router is recruited in remote datacenters to support log replication.
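
Once the cluster is up, fdbcli can show which role ended up on which process, and the number of some roles can be tuned explicitly. A small sketch (the counts are only illustrative):

    # list every process with its class, recruited roles, and resource usage
    $ fdbcli --exec "status details"

    # ask FDB to recruit specific numbers of proxies, tLogs, and resolvers
    $ fdbcli --exec "configure proxies=3 logs=4 resolvers=1"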

Now that we know more about the classes, roles, and processes, let's see what resource requirements different roles have.

Requirements and resources

Different process roles consume different resources such as CPU, IOPS, network bandwidth and RAM, and we need to be careful to not create resource contention that will lead to reduced performance.

The transaction log is both IOPS- and bandwidth-hungry, and it's important to dedicate a whole disk to it. Memory consumption is ~1 GB.

The storage role is not IOPS-hungry, but it can saturate the CPU, so it's better not to run it in the same process with other roles. If your SSD is fast enough, having more than one storage process will utilize it better. Storage consumes less bandwidth than the tLog or proxy, but it usually requires ~4 GB of memory on average.

The proxy is bandwidth- and CPU-hungry, and it is important to give this role enough CPU to keep latencies low.

The resolver checks conflict ranges, and if you don't have many conflicts, its CPU consumption will be low. No high bandwidth or IOPS consumption is expected, but memory consumption can be around 1 GB.

The rest of the stateless processes do not consume any significant resources and can be deployed without restrictions.

Based on the requirements above, there is a set of recommendations:

  1. Don't run tLog and storage on a single disk.
  2. Don't run anything else in a process that has the storage or proxy role, since both are CPU-hungry.
  3. Run each proxy on a different host.
  4. It might be useful to dedicate a separate host for tLog due to its bandwidth consumption.
  5. Don't leave processes with an unset class, as FDB will recruit a storage role there by default, and this can increase latency for other roles in that process.

These are not hard rules but rather recommendations that should improve the performance of your cluster. For example, if a proxy is consuming the whole CPU core but the network is not saturated, it makes sense to run another proxy process on the same host.
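
As an illustration of the first recommendation, foundationdb.conf allows overriding datadir per process, so a tLog can get a dedicated disk while storage processes stay on another one. A sketch, assuming hypothetical mount points /mnt/storage and /mnt/tlog:

    [fdbserver]
    command = /usr/sbin/fdbserver
    # default data directory, used by the storage process below
    datadir = /mnt/storage/foundationdb/$ID
    logdir = /var/log/foundationdb

    [fdbserver.4500]
    class = storage

    [fdbserver.4501]
    class = transaction
    # override: keep the transaction log on its own disk
    datadir = /mnt/tlog/foundationdb/$ID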

Building the cluster

Given that we understand more about roles and their requirements, let's see what might be reasonable cluster layouts.

2-node replication

If we want to tolerate a single machine failure (and make progress after such a failure), we need 3 machines with 2-copy replication. However, for performance reasons, we should not share disks between the tLog and storage roles, which means we need 6 separate disks. Cloud providers often give only one disk per node, which is why we need at least six different nodes to avoid performance penalties in the event of failure.

A minimal Layout #1: 6 nodes, 2 processes in each node:

  • tLog master
  • tLog cluster_controller
  • tLog stateless
  • storage proxy
  • storage proxy
  • storage stateless

In this layout, we try to utilize the network bandwidth as much as possible by placing tLogs and proxies on different nodes, and at the same time, we don't lose performance in case of failure because we have extra tLog and storage processes. You can easily scale this cluster by adding more storage processes, and by adding more tLogs, proxies, and resolvers if they're overloaded.
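
As an example, the first node of this layout ("tLog master") could be described in foundationdb.conf roughly like this (a sketch; the ports are arbitrary, and the master slot is expressed via the master process class):

    [fdbserver.4500]
    class = transaction

    [fdbserver.4501]
    class = master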

A minimal Layout #2: 5 nodes, 2 processes in each node:

  • tLog master
  • tLog cluster_controller
  • storage proxy
  • storage proxy
  • storage stateless

Here we use fewer nodes; however, if a tLog fails, FDB will recruit the missing tLog on a process that already runs a storage role. This will degrade performance and increase latencies until the node is back.

Please note that the tLog : storage ratio is not optimal here; it's done this way purely for fault-tolerance reasons. Usually, you need more storage processes than tLog processes, although it depends on your workload: if you have lots of heavy writes, then having more tLog processes is better.

Also, don't forget to configure three coordinators to keep the cluster running in the event of a node failure.
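
Coordinators can be set through fdbcli, either automatically or by listing the addresses explicitly (the addresses below are placeholders):

    # let FDB pick a suitable set of coordinators
    $ fdbcli --exec "coordinators auto"

    # or specify them by hand
    $ fdbcli --exec "coordinators 10.0.0.1:4500 10.0.0.2:4500 10.0.0.3:4500"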

3-node replication

If you're unsure about the reliability of the available hosts, it is possible to use 3-copy replication. Again, I'm designing this layout assuming that only one disk is available per host, therefore we need 4 hosts for the tLog role and 4 for the storage role.

A minimal Layout #3: 8 nodes, 2 processes in each node:

  • tLog master
  • tLog cluster_controller
  • tLog stateless
  • tLog stateless
  • storage proxy
  • storage proxy
  • storage proxy
  • storage resolution

With this layout we don't share a node between the proxy and the tLog, disks aren't shared between stateful processes, and every process has a class set, so FDB won't spawn a storage process where we don't expect it. We already have quite a few tLog processes here, so to scale this cluster we should add more storage nodes.
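
To switch the cluster to this redundancy mode and to the role counts from the layout above, something along these lines in fdbcli should work (treat the numbers as a starting point, not a definitive tuning):

    $ fdbcli --exec "configure triple logs=4 proxies=3 resolvers=1"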

That's all, folks. I hope these layouts will make it easier to bootstrap a FoundationDB cluster. Do you have other cool ideas about the possible cluster layouts? Please share them in the comments!