A couple of days ago I started investigating and eliminating some of the possible single points of failure in my current Kubernetes setup.

One of the biggest single points of failure is definitely my master node. I followed most of the recommendations for beginners and set up a single master node.

In theory, a single master node would not hurt too much. The Kubernetes cluster and your services that are running in it can survive a short period of time without the master node, which does the following:

  • Controls the deployments and where they run (which pod goes to which node).
  • Monitors pod health and, if your applications support it and you have provided the right instructions, scales up your application so that it can serve more users.
  • Maintains the complex relationships between Ingresses, Services, Deployments, Pods, Certificates, Secrets and other components in an IT infrastructure.

We will explore more on the benefits of Kubernetes in a later article.

Since the master node does not run any essential application services, a typical reboot after, say, an operating system upgrade does not have any impact on your cluster.

However, a prolonged outage of the master node is catastrophic, because it monitors and orchestrates your whole infrastructure. Imagine that you do not have the master node for a week (which is possible if the server crashes, you do not have a backup, and you have to set everything up from the beginning) and you have 1,00 applications: every day you have server crashes and SSL certificate expiries, and nothing is restarting and renewing everything automatically for you.

That is where multiple master nodes come into the picture. The steps for setting up multiple master nodes are shown below.

Big picture on migration strategy

I originally had a single-master Kubernetes cluster running on a Raspberry Pi, with 3 worker nodes. The worker nodes run some essential services such as my blog, the comments engine and data storage, as well as a few home automation services. One of the worker nodes also doubles as the arbiter for my 3-node GlusterFS storage.

With limited capacity, I need to ensure that my blog stays up and running during the migration, with minimum downtime. There are unofficial guides around the Internet on how to upgrade a single-master Kubernetes cluster to a multi-master one, but I decided to stick with the official recommendation and migrate the nodes to a new cluster instead. I used the stacked configuration for multiple master nodes, which means that the control plane and the configuration store (etcd) reside on the same machine. The control plane on a master node only talks to the configuration store on the same master node. This configuration is efficient and is probably good enough for up to around 50 nodes.

Let's go back to the architecture diagram:

Luckily, I have GlusterFS on bare metal, so when I am done and destroy my single-master cluster, I do not have to reconfigure GlusterFS. This setup is very similar to other persistent storage setups, such as awsElasticBlockStore in Amazon or gcePersistentDisk in Google.

I have also decided to make the bare-metal machines of my GlusterFS part of my cluster. If they have spare capacity, I can allocate some workload there as well.

At a high level, my migration path using the 6 machines that I have (4 Raspberry Pis and 2 low-powered bare-metal PC Engines APUs) looks like this:

  1. Leave the GlusterFS untouched.
  2. Scale down my 6-node (4 Pis, 2 APUs) Kubernetes cluster to 2 nodes (1 master on a Raspberry Pi, 1 worker on a PC Engines APU), and serve all my workloads there for a short period of about 4 hours.
  3. With the 2 spare Raspberry Pis, create redundant master nodes, which gives me room for trial and error during the setup, and use 1 APU and 1 Pi as worker nodes (4 nodes in total: 2 masters and 2 workers).
  4. Configure ingress-nginx and cert-manager in the new cluster.
  5. Configure the GlusterFS services in the new cluster.
  6. Once the new cluster is set up and tested, re-create the deployments of my blog and comments engine in the new cluster.
  7. After testing the deployments, change my firewall to redirect to the new cluster instead of the old one, and re-create the SSL certificates.
  8. At this point, the cluster is re-created and migrated. The remaining steps ensure best practice.
  9. Decommission the old 2-node cluster. Now I have 2 more spare machines to add to the new cluster.
  10. Add another master node so that we have a 3-master setup. This is required to maintain quorum among the master nodes.
  11. Add the last spare machine to the cluster as a worker node.
  12. Done.

After this setup I should have some spare capacity to also run other workloads better, such as the home automation engine (Homebridge).

The processes of trial and error

I did not use any of the automation tools available on the Internet for this setup and stuck with the official guide, using kubeadm. However, the official guide does have some drawbacks:

  • Not all CNI network backends are equal. I found that some do not work well with ingress-nginx and cert-manager, which are essential to all setups, once you go to multiple master nodes.
  • The CNI is also closely tied to the MetalLB load balancer, so read the MetalLB instructions carefully.
  • CNIs are hard to work with. I read documentation for a few days before making up my mind, and we have not even got into Network Policy yet.

And the official guide does not provide much assistance there.

Preparation: Nothing to do with Kubernetes

Before you start, you will need a load balancer for your master nodes, which will:

  1. Serve as a single entry point for network traffic to the master nodes.
  2. Monitor the master nodes' availability and forward network traffic only to the "real" master nodes that are working.

You will probably need another Linux / Unix machine for this load balancer if you are on bare metal. I am using OpenBSD relayd for this purpose. Since it only does internal load balancing, the security risk is relatively low as long as you configure it properly.

On Linux, you can use HAProxy.

For simplicity, you can configure the load balancer to do a TCP health check on port 6443 (the default). If you are really paranoid, you can use a TLS health check on the /livez path of the kube-apiserver, but you will have to ensure that you copy the right certificate to your load balancer.

The load balancer accepts HTTPS traffic on port 6443 and forwards it to the kube-apiserver running on port 6443 of your master nodes.
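
To make the TCP health check concrete, here is a minimal Python sketch of what the load balancer does for each master node: try to open a connection to port 6443 and treat the node as healthy only if that succeeds. The function names and the example addresses are mine, purely for illustration. relayd and HAProxy have their own built-in checks, so this is not something you would run in production.

```python
import socket


def tcp_health_check(host, port=6443, timeout=2.0):
    """Return True if a TCP connection to host:port can be opened in time."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, unreachable hosts and timeouts.
        return False


def healthy_backends(hosts, port=6443, timeout=2.0):
    """Keep only the master nodes that currently pass the TCP check."""
    return [h for h in hosts if tcp_health_check(h, port, timeout)]


# Hypothetical usage (the addresses are placeholders, not my real nodes):
#   healthy_backends(["192.168.1.11", "192.168.1.12"])
```

A TCP check only proves that something is listening on the port. The /livez check goes one step further and asks the kube-apiserver itself whether it is alive.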

Also, you will need to define new network segments for the new cluster. You will typically need 2 network segments that are big enough, and they must not:

  1. overlap with your current cluster, and
  2. overlap with your physical network's segment.

In a bare-metal setup, it is easiest to pick ranges from the private 10.0.0.0/8 block (10.x.x.x, the former class A range).

The 2 network segments will be used by the CNI and Kubernetes to:

  1. Assign IP addresses to your pods, and
  2. Assign IP addresses to the services that are formed from the pods.

The 2 segments must also not overlap each other. They should also be large enough to hold a lot of pods and services. Please note that a horizontally scalable deployment can easily consume a lot of pod IP addresses.
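
A quick way to sanity-check the segments before running kubeadm is Python's standard ipaddress module. The sketch below is only an illustration: the pod and service ranges are the ones I use later in this article, while the physical LAN and old-cluster ranges are made-up placeholders that you should replace with your own.

```python
import ipaddress
from itertools import combinations


def find_overlaps(segments):
    """Return the pairs of segment names whose CIDR ranges overlap."""
    nets = {name: ipaddress.ip_network(cidr) for name, cidr in segments.items()}
    return [
        (a, b)
        for a, b in combinations(nets, 2)
        if nets[a].overlaps(nets[b])
    ]


segments = {
    "physical LAN": "192.168.0.0/24",      # placeholder, use your own
    "old cluster pods": "10.244.0.0/16",   # placeholder, use your own
    "new pods": "10.116.0.0/16",
    "new services": "10.16.0.0/12",
}

# Show how many addresses each segment provides.
for name, cidr in segments.items():
    size = ipaddress.ip_network(cidr).num_addresses
    print(f"{name}: {cidr} ({size} addresses)")

# An empty list means that no two segments overlap.
print("overlaps:", find_overlaps(segments))
```

As a bonus, ip_network() raises a ValueError if the address has host bits set for the given prefix length, which catches typos in the CIDR before kubeadm does.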

Setting up the first master node

I followed much of the official guide for the initial master node setup, except that I had to add the --apiserver-advertise-address flag to the kubeadm command. The full command looks like this:

sudo kubeadm init --control-plane-endpoint "<master node load balancer DNS name>" \
  --upload-certs \
  --pod-network-cidr=10.116.0.0/16 \
  --service-cidr=10.16.0.0/12 \
  --apiserver-advertise-address=<IP address of primary interface of your node>

A lot of machines have multiple network interfaces. Specifying --apiserver-advertise-address prevents kubeadm from picking the wrong interface.

For safety, the pod-network-cidr and the service-cidr should not overlap with the current cluster's configuration. Refer to the preparation section for the requirements of the 2 network segments.

If your load balancer is set up properly, you will get a response similar to the one below after a while:

Your Kubernetes control-plane has initialized successfully!

To start using your cluster, you need to run the following as a regular user:

  mkdir -p $HOME/.kube
  sudo cp -i /etc/kubernetes/admin.conf $HOME/.kube/config
  sudo chown $(id -u):$(id -g) $HOME/.kube/config

Alternatively, if you are the root user, you can run:

  export KUBECONFIG=/etc/kubernetes/admin.conf

You should now deploy a pod network to the cluster.
Run "kubectl apply -f [podnetwork].yaml" with one of the options listed at:
  https://kubernetes.io/docs/concepts/cluster-administration/addons/

You can now join any number of the control-plane node running the following command on each as root:

  kubeadm join <master node load balancer name>:6443 --token <some token> \
    --discovery-token-ca-cert-hash sha256:<sha checksum of the ca-cert> \
    --control-plane --certificate-key <another token for joining master node>

Please note that the certificate-key gives access to cluster sensitive data, keep it secret!
As a safeguard, uploaded-certs will be deleted in two hours; If necessary, you can use
"kubeadm init phase upload-certs --upload-certs" to reload certs afterward.

Then you can join any number of worker nodes by running the following on each as root:

kubeadm join <master node load balancer name>:6443 --token <some token> \
    --discovery-token-ca-cert-hash sha256:<sha checksum of the ca hash> 

Take note of the above because it will be needed from time to time. The output also asks you to deploy a CNI, but we can do this later, after confirming that the 2nd master node is set up.

Setting up 2nd master node

I ran into a lot of difficulties when setting up the 2nd master node. It looks like the official guide may have skipped a few steps, or there is a bug in kubeadm that prevents the join from completing properly.

I hit a few hurdles:

  1. When the secondary master node is being set up, kubeadm sets up the control plane and etcd (the configuration store) on the same machine. However, kubeadm seems to have difficulty getting etcd up and running, and while this is happening, kubectl commands on the first master node do not respond because the cluster thinks that etcd is not in a healthy state. This is most likely because, once the second etcd member is registered, etcd needs both members to form a majority.
  2. As a result, kubeadm timed out because the setup failed, and the primary master node was also dead because of the lingering configuration.
  3. I went through at least 5 rounds of trial and error before I got it right.

I will share my experience below. Make sure the following preparations are done beforehand:

  1. If you are using Docker (which I am, and if you followed the guide that I used before, you will be too), ensure that docker ps runs properly and produces output rather than an error.
  2. Have 3 terminal windows open on the secondary master node.
  3. In one of the 3 terminal windows, use the following command to watch the Docker containers:
watch docker ps
Continuously watching docker processes

This allows you to monitor the containers that kubelet starts during the setup process, to ensure that you are on the right track.

In another terminal window, enter the command that you noted down earlier for joining a master node to the control plane, and append --apiserver-advertise-address <secondary master node IP> to it. The whole command looks like this:

sudo kubeadm join <master node load balancer name>:6443 --token <some token> \
    --discovery-token-ca-cert-hash sha256:<sha checksum of the ca-cert> \
    --control-plane --certificate-key <another token for joining master node> \
    --apiserver-advertise-address <IP of the master node that wishes to join>

Then monitor the logs. At some point, kubeadm will indicate that it is trying to start the etcd container. Use the first (watcher) terminal window to check whether it is being spawned. You will see something like this:

Every 2.0s: docker ps                                                                                                                                                                                                       node0: Mon Mar  1 12:33:59 2021

CONTAINER ID        IMAGE                  COMMAND                  CREATED             STATUS              PORTS               NAMES
cae1db844f8d        05b738aa1bc6           "etcd --advertise-cl…"   19 hours ago        Up 19 hours                             k8s_etcd_etcd-node0_kube-system_e73ac71ae6dc115491ef2ed6831b545b_2

on your node, along with some other running containers.

If you cannot get etcd running, use the third terminal window to run

sudo systemctl restart kubelet

to restart kubelet. This makes kubelet re-read the /etc/kubernetes/manifests folder and quickly try to start everything that is not running. Watch carefully until etcd is running.

After a while, the kubeadm command running in the 2nd terminal window will report that it has completed successfully, and ask you to copy the admin configuration to your own folder so that you can administer the cluster from the 2nd node.

It is now time to modify your load balancer to add the 2nd master node as a backend.

Once everything is up and running, when you execute:

kubectl get po -A -o wide

You will see 2 kube-apiserver pods running, together with the other control-plane components, across the 2 nodes.

I stuck with 2 master nodes and set up everything else before adding the 3rd one.

CNI plugin setup

This is another difficult part of the setup. Flannel, the simplest CNI plugin that allows Kubernetes to form a single, flat network over all the nodes regardless of their location (this is a very big assumption, and the CNI is there to make sure it holds), does not seem to work well with ingress-nginx, which is crucial for getting my web traffic in.

I suspect it is related to the MTU setup in Flannel, but I still struggle to understand why it worked in my single-master setup but not here. Anyway, I looked around and chose Calico as my network backend. It has more features, and its documentation is quite good. So let's try it out.

Kubernetes assumes that all pods, nodes and services can communicate with each other directly without network address translation. In a small bare-metal setup, this is not an issue because most home routers offer only a single broadcast domain (such as the very common range of 192.168.0.1 - 192.168.0.255).

However, in a complex environment this is not necessarily the case. You may have a demilitarized zone (DMZ), NAT, switches, routers, etc. that prevent machines from talking to each other directly.

All CNI plugins provide the functionality to overcome this issue and make the network look like 1 giant, flat sheet. More complex CNIs (like Calico) also provide network policies to ensure that traffic is secured. They all rely on Linux iptables, conntrack, IPVS and other tricks to do their work, which you do not need to worry about most of the time.

They do so by implementing different kinds of layering on top of the existing network, called overlays. Network traffic is sent in packets, typically up to 1,500 bytes each, and each packet carries some metadata (the headers, typically 40 bytes or more for IPv4 and TCP) and some data (the remainder). If you overlay the network, additional metadata has to be stored in the data area of the outer packet so that the inner packet remains routable in a flat network. This is called encapsulation. The maximum size of a packet is governed by the MTU (maximum transmission unit), so encapsulation leaves less room for the actual data.

Calico is no different. However, it has an option that is smart enough to skip encapsulation when all your nodes are in the same subnet, which reduces overhead, so there is no need to meddle with the MTU in that case. It also has a setting that allows you to change the MTU if necessary.
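
The arithmetic behind the MTU is simple enough to sketch. The overhead figures below (50 bytes for VXLAN and 20 bytes for IP-in-IP, both over IPv4) are the ones given in the Calico documentation as far as I know; treat them as assumptions and check them against your Calico version.

```python
# Encapsulation overhead in bytes over IPv4 (assumed, see the note above).
ENCAP_OVERHEAD = {"none": 0, "ipip": 20, "vxlan": 50}


def max_pod_mtu(physical_mtu, encapsulation):
    """Largest pod MTU that still fits in one outer packet after encapsulation."""
    return physical_mtu - ENCAP_OVERHEAD[encapsulation]


# Compare the modes on a standard 1,500-byte Ethernet network.
for mode in ENCAP_OVERHEAD:
    print(f"{mode}: pod MTU up to {max_pod_mtu(1500, mode)} bytes")
```

On a 1,500-byte physical network, any pod MTU at or below the VXLAN figure is safe. Picking a lower value such as 1,400 simply leaves some headroom.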

Calico has 2 operational modes, BGP and VXLAN, to try to meet the basic assumptions of Kubernetes.

We will use VXLAN in this case, as MetalLB does not work well with BGP. In addition, if you are not careful with BGP, your network routing can get very messed up, because it is a protocol that changes routes dynamically based on changes in the environment. In an enterprise, it may also mean a lengthy discussion with your network team about something that perhaps neither of you understands easily...

First, download the bare-metal manifest from Calico.

curl https://docs.projectcalico.org/manifests/calico.yaml -O

Then modify a few lines to enable the VXLAN backend, disable BGP and set the MTU. Here is the diff (also available on GitHub).

@@ -10,12 +10,12 @@ data:
   # Typha is disabled.
   typha_service_name: "none"
   # Configure the backend to use.
-  calico_backend: "bird"
+  calico_backend: "vxlan"

   # Configure the MTU to use for workload interfaces and tunnels.
   # By default, MTU is auto-detected, and explicitly setting this field should not be required.
   # You can override auto-detection by providing a non-zero value.
-  veth_mtu: "0"
+  veth_mtu: "1400"

   # The CNI network configuration to install on each node. The special
   # values in this config will be automatically populated.
@@ -3644,10 +3644,10 @@ spec:
               value: "autodetect"
             # Enable IPIP
             - name: CALICO_IPV4POOL_IPIP
-              value: "Always"
+              value: "Never"
             # Enable or Disable VXLAN on the default IP pool.
             - name: CALICO_IPV4POOL_VXLAN
-              value: "Never"
+              value: "Always"
             # Set MTU for tunnel device used if ipip is enabled
             - name: FELIX_IPINIPMTU
               valueFrom:
@@ -3695,7 +3695,7 @@ spec:
               command:
               - /bin/calico-node
               - -felix-live
-              - -bird-live
+           #   - -bird-live
             periodSeconds: 10
             initialDelaySeconds: 10
             failureThreshold: 6
@@ -3704,7 +3704,7 @@ spec:
               command:
               - /bin/calico-node
               - -felix-ready
-              - -bird-ready
+           #   - -bird-ready
             periodSeconds: 10
           volumeMounts:
             - mountPath: /lib/modules

Once this is done, apply the manifest and you will start to see Calico related pods being spawned and executed.

Also, you should install the Calico command-line tool (calicoctl) so that you can easily change the configuration. I installed it as a Kubernetes pod because that allows me to use it from different master nodes.

Using calicoctl, I also modified the IP pool setting for my pod pool:

calicoctl get pool default-ipv4-ippool -o yaml
Get pool information in Calico and export as YAML so that you can modify and change it.

Make the relevant changes so that VXLAN is only used across subnets.

apiVersion: projectcalico.org/v3
kind: IPPool
metadata:
  creationTimestamp: "2021-02-28T02:23:34Z"
  name: default-ipv4-ippool
  resourceVersion: "42415"
  uid: 73cea9f1-aad0-49d5-bf69-ef66a61ecf9d
spec:
  blockSize: 26
  cidr: <your pool as indicated in Kubernetes setup>
  ipipMode: Never
  natOutgoing: true
  nodeSelector: all()
  vxlanMode: CrossSubnet			# Before - Always

Then apply it. Because the pool already exists, use apply rather than create. If you installed calicoctl as a Kubernetes pod, apply it like this:

calicoctl apply -f - < ippool.yaml
Import the pool using "-f -" to read from stdin, because calicoctl runs in a pod.

The network setup is then done. Your default MTU is 1,400 bytes, and the overlay network is only used across subnets. If your physical network MTU is 1,500 bytes, you should be fine.

Setting up new worker nodes

At this point, you should also add worker nodes so that ingress-nginx and cert-manager can run on them.

Adding new worker nodes should be really simple. Reset the existing node configuration via:

sudo kubeadm reset
cd ~ && rm -R .kube
sudo rm -R /etc/cni/net.d
Resetting Kubernetes node, remove existing configuration

and reboot the machine. Rebooting is the simplest way to clear all the lingering iptables rules that the old configuration may have left behind.

Then use the kubeadm join command to join the new cluster, using the information you captured earlier:

kubeadm join <master node load balancer name>:6443 --token <some token> \
    --discovery-token-ca-cert-hash sha256:<sha checksum of the ca hash>
Joining the new cluster; make sure you replace all the placeholders in angle brackets with your own values

Setting up ingress-nginx, cert-manager and MetalLB

Once the CNI is installed and at least 1 worker node has joined, set up ingress-nginx, cert-manager and MetalLB (referring to my previous guides).

You should also set up a production issuer. (Example configuration). If you do not set the MTU properly during the CNI setup, you may get timeouts when you set up the production certificate issuer (like me, who investigated for almost a whole night before figuring it out).

When you set up MetalLB, make sure you do not reuse the original LoadBalancer IP range in the new configuration, as it will create duplicate ARP entries in your network and confuse everything.

apiVersion: v1
kind: ConfigMap
metadata:
  namespace: metallb-system
  name: config
data:
  config: |
    address-pools:
    - name: default
      protocol: layer2
      addresses:
      # IP address range to be used for load balancing
      - 192.168.xxx.yyy-192.168.xxx.zzz    # Must not overlap with your
                                           # existing configuration
MetalLB configuration
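
Since MetalLB address pools are written as start-end ranges, checking the new pool against the old one is a small exercise with Python's ipaddress module. This is just a sketch: the function names are mine and the two ranges below are hypothetical examples, not my real pools.

```python
import ipaddress


def parse_range(text):
    """Parse a 'start-end' address range into a pair of IP address objects."""
    start, end = (ipaddress.ip_address(part.strip()) for part in text.split("-"))
    if start > end:
        raise ValueError(f"range start is after range end: {text}")
    return start, end


def ranges_overlap(first, second):
    """Return True if the two 'start-end' ranges share at least one address."""
    a_start, a_end = parse_range(first)
    b_start, b_end = parse_range(second)
    return a_start <= b_end and b_start <= a_end


old_pool = "192.168.1.240-192.168.1.250"   # hypothetical old cluster pool
new_pool = "192.168.1.200-192.168.1.210"   # hypothetical new cluster pool
print("pools overlap:", ranges_overlap(old_pool, new_pool))
```

If this reports an overlap, pick a different range for the new cluster before applying the ConfigMap.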

Configure GlusterFS endpoint and services

If you, like me, are using GlusterFS as your persistent storage, you should also set up the endpoints and the services accordingly. My previous article covers this in great detail.

Setting up your other services and deployments

Once you have the whole infrastructure set up, including:

  • Master nodes (x2 so far)
  • Worker nodes (x2 so far)

you can start to deploy your services to the new cluster. If you have saved your existing manifest files, it is simply a matter of re-applying them in the new cluster.

When you re-apply the deployments and services, make sure you do not deploy the ingress configurations yet, as your front firewall may not be configured properly at this point.

We will not cover those here. If you are following my example, refer to my GitHub repository for the manifest files.

Once those are set up, use

kubectl get pods -o wide
kubectl get deployments -o wide
kubectl get svc -o wide

to see whether everything is set up properly. In my examples, the Ghost and Schnack deployments come with liveness and readiness checks.

If you want to go into the details, use

kubectl logs <pod-name>

to review the logs generated by your application.

Change firewall configuration and re-apply for SSL certificates

With all deployments and services properly configured and verified, it is now time to change the firewall configuration so that you can:

  1. Allow Let's Encrypt to validate your certificate request through ingress-nginx, and
  2. Expose your services to the Internet.

When you configured MetalLB, you should have specified a new IP address range for the load balancer and changed the ingress-nginx-controller service from NodePort to LoadBalancer.

You can get the new LoadBalancer IP address assigned by MetalLB using the following command:

kubectl get svc -n ingress-nginx

Results should look like below:

NAME                                 TYPE           CLUSTER-IP      EXTERNAL-IP       PORT(S)                      AGE
ingress-nginx-controller             LoadBalancer   10.29.202.116   192.168.xxx.yyy   80:31142/TCP,443:31184/TCP   30h
ingress-nginx-controller-admission   ClusterIP      10.23.26.79     <none>            443/TCP                      30h

192.168.xxx.yyy is your new LoadBalancer IP. You can redirect your traffic in your firewall to the new IP address.

Once that is done, apply the ingress-related manifests in the new cluster, and you should be able to get a new SSL certificate. Details can be found in my previous article.

Wrapping it up

At this point, after verifying the renewal of the SSL certificates, your new multi-master setup should be ready with 2 master nodes and 2 worker nodes.

Best practice in Kubernetes calls for an odd number of master nodes, so we will add another master node by repeating the steps above. You may run into errors because the uploaded certificates for adding master nodes are only valid for 2 hours.

The following command re-uploads the required certificates:

sudo kubeadm init phase upload-certs --upload-certs

Once this is done, a single line will appear in the terminal, which is the new certificate key. Use it in the following command to join the third master node.

sudo kubeadm join <master node load balancer name>:6443 --token <some token> \
    --discovery-token-ca-cert-hash sha256:<sha checksum of the ca-cert> \
    --control-plane --certificate-key <new certificate key> \
    --apiserver-advertise-address <IP of the master node that wishes to join>

Then monitor the progress as before. If needed, restart the kubelet service so that etcd can start properly. When done, adjust your load balancer again to add the 3rd master node.

Once that is done, you are free to add more worker nodes, or more master nodes if necessary. Just keep in mind that you should use an odd number of master nodes. etcd needs a majority of the master nodes to agree, and with an even number a split down the middle leaves no majority, so the control plane will not know what to do when the master nodes are out of sync.
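
The reasoning behind the odd number is the majority rule used by etcd: a cluster of n members needs floor(n/2) + 1 of them to agree. Here is a small sketch of the arithmetic (the function names are mine):

```python
def quorum(members):
    """Smallest number of members that forms a majority."""
    return members // 2 + 1


def fault_tolerance(members):
    """How many members can fail while a majority is still available."""
    return members - quorum(members)


# Compare cluster sizes from 1 to 5 master nodes.
for n in range(1, 6):
    print(f"{n} master node(s): quorum {quorum(n)}, "
          f"can lose {fault_tolerance(n)}")
```

Going from 1 to 2 master nodes does not let you lose any node, and going from 3 to 4 still only tolerates 1 failure, which is why an extra even-numbered node buys nothing.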

Readout