Part 3 - Installing a Two-Node Proxmox VE Cluster
It all began with a desire to run a full OpenShift cluster at home. Not minikube, not minishift, not single-node stuff. I shaved a yak and ended up building a homelab in an IKEA cabinet that is now ready to run a hypervisor and then OpenShift.
I looked online for open-source or free hypervisors and found VMware ESXi, Red Hat Virtualization (oVirt) of course, and Proxmox VE. I did not choose VMware for two reasons. First, older servers can only run older versions of ESXi, because newer versions do not support the older Xeon CPUs, which means you won't get the latest versions and updates. I could be wrong, but this is what I read. The second reason is that I work for Red Hat, duh. For the same reason, I tried installing oVirt, which is the upstream project of Red Hat Virtualization. However, I found it too much for a homelab setup: there are too many good features that I do not need, and the hardware requirements are too high for my current setup. I tried installing it, but it requires a 10 Gigabit Ethernet connection because it uses Ceph as the VM storage in a hyper-converged setup (storage and compute on the same nodes).
So I gave Proxmox VE a try and found that the installation is very easy and took only 3 minutes. After a reboot you can start creating VMs right away. It may be missing some of the enterprise features that ESXi, RHV and Hyper-V have, but it's more than capable for my homelab needs.
Installing Proxmox VE
Installing is as easy as 1-2-3. I downloaded the installer ISO image from the Proxmox website, burned it onto a USB stick and let the server boot from it. Follow the instructions, provide a few details, et voilà!
The first thing I did after installation was to add the Proxmox apt repository, install ifupdown2 and then upgrade to the latest packages. You need to install ifupdown2 so that you can change and apply the network settings from the Proxmox web UI.
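Those post-install steps can be sketched roughly as follows. This assumes the no-subscription repository is the one being added (the post does not say which), that the commands run as root on each PVE node, and that the Debian codename can be read from /etc/os-release. The file name pve-no-subscription.list is just a convention.

```shell
# Run as root on each Proxmox VE node.
# Assumption: the pve-no-subscription repository is the one wanted here.
. /etc/os-release   # provides VERSION_CODENAME (the Debian release name)
echo "deb http://download.proxmox.com/debian/pve ${VERSION_CODENAME} pve-no-subscription" \
  > /etc/apt/sources.list.d/pve-no-subscription.list

# Refresh the package index, install ifupdown2, then upgrade everything.
apt update
apt install -y ifupdown2
apt full-upgrade -y
```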
The next thing I did was to configure a second NIC on the nodes in preparation for clustering.
I installed Proxmox VE on two Dell servers and created a cluster. I kept my third server, an HP, as a separate node that is not part of the cluster, because I am not running it 24/7. It's too loud. Creating a cluster was also a breeze. Just click the "Create Cluster" button in the UI under the Datacenter menu and follow the instructions. In the network options, choose the second network card, the one that is not used by the VMs (not mapped to vmbr0). On the other node, just click the "Join Cluster" button and provide the Join Information, which you get from the same menu item on the first node. Once the cluster setup is complete, you will see all the VMs and storage in one place under the cluster root item.
As you can see above, I distributed the OpenShift VMs across the two Proxmox nodes because this setup allows me to shut down one of the nodes without losing the entire OpenShift cluster. I have observed that when the entire OpenShift cluster is shut down, it doesn't always come back up; etcd ends up broken somehow. But if you leave a few nodes running, everything works just fine when the other nodes come back. With the above VM distribution, I can shut down one server, work on it (say, upgrade the RAM) and turn it back on without losing OpenShift. In the meantime I lose half of the OpenShift nodes, but the cluster continues to work, except for the pods that have PV claims on a local disk of a node that was shut down.
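For reference, the same two steps can also be done from the shell with pvecm instead of the UI. This is only a rough equivalent, not what I clicked through: the cluster name homelab and the addresses 192.168.20.11 and 192.168.20.12 (the second NIC of each node) are made up for illustration.

```shell
# On the first node (as root): create the cluster and bind corosync
# to the address of the second NIC. Name and address are hypothetical.
pvecm create homelab --link0 192.168.20.11

# On the second node (as root): join via the first node's address and
# pass this node's own second-NIC address as link0.
pvecm add 192.168.20.11 --link0 192.168.20.12
```

The join step asks for the root password of the first node.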
One problem I encountered when running a two-node Proxmox VE cluster is that when one node is down, the cluster is out of quorum and the VMs are frozen. This is very common in distributed/cluster setups. I have seen it in Hadoop, Elasticsearch, Kafka, etc. Usually we use an odd number of nodes (3, 5, 7) so that there is no tie when an election for a new leader is invoked. After reading the Proxmox documentation, I learned that, just like RHEL's HA, it uses corosync for clustering, and corosync has a thing called a quorum device. A QDevice in PVE is a third server that will not host any VMs. Its only job is to provide one vote as a tie breaker. It does not do a lot, so you don't need a powerful machine, but it has to be physically separate from the two cluster nodes. I have seen folks online running a QDevice as a VM inside PVE itself, but I don't think that's a good idea. So I used my rusty and dusty old Raspberry Pi 2B, which I used in 2016 for my Hadoop cluster experiments, as the PVE quorum device. With a quorum device, the PVE cluster remains in quorum even if one of the servers is down.
Image: My old Hadoop+Spark cluster
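The arithmetic behind the tie-breaker vote is simple enough to sketch in a few lines of shell. This is a simplification that assumes one vote per node and a strict-majority rule (half the expected votes, rounded down, plus one); the function names are made up, and corosync's special options are ignored.

```shell
#!/bin/sh
# quorum_needed EXPECTED_VOTES: votes required for a strict majority.
quorum_needed() {
  echo $(( $1 / 2 + 1 ))
}

# has_quorum EXPECTED_VOTES VOTES_PRESENT: prints "yes" or "no".
has_quorum() {
  if [ "$2" -ge "$(quorum_needed "$1")" ]; then
    echo yes
  else
    echo no
  fi
}

has_quorum 2 1   # two nodes, one down: 1 of 2 votes
has_quorum 3 2   # two nodes plus QDevice, one node down: 2 of 3 votes
```

With two votes in total, a single surviving node can never reach the required two votes. Adding the third vote lowers nothing but changes the outcome: two of three is a majority.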
I burned Raspberry Pi OS (formerly Raspbian, Debian-based) to a microSD card using the Raspberry Pi Imager for Mac and plugged it into the Pi. You must enable SSH login for the root user on the Pi. You can also use any other OS as long as you can install corosync-qnetd on it. I prefer a uniform OS flavour at each level: Debian-based OSes (Proxmox VE and Raspberry Pi OS) at the virtualization layer and Red Hat-based ones (Fedora CoreOS, CentOS) at the OpenShift/containerization layer.
To set up the quorum device:
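On the Pi, the preparation looks roughly like this. It is a sketch that assumes a stock Raspberry Pi OS where the default user has sudo and the SSH service is named ssh; the sed line simply forces PermitRootLogin to yes in the standard sshd_config.

```shell
# Run on the Raspberry Pi while it still has internet access.
sudo apt update
sudo apt install -y corosync-qnetd

# Give root a password and allow root login over SSH.
sudo passwd root
sudo sed -i -E 's/^#?PermitRootLogin.*/PermitRootLogin yes/' /etc/ssh/sshd_config
sudo systemctl enable --now ssh
sudo systemctl restart ssh
```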
Once you have nothing left to install on the Raspberry Pi (because the Pi may lose its internet connection after the next step), change its IP address so that it can communicate with the servers on the PVE cluster subnet. In my case all NICs are plugged into the same switch. I don't have a managed or VLAN-aware switch, so I used subnets to separate the VM network from the clustering network. The clustering network uses the range 192.168.20.x with a 24-bit subnet mask (192.168.20.0/24). By default, the Pi has a DHCP client enabled, so you will have to assign a static IP address.
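One way to assign the static address, assuming a Raspberry Pi OS release that manages the network with dhcpcd and a wired interface named eth0. The host address 192.168.20.5 is made up; pick any free address in the cluster subnet. No gateway is set because this subnet is only used for cluster traffic.

```shell
# Run on the Raspberry Pi. Appends a static configuration for eth0.
# 192.168.20.5 is a hypothetical address on the cluster subnet.
sudo tee -a /etc/dhcpcd.conf >/dev/null <<'EOF'
interface eth0
static ip_address=192.168.20.5/24
EOF
```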
Then you need to reboot the Pi.
On the PVE nodes, install corosync-qdevice and add the Pi as a quorum device.
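A sketch of that step, run as root. The address 192.168.20.5 stands in for whatever static IP the Pi was given; it is not taken from my actual setup.

```shell
# On BOTH PVE nodes: install the QDevice client side.
apt update
apt install -y corosync-qdevice

# On ONE PVE node only: register the Pi as the quorum device.
# 192.168.20.5 is a hypothetical address for the Pi.
pvecm qdevice setup 192.168.20.5

# Check the vote count afterwards.
pvecm status
```

The setup command connects to the Pi over SSH as root, which is why root login had to be enabled there.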
The pvecm status command should show that there are 3 total votes. That's all it took to create a 2-node Proxmox VE cluster that remains in quorum after losing one node.
With the above setup, I can live-migrate VMs from one node to the other. But it takes a while because the VM disks reside in the local LVM storage of each node, so during a live migration PVE has to copy the VM disks from one node to the other. To achieve high availability, I need shared storage between the nodes where the VM disks are stored.
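From the shell, such a migration looks roughly like the line below. The VM ID 101 and the target node name pve2 are made up; the --with-local-disks flag is what tells PVE to copy the locally stored disks along with the running VM.

```shell
# Run as root on the node that currently hosts the VM.
# 101 and pve2 are hypothetical: substitute your VM ID and target node.
qm migrate 101 pve2 --online --with-local-disks
```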
I tried setting up Ceph on the same nodes as hyper-converged infrastructure, but again I ran into the same issue as with hyper-converged oVirt (RHV). Ceph slowed the machines down dramatically. It really wants a 10Gb link, which I don't have. It made the VMs unstable and super slow, with IO delays of up to 20%. So I decided to ditch hypervisor-level availability and let OpenShift deal with the loss of worker or master nodes. I still have a problem with the load balancer node (okd4-services), though. When this VM is lost, OpenShift is not accessible (at least not in the usual way).
I also tried ZFS, because with ZFS I could set up periodic replication of the VM disks between the nodes. But this created problems too. ZFS has a huge overhead: it added up to 30% IO delay in my tests, making all the VMs extremely slow. This is probably because I am running ZFS on top of hardware RAID (which is not recommended).
In summary, I let OpenShift handle high availability. When a worker or master node is lost, it automatically reschedules the pods to other nodes. As long as there are enough resources on the remaining nodes to run the rest of the pods, I am safe.
There are just a couple of issues I don't have a solution for yet.
- The first one is that the OpenShift load balancer (okd4-services) is a VM. So if this VM is lost because its host is down, OpenShift won't be accessible.
- I don't have highly available shared storage for OpenShift Persistent Volumes. I am using the local-storage operator to expose local disks attached to the VMs as persistent volumes. This means that if I lose a VM that is holding a PV on its local disk, the pods that use that PV will fail.
Both nodes have hardware RAID set up. One node has RAID1 (2 disks) and the second node has RAID5 (4 disks), so I am covered in terms of losing an HDD. The loss of a PVE host will be handled at the OpenShift level, with the outstanding issues mentioned above. Both servers have redundant power supplies, but I don't have a UPS. They are too damn expensive and I don't understand why. I may build my own cheap UPS in the future. I did this a long time ago with a car battery and an inverter (though I'm not sure one would want to put a lead-acid battery inside their home).
Next up: creating the OpenShift nodes, setting up a DNS server and installing OpenShift.