The Henderson Homelab

My Ceph cluster, and everything it runs, lives on three Cisco UCS C240 M3s.

Each node has 10 to 12 WD Red HDDs. There are two NVMe PCIe cards (1 Sandisk Fusion IOMemory SX350 & 1 Sandisk Fusion IOMemory PX600). Each node includes 128GB of RAM and is connected to two 10Gbps networks (physically independent switches) for the public and cluster networks.

  • rbd: stored on HDD devices
  • rbd-ssd: stored on SSD devices
  • lxd: stored on SSD devices
  • k8s-rbd: stored on SSD devices

Libvirtd Unable to Connect when Using RBD Storage Pools

I ran across a problem recently where listing virtual machines through virsh and virt-manager was taking ~45 minutes. It turns out the problem was due to this patch in libvirt for using RBD fast-diff. In my case the ‘default’ storage pool is actually a link to my RBD storage pool, and that patch checks for the enabled feature but does not check the flags to see whether the object-map and fast-diff are invalid.

Good News Everyone!

There has been a recent patch that solves this. Unfortunately some distributions have not caught up with it yet (looking at you Ubuntu Bionic). Anyhow, this will hopefully make its way down the various streams that package libvirtd and the problem will be sorted.
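
Until the fix reaches your distribution, it may help to find out which images actually carry the invalid flags. The following is a minimal sketch, assuming `rbd info` prints a `flags:` line such as `object map invalid, fast diff invalid` for affected images. The helper name `has_invalid_flags` and the pool name `rbd` in the commented loop are hypothetical.

```shell
#!/usr/bin/env bash
# Hypothetical helper: reads `rbd info` output on stdin and succeeds (exit 0)
# when the flags line reports an invalid object map or an invalid fast-diff.
has_invalid_flags() {
    grep -Eq '^[[:space:]]*flags:.*(object map|fast diff) invalid'
}

# Example loop (not executed here): print affected images in a pool named "rbd".
# for img in $(rbd ls rbd); do
#     rbd info "rbd/$img" | has_invalid_flags && echo "rbd/$img"
# done
```

An affected image can usually be repaired with `rbd object-map rebuild <pool>/<image>`, though that is a workaround on the Ceph side rather than a fix for the libvirt behaviour.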

Creating Ceph Bluestore OSDs with Spinning Drives and SSDs for DB/WAL

As a consultant I work with Ceph using a downstream version of the product, so once in a while I like to catch up on new features and functions that have not yet hit the downstream/supported version. That process has led me to setting up my homelab (again) and using Ceph Nautilus as a base for storage.

Using ceph-volume

Ceph comes with a deployment and inspection tool called ceph-volume. Much like the older ceph-disk tool, ceph-volume will allow you to inspect, prepare, and activate object storage daemons (OSDs). The advantages of ceph-volume include support for LVM and dm-cache, and it no longer relies on or interacts with udev rules.

For my use case I have installed a single Fusion IOMemory card into each of my nodes in order to deploy OSDs with faster storage for the DB and WAL devices. It’s a very good idea to read the Bluestore configuration reference, as BlueStore is the default for new OSD deployments. Take careful note of the recommendations for the use of a DB and WAL device.

If there is only a small amount of fast storage available (e.g., less than a gigabyte), we recommend using it as a WAL device. If there is more, provisioning a DB device makes more sense. The BlueStore journal will always be placed on the fastest device available, so using a DB device will provide the same benefit that the WAL device would while also allowing additional metadata to be stored there (if it will fit).

Bluestore Configuration Reference

In my case, thanks to the Fusion IOMemory card, I want to create enough partitions to support 11 OSDs and make each DB partition as large as possible (which will put the WAL on the same partition). My fast media has 931 GB of usable storage; if I split it evenly across all eleven OSDs I should end up with partitions ~84 GB in size. I like round numbers, so those partitions are now 80 GB in size and the deployment command looks something like this.

root@ganymede:~# ceph-volume lvm prepare --bluestore --dmcrypt --data /dev/sdd --block.db /dev/fioa5

Be sure to replace the --data argument with your storage device, and point the --block.db argument at the partition on the fast storage you wish to use for the given OSD. After that I run the activation command for all OSDs on the node.

root@ganymede:~# ceph-volume lvm activate --all

Assuming everything has gone as expected, the OSDs will start up and join the cluster, and you’ll get all the speedy goodness of an SSD for the write-ahead log and RocksDB.
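
The sizing arithmetic and the per-drive prepare commands above can be sketched as a dry run. This is only an illustration: the helper names `db_partition_gb` and `print_prepare_cmds` are hypothetical, the second data device (/dev/sde) and the assumption that DB partitions are numbered consecutively are mine, and nothing is executed against the disks.

```shell
#!/usr/bin/env bash
# Whole-GB size per DB partition when a fast device is split evenly
# across a number of OSDs (integer division, rounds down).
db_partition_gb() {
    echo $(( $1 / $2 ))
}

# Dry run: print one prepare command per data device, pairing each with the
# next DB partition on the fast device. Nothing is executed.
# Usage: print_prepare_cmds <fast-device-prefix> <first-partition> <data-dev>...
print_prepare_cmds() {
    local prefix=$1 part=$2 dev
    shift 2
    for dev in "$@"; do
        echo "ceph-volume lvm prepare --bluestore --dmcrypt --data $dev --block.db ${prefix}${part}"
        part=$(( part + 1 ))
    done
}

db_partition_gb 931 11                             # prints 84
print_prepare_cmds /dev/fioa 5 /dev/sdd /dev/sde   # hypothetical device layout
```

At 80 GB each, eleven partitions take 880 GB, which leaves roughly 51 GB of the card unallocated. Review the printed commands before running any of them by hand.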

Moving Drives From an Old Ceph Cluster to a New Ceph Cluster

Among the core functions of my homelab is a storage environment based on Ceph. For months I’ve been looking for, buying, and preparing new hardware and a server rack for an update to my lab. For the last week, I’ve been moving data from the old nodes to the new nodes. Today enough data had been moved to shut down one old node completely and transfer its hard drives into the new machines. These are my notes on cleaning the drive partitions, preparing the flash device partitions, and adding the OSDs to the new cluster.

Wipe The Drives

I shut down the old node and pulled the hardware without removing any data from the old drives, just in case there was a need to restore something to the old cluster. Luckily that was not the case, and I moved forward with wiping the drives using the following command.

root@titan:~# wipefs -a /dev/sdc
/dev/sdc: 8 bytes were erased at offset 0x00000218 (LVM2_member): 4c 56 4d 32 20 30 30 31

Check For LVM Related Data

Some of my old drives were already using LVM and BlueStore. If you try to prepare an old drive that still holds any PV (physical volume) or LV (logical volume) data, the ceph-volume prepare command will fail with something similar to this:

root@europa:~# ceph-volume lvm prepare --bluestore --dmcrypt --data /dev/sdc --block.wal /dev/fioa3 --block.db /dev/fioa4
...
 stderr: Physical volume '/dev/sdc' is already in volume group 'ceph-eebc4ef5-712b-4924-b70c-1df6269fc9a4'
  Unable to add physical volume '/dev/sdc' to volume group 'ceph-eebc4ef5-712b-4924-b70c-1df6269fc9a4'
  /dev/sdc: physical volume not initialized.
...
-->  RuntimeError: command returned non-zero exit status: 5

Remove LVM Related Data

When you need to remove LVM data from a drive, pvdisplay (to get the VG name) and vgremove are the easiest way to solve the problem. Make sure you are looking at the correct device; I shortened the output below.

root@europa:~# pvdisplay
...
  PV Name               /dev/sdc
  VG Name               ceph-eebc4ef5-712b-4924-b70c-1df6269fc9a4
  PV Size               <7.28 TiB / not usable <1.34 MiB
  Allocatable           yes (but full)
  PE Size               4.00 MiB
  Total PE              1907721
  Free PE               0
  Allocated PE          1907721
  PV UUID               LIe071-C7gV-q1tq-iAAb-3V3p-ZA3i-3VEJZX

Then remove the volume group and its logical volume using the following, confirming at each prompt that you want to remove them.

root@europa:~# vgremove ceph-eebc4ef5-712b-4924-b70c-1df6269fc9a4
Do you really want to remove volume group "ceph-eebc4ef5-712b-4924-b70c-1df6269fc9a4" containing 1 logical volumes? [y/n]: y
Do you really want to remove and DISCARD active logical volume ceph-eebc4ef5-712b-4924-b70c-1df6269fc9a4/osd-block-404a4208-0d30-4b9a-a7a1-87a1898e924b? [y/n]: y
  Logical volume "osd-block-404a4208-0d30-4b9a-a7a1-87a1898e924b" successfully removed
  Volume group "ceph-eebc4ef5-712b-4924-b70c-1df6269fc9a4" successfully removed
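
To avoid picking the wrong volume group by eye, the lookup can be scripted. The following is a minimal sketch that parses pvdisplay output in the format shown above; the helper name `vg_for_pv` is hypothetical, and it only prints the VG name, leaving the vgremove step to you.

```shell
#!/usr/bin/env bash
# Hypothetical helper: reads `pvdisplay` output on stdin and prints the
# VG name recorded for the given PV. Prints nothing if the PV is not
# listed or has no volume group.
vg_for_pv() {
    awk -v pv="$1" '
        $1 == "PV" && $2 == "Name" { current = $3 }
        $1 == "VG" && $2 == "Name" && current == pv && $3 != "" { print $3 }
    '
}

# Example (not executed here):
# pvdisplay | vg_for_pv /dev/sdc
```

LVM can also report this directly with `pvs --noheadings -o vg_name /dev/sdc`, which avoids parsing human-readable output.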

Prepare the WAL and DB Devices

I was lucky enough to get my hands on some cheap Fusion IOMemory devices. These are EOL (End of Life), so using them in a production cluster would not be recommended. That warning aside, these drives are awesome and are sized just about right for my cluster. I used gdisk to prepare new partitions: 1GB for the WAL portion (partition 3) and 80GB for the DB (metadata) portion (partition 4, roughly 10% of the storage device).

root@ganymede:~# gdisk /dev/fioa
GPT fdisk (gdisk) version 1.0.3

Partition table scan:
  MBR: protective
  BSD: not present
  APM: not present
  GPT: present

Found valid GPT with protective MBR; using GPT.

Command (? for help): n
Partition number (3-128, default 3):
First sector (6-244140619, default = 20971776) or {+-}size{KMGTP}:
Last sector (20971776-244140619, default = 244140619) or {+-}size{KMGTP}: +1G
Current type is 'Linux filesystem'
Hex code or GUID (L to show codes, Enter = 8300):
Changed type of partition to 'Linux filesystem'

Command (? for help): n
Partition number (4-128, default 4):
First sector (6-244140619, default = 21233920) or {+-}size{KMGTP}:
Last sector (21233920-244140619, default = 244140619) or {+-}size{KMGTP}: +80G
Current type is 'Linux filesystem'
Hex code or GUID (L to show codes, Enter = 8300):
Changed type of partition to 'Linux filesystem'

Command (? for help): x

Expert command (? for help): c
Partition number (1-4): 3
Enter the partition's new unique GUID ('R' to randomize): R
New GUID is 302BDE02-F625-4B33-80F5-5EE0254AADB9

Expert command (? for help): c
Partition number (1-4): 4
Enter the partition's new unique GUID ('R' to randomize): R
New GUID is 2F4EF305-A7BA-42E0-B690-3D3CDCF28B29

Expert command (? for help): w

Final checks complete. About to write GPT data. THIS WILL OVERWRITE EXISTING
PARTITIONS!!

Do you want to proceed? (Y/N): y
OK; writing new GUID partition table (GPT) to /dev/fioa.
Warning: The kernel is still using the old partition table.
The new table will be used at the next reboot or after you
run partprobe(8) or kpartx(8)
The operation has completed successfully.
root@ganymede:~# partprobe /dev/fioa

A quick note about the above. Notice that I dropped into expert (x) mode and set a random GUID (c, then R) on each of the new partitions. Be sure to run partprobe after you finish adding the new partitions and their new GUIDs.
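
The interactive session above can probably be reproduced non-interactively with sgdisk, the scriptable companion to gdisk. The sketch below is a dry run that only prints the commands. The helper name `print_partition_cmds` is hypothetical, and it assumes that a start value of 0 lets sgdisk pick the start of the largest free block, which may not match the exact sectors gdisk chose.

```shell
#!/usr/bin/env bash
# Dry run: print the sgdisk/partprobe commands that would add partitions
# with random unique GUIDs. Nothing is executed.
# Usage: print_partition_cmds <device> <partnum:size>...
print_partition_cmds() {
    local dev=$1 spec num size
    shift
    for spec in "$@"; do
        num=${spec%%:*}    # partition number, e.g. 3
        size=${spec#*:}    # partition size, e.g. 1G
        echo "sgdisk --new=${num}:0:+${size} --partition-guid=${num}:R $dev"
    done
    echo "partprobe $dev"
}

# Same layout as the gdisk session: partition 3 at 1G, partition 4 at 80G.
print_partition_cmds /dev/fioa 3:1G 4:80G
```

Printing the commands first gives you a chance to check partition numbers and sizes before anything touches the partition table.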

Prepare and Activate the OSD

At this point all you should have to do is prepare the OSD.

root@ganymede:~# ceph-volume lvm prepare --bluestore --dmcrypt --data /dev/sdc --block.wal /dev/fioa3 --block.db /dev/fioa4
...
--> ceph-volume lvm prepare successful for: /dev/sdc

Then activate the OSD.

root@ganymede:~# ceph-volume lvm activate --all
...
--> ceph-volume lvm activate successful for osd ID: 4

Mounting CephFS From Multiple Clusters to a Single Machine using FUSE

For my new homelab cluster I’ve built up a fresh Ceph filesystem to store certain chunks of my data, and I found the need to mount both the new and the old filesystem on one of my nodes. Normally I use ceph-fuse through /etc/fstab, so I simply modified it as follows.

root@storage:~# grep fuse /etc/fstab
none	/mnt/storage/ceph	fuse.ceph	ceph.id=admin,ceph.conf=/etc/ceph/ceph.conf,_netdev,defaults  0 0
none	/mnt/storage/ceph-old	fuse.ceph	ceph.id=admin,ceph.conf=/etc/ceph-old/ceph.conf,_netdev,defaults  0 0

The /etc/ceph-old/ directory is a copy of my config files from the older cluster. In the /etc/ceph-old/ceph.conf file I added the following, since the keyring for that cluster is not in the default path.

[client.admin]
keyring = /etc/ceph-old/ceph.client.admin.keyring

Any time the ceph.conf from the old cluster is used, so is the old keyring, and the cluster mounts up just fine.

Filesystem     Type            Size  Used Avail Use% Mounted on
ceph-fuse      fuse.ceph-fuse  100T   91T  9.4T  91% /mnt/storage/ceph-old
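
If you end up with more than two clusters, the fstab entries above follow a simple pattern that can be generated. This is a minimal sketch; the helper name `ceph_fuse_fstab_line` is hypothetical, and it assumes each cluster keeps its ceph.conf in its own directory, as /etc/ceph-old does here.

```shell
#!/usr/bin/env bash
# Print an /etc/fstab entry for a ceph-fuse mount that uses the ceph.conf
# from a given config directory. The client id defaults to "admin".
# Usage: ceph_fuse_fstab_line <conf-dir> <mount-point> [client-id]
ceph_fuse_fstab_line() {
    local conf_dir=$1 mount_point=$2 id=${3:-admin}
    printf 'none\t%s\tfuse.ceph\tceph.id=%s,ceph.conf=%s/ceph.conf,_netdev,defaults  0 0\n' \
        "$mount_point" "$id" "$conf_dir"
}

ceph_fuse_fstab_line /etc/ceph-old /mnt/storage/ceph-old
```

The output is meant to be reviewed and appended to /etc/fstab by hand rather than written there automatically.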

Libvirtd: Using RBD for a CDROM Device

Mostly a note to self but others may find this snippet useful.
I use the following to install from a disk image stored in RBD. Make sure to fill in your own client username, secret UUID, and monitor addresses.

    <disk type='network' device='cdrom'>
      <driver name='qemu' type='raw'/>
      <auth username='libvirt'>
        <secret type='ceph' uuid='a487c228-159b-4197-8be9-8e0e0d2b8bd4'/>
      </auth>
      <source protocol='rbd' name='rados-pool/diskimage.iso'>
        <host name='monitor-1.address' port='6789'/>
        <host name='monitor-2.address' port='6789'/>
        <host name='monitor-3.address' port='6789'/>
      </source>
      <target dev='hda' bus='ide'/>
      <readonly/>
      <alias name='ide0-0-0'/>
      <address type='drive' controller='0' bus='0' target='0' unit='0'/>
    </disk>
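
The secret UUID referenced in the auth element has to exist in libvirt before the disk can be attached. The following is a minimal sketch of how that secret might be created, assuming a Ceph client named client.libvirt as in the snippet above. The helper name `ceph_secret_xml` and the file name secret.xml are hypothetical, and the virsh and ceph commands are left commented out.

```shell
#!/usr/bin/env bash
# Print a libvirt secret definition for a Ceph client key.
# Usage: ceph_secret_xml <uuid> <client-name>
ceph_secret_xml() {
    local uuid=$1 client=$2
    cat <<EOF
<secret ephemeral='no' private='no'>
  <uuid>${uuid}</uuid>
  <usage type='ceph'>
    <name>client.${client} secret</name>
  </usage>
</secret>
EOF
}

# Example (not executed here):
# ceph_secret_xml a487c228-159b-4197-8be9-8e0e0d2b8bd4 libvirt > secret.xml
# virsh secret-define --file secret.xml
# virsh secret-set-value --secret a487c228-159b-4197-8be9-8e0e0d2b8bd4 \
#     --base64 "$(ceph auth get-key client.libvirt)"
```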