Louwrentius

My Ceph Test Cluster Based on Raspberry Pi's and HP MicroServers

Sun 27 January 2019
Introduction

To learn more about Ceph, I've built myself a Ceph cluster based on actual hardware. In this blog post I'll discuss the cluster in more detail and I've also included (fio) benchmark results.

This is my test Ceph cluster:

The cluster consists of the following components:
```
 3 x Raspberry Pi 3 Model B+ as Ceph monitors
 4 x HP MicroServer as OSD nodes (3 x Gen8 + 1 x Gen10)
 4 x 4 x 1 TB drives for storage (16 TB raw)
 3 x 1 x 250 GB SSD (750 GB raw)
 2 x 5-port Netgear switches for Ceph backend network (bonding)
```
Monitors: Raspberry Pi 3 Model B+

I've done some work getting Ceph compiled on a Raspberry Pi 3 Model B+ running Raspbian. I'm using three Raspberry Pi's as Ceph monitor nodes. The Pi boards don't break a sweat with this small cluster setup.

Note: Raspberry Pi's are not an ideal choice as a monitor node because Ceph Monitors write data (probably the cluster state) to disk every few seconds. This will wear out the SD card eventually.

Storage nodes: HP MicroServer

The storage nodes are based on four HP MicroServers. I really like these small boxes: they are sturdy, contain server-grade components, including ECC memory, and have room for four internal 3.5" hard drives. You can also install 2.5" hard drives or SSDs.

For more info on the Gen8 and the Gen10 click on their links.

Unfortunately the Gen8 servers are no longer made. The replacement, the Gen10 model, lacks IPMI/iLO and is also much more expensive (in Europe at least).

CPU and RAM

All HP MicroServers have a dual-core CPU. The Gen8 servers have 10GB RAM and the Gen10 server has 12GB RAM. I simply added an 8GB ECC memory module to each server; the Gen10 comes with 4GB and the Gen8 came with only 2GB, which explains the difference.

Boot drive

The systems all have an (old) internal 2.5" laptop HDD connected to the internal USB 2.0 header using a USB enclosure.

Ceph OSD HDD

All servers are fitted with four (old) 1TB 7200 RPM 3.5" hard drives, so the entire cluster contains 16 x 1TB drives.

Ceph OSD SSD

There is a fifth SATA connector on the motherboard, meant for an optional optical drive, which I have no use for and which is not included with the servers.

I use this SATA connector in the Gen8 MicroServers to attach a Crucial 250GB SSD, which is then tucked away at the top, where the optical drive would sit. So the Gen8 servers have an SSD installed which the Gen10 is lacking.

The entire cluster thus has 3 x 250GB SSDs installed.

Networking

All servers have two 1Gbit network cards on-board and a third one installed in one of the half-height PCIe slots¹.

The half-height PCIe NICs connect the Microservers to the public network. The internal gigabit NICs are configured in a bond (round-robin) and connected to two 5-port Netgear gigabit switches. This is the cluster network or the backend network Ceph uses for replicating data between the storage nodes.

You may notice that the first onboard NIC of each server is connected to the top switch and the second one is connected to the bottom switch. This is necessary because Linux round-robin bonding requires either separate VLANs for each NIC or, as in this case, separate switches.

Benchmarks

Benchmark conditions
- The tests ran on a physical Ceph client based on an older dual-core CPU and 8GB of RAM. This machine was connected to the cluster with a single gigabit network card.
- I've mapped RBD block devices from the HDD pool and the SSD pool on this machine for benchmarking.
- All tests have been performed on the raw /dev/rbd0 device, not on any file or filesystem.
- The pools use replication with a minimal copy count of 1 and a maximum of 3.
- All benchmarks have been performed with FIO.
- All benchmarks used random 4K reads/writes.
```
    NAME     ID     USED        %USED     MAX AVAIL     OBJECTS
    hdd      36     1.47TiB     22.64       5.03TiB      396434
    ssd      38      200GiB     90.92       20.0GiB       51204
```
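The benchmark conditions above roughly translate into fio invocations like the ones generated below. This is only a sketch and not necessarily the exact job definition behind the graphs: the I/O engine, the runtime, the job names and the small matrix of queue depths and job counts are all assumptions. The helper only prints the commands, so nothing touches the device until you run a line yourself. Be careful with randwrite, as it overwrites whatever is on the RBD device.

```shell
#!/bin/sh
# Print (not run) a fio command for a 4K random I/O test on a raw device.
# Assumptions: libaio engine, 60 second runtime, job name derived from the arguments.
fio_cmd() {
    rw=$1                # randread or randwrite
    iodepth=$2           # I/O queue depth per job
    numjobs=$3           # number of simultaneous jobs
    dev=${4:-/dev/rbd0}  # raw RBD device, as used in this post
    echo "fio --name=${rw}-qd${iodepth}-j${numjobs} --filename=${dev}" \
         "--rw=${rw} --bs=4k --ioengine=libaio --direct=1" \
         "--iodepth=${iodepth} --numjobs=${numjobs}" \
         "--runtime=60 --time_based --group_reporting"
}

# Dry run over a small, hypothetical matrix of queue depths and job counts.
for qd in 1 4 16; do
    for jobs in 1 4; do
        fio_cmd randread "$qd" "$jobs"
    done
done
```

Printing the commands first makes it easy to review the whole matrix before letting it loose on a block device.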
Benchmark SSD


Benchmark HDD

Benchmark evaluation

The random read performance of the hard drives seems unrealistic at higher queue depths and numbers of simultaneous jobs. This performance cannot come purely from the disks: 16 hard drives with maybe 70 random IOPS each can only sustain about 1120 random IOPS.
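The back-of-the-envelope ceiling used here can be written down as a tiny calculation. The 70 IOPS per drive is the same rough estimate as in the text, not a measured value, and the function name is made up for this sketch.

```shell
#!/bin/sh
# Upper bound on sustained random IOPS: number of drives x IOPS per drive.
random_iops_ceiling() {
    drives=$1
    iops_per_drive=$2
    echo $(( drives * iops_per_drive ))
}

random_iops_ceiling 16 70   # 16 drives at roughly 70 random IOPS each: prints 1120
```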

I cannot explain why I get these numbers. If anybody has a suggestion, feel free to comment/respond. Maybe the total of 42GB of memory across the cluster acts as some kind of cache.

Another interesting observation is that a low number of threads and a small IO queue depth results in fairly poor performance, both for SSD and HDD media.

Especially the performance of the SSD pool is poor with a low IO queue depth. A probable cause is that these SSDs are consumer-grade and don't perform well with low queue depth workloads.

I find it interesting that even over a single 1Gbit link, the SSD-backed pool is able to sustain 20K+ IOPS at higher queue depths and a larger number of threads.

The small number of storage nodes and the low number of OSDs per node don't make this setup ideal, but it does seem to perform fairly decently, considering the hardware involved.
1. You may notice that the Pi's are missing in the picture because this is an older picture when I was running the monitors as virtual machines on hardware not seen in the picture. ↩
Tagged as : Ceph

Read and Post Comments
Compiling Ceph on the Raspberry Pi 3B+ (Armhf) Using Clang/LLVM

Sat 10 November 2018
UPDATE 2019 / 2020

There are official ARM64 binaries of Ceph that you can run on a 64-bit version of Ubuntu 18.04.

Important: I consider this page obsolete. I will keep it up for transparency's sake.

Introduction

In this blog post I'll show you how to compile Ceph Luminous for the Raspberry Pi 3B+.

If you follow the instructions below you can compile Ceph on Raspbian. A note of warning: we will compile Ceph on the Raspberry Pi itself which takes a lot of time.

Ubuntu has packages for Ceph on armhf but I was never able to get Ubuntu working properly on the Raspberry Pi 3B+. Maybe that's just me and I did something wrong. Using existing Ceph packages on Ubuntu would probably be the fastest way to get up and running on the Raspberry Pi if it works for you.

This is my test Ceph cluster:
```
 3 x Raspberry Pi 3B+ as Ceph monitors. 
 4 x HP Microserver as OSD nodes.
 4 x 4 x 1 TB drives for storage (16 TB raw)
 3 x 1 x 250 GB SSD (750 GB raw)
 2 x 5-port Netgear switches for Ceph backend network (bonding)
```
For the impatient

If you just want the packages you can download this file and you'll get a set of .deb files which you need to install on your Raspberry Pi.

SECURITY WARNING: these packages are created by me, an unknown, untrusted person on the internet. As a general rule you should not download and install these packages as they could be malicious for all you know. If you want to be safe, compile Ceph yourself.

Skip to the section about installing the packages at the end for further installation instructions.

The problem with compiling Ceph for armhf

There are no armhf packages for Ceph because if you try to compile Ceph on armhf the compiler (gcc) will run out of virtual memory (about three gigabytes).

The solution

Daniel Glaser discovered that he could actually compile Ceph on armhf by using Clang/LLVM as the C++ compiler. This compiler seems to use less memory and thus stays within the 3 GB memory boundary. This is why he and I were able to compile Ceph.

How to compile Ceph for armhf - preparation

The challenge: one gigabyte of memory

The Raspberry Pi 3B+ has only one gigabyte of memory but we need more. The only way to add memory is to use swap on disk, as far as I know.

If you use storage as a substitute for RAM, you need fast storage, so it's really recommended to use an external SSD drive that you connect through USB. You also need sufficient storage; I'd recommend 20+ GB.

SD memory cards are not up to the task of being used as swap. You'll wear them out prematurely and performance is abysmal. You should really use an external SSD.

Preparing the external SSD
1. Attach the SSD drive to the Raspberry Pi with USB.
2. The SSD will probably show up as '/dev/sda'.
3. mkfs.xfs -f /dev/sda ( this will erase all contents of the SSD ).
4. mkdir /mnt/ssd
5. mount /dev/sda /mnt/ssd
Creating and activating swap
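1. cd /mnt/ssd
2. dd if=/dev/zero of=swap.dd bs=1M count=5000
3. chmod 600 swap.dd
4. mkswap swap.dd ( swapon refuses a file that has not been initialised as swap )
5. swapon /mnt/ssd/swap.dd
6. swapoff /var/swap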
By default, Raspbian configures a 100 MB swap file on /var/swap. In order to increase performance and protect the SD card from wearing out, please don't forget this last step to disable this swap file on the SD card.

Extra software

I would recommend installing 'htop' for real-time monitoring of cpu, memory and swap usage if you like to do so.
1. apt-get install htop
How to compile Ceph for armhf - building

Installing an alternative C++ compiler (Clang/LLVM)

As part of Daniel's instructions, you need to compile and install Clang/LLVM. I followed his instructions to the letter; I have not tested the Clang/LLVM packages made available through apt.

Compiling Clang/LLVM takes a lot of time. It took 8 hours to compile LLVM/Clang on a Raspberry Pi 3B+ with make -j3 to limit memory usage.
```
real    493m38.472s
user    1223m39.063s
sys     45m45.748s
```
I'll reproduce the steps from Daniel here:
```
apt update
apt install -y build-essential ca-certificates vim git 
apt install libcunit1-dev libcurl4-openssl-dev python-bcrypt python-tox python-coverage

cd /mnt/ssd
mkdir git && cd git
git clone https://github.com/llvm-mirror/llvm.git
cd llvm/tools
git clone https://github.com/llvm-mirror/clang.git
git clone https://github.com/llvm-mirror/lld.git
cd /tmp
mkdir llvm-build && cd llvm-build
cmake -G "Unix Makefiles" -DCMAKE_BUILD_TYPE=Release -DLLVM_TARGETS_TO_BUILD=ARM /mnt/ssd/git/llvm/
make -j3
make install
update-alternatives --install /usr/bin/cc cc /usr/local/bin/clang 100
update-alternatives --install /usr/bin/c++ c++ /usr/local/bin/clang++ 100
update-alternatives --install /usr/bin/cpp cpp /usr/local/bin/clang-cpp 100
```
You may choose to build in some other directory, maybe on the SSD itself. I'm not sure if that makes a big difference. Be careful when using /tmp as all contents are lost after a reboot.

Obtaining Ceph

There are two options:
1. Clone my Luminous fork containing the branch 'ceph-on-arm', which incorporates all the 'fixed' files that make Ceph build with Clang/LLVM.
2. Clone the official Ceph repo and use the luminous branch. Next, you edit all the relevant files and make the changes yourself. Here you can find a list of all the files and the changes made to them.
I would recommend the first option; just clone my fork like this:
```
cd /mnt/ssd
git clone https://github.com/louwrentius/ceph
cd ceph
git checkout ceph-on-arm
git reset --hard
git clean -dxf
git submodule update --init --recursive
```
Now we first need to install a lot of dependencies on the Raspberry Pi before we can build Ceph.
```
./install-deps.sh
```
This will take some time as a ton of packages will be installed. Once this is done we are ready to compile Ceph itself.

Building Ceph

So you understand what you are getting into: it took me about 12 hours to compile Ceph on a Raspberry Pi 3B+.
```
real    717m31.457s
user    1319m50.438s
sys     58m7.549s
```
This is the command to run:
```
./make-debs.sh
```
If you want to monitor cpu and memory usage, you can use 'htop' to do so.

If for some reason the compile process fails, you may have to restart compiling Ceph after you have made some adjustments:

(you may have to adjust the folder name to match your ceph version)
```
cd /tmp/release/Raspbian/WORKDIR/ceph-12.2.9-39-gd51dfb14f4
< edit the relevant files here >
dpkg-buildpackage -j3 -us -uc -nc
```
Once this process is done, you will find a lot of .deb packages in your /tmp/release/Raspbian/WORKDIR folder.

Warning: If you do use /tmp, the first thing to do is to copy all .deb files to a safe location, because if you reboot your Pi, you lose 12 hours of work.
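A small helper like the one below can copy the packages out of /tmp right after the build finishes. It is only a sketch: the function name is made up, and the source and destination paths in the example are simply the ones used in this post, so adjust them to your setup.

```shell
#!/bin/sh
# Copy all .deb files from a build directory to a location that survives a reboot.
save_debs() {
    src=$1
    dst=$2
    mkdir -p "$dst" || return 1
    # Expand the glob into the positional parameters and fail loudly
    # if the build directory contains no packages at all.
    set -- "$src"/*.deb
    [ -e "$1" ] || { echo "no .deb files found in $src" >&2; return 1; }
    cp "$@" "$dst"/
}

# Example (paths from this post):
# save_debs /tmp/release/Raspbian/WORKDIR /deb
```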

Assuming that you copied all .deb files to a folder like '/deb' you just created, this is how you install these packages:
```
cd /deb
dpkg --install *.deb
apt-get install --fix-missing
apt --fix-broken install
```
This is a bit ugly but it worked fine for me.

You can now just copy over all the .deb files to other Raspberry Pi's and install Ceph on them too.

Now you are done and you can run Ceph on a Raspberry Pi 3B+.

Ceph monitors may wear out the SD card

Important: Running a Ceph monitor node on a Raspberry Pi is not ideal. The core issue is that the Ceph monitor process writes data every few seconds to files within /var/lib/ceph and this may wear out the SD card prematurely. The solution would be to use an external hard drive connected through USB, or a regular SSD, which is way more resilient to writes than an SD card.
Tagged as : Ceph

Understanding Ceph: Open-Source Scalable Storage

Sun 19 August 2018
Introduction

In this blog post I will try to explain why I believe Ceph is such an interesting storage solution. After you have finished reading this blog post, you should have a good high-level overview of Ceph.

I've written this blog post purely because I'm a storage enthusiast and I find Ceph an interesting technology.

What is Ceph?

Ceph is a software-defined storage solution that can scale both in performance and capacity. Ceph is used to build multi-petabyte storage clusters.

For example, CERN has built a 65 Petabyte Ceph storage cluster. I hope that number grabs your attention. I think it's amazing.

The basic building block of a Ceph storage cluster is the storage node. These storage nodes are just commodity (COTS) servers containing a lot of hard drives and/or flash storage.

Example of a storage node

Ceph is meant to scale. And you scale by adding additional storage nodes. You will need multiple servers to satisfy your capacity, performance and resiliency requirements. And as you expand the cluster with extra storage nodes, capacity, performance and resiliency (if needed) will all increase at the same time.

It's that simple.

You don't need to start with petabytes of storage. You can actually start very small, with just a few storage nodes and expand as your needs increase.

I want to touch upon a technical detail because it illustrates the mindset surrounding Ceph. With Ceph, you don't even need a RAID controller anymore; a 'dumb' HBA is sufficient. This is possible because Ceph manages redundancy in software. A Ceph storage node at its core is more like a JBOD. The hardware is simple and 'dumb'; the intelligence resides entirely in software.

This means that the risk of hardware vendor lock-in is quite mitigated. You are not tied to any particular proprietary hardware.

What makes Ceph special?

At the heart of the Ceph storage cluster is the CRUSH algorithm, developed by Sage Weil, the co-creator of Ceph.

The CRUSH algorithm allows storage clients to calculate which storage node needs to be contacted for retrieving or storing data. The storage client can - on its own - determine what to do with data or where to get it.

So to reiterate: given a particular state of the storage cluster, the client can calculate which storage node to contact for storage or retrieval of data.
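To make the idea concrete, here is a toy illustration of client-side placement. It is emphatically not the CRUSH algorithm: it just hashes the object name with cksum and takes the result modulo the number of nodes. The function name, the object name and the node numbering are invented for this example. The only point is that any client with the same input computes the same answer without asking anybody.

```shell
#!/bin/sh
# Toy placement: map an object name to a node index in the range 0..(nodes-1).
# NOT CRUSH; a plain hash-modulo stand-in for illustration only.
place_object() {
    name=$1
    nodes=$2
    # cksum prints "<crc> <size>" for data read from stdin; keep only the CRC.
    hash=$(printf '%s' "$name" | cksum | cut -d' ' -f1)
    echo $(( hash % nodes ))
}

# Every client that runs this with the same arguments gets the same node.
place_object "vm-disk-001" 4
```

A naive modulo scheme like this would reshuffle most data whenever a node is added or removed. The real CRUSH algorithm is designed to limit that data movement and to take the cluster topology into account.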

Why is this so special?

Because there is no centralised 'registry' that keeps track of the location of data on the cluster (metadata). Such a centralised registry can become:
- a performance bottleneck, preventing further expansion
- a single-point-of-failure
Ceph does away with this concept of a centralised registry for data storage and retrieval. This is why Ceph can scale in capacity and performance while assuring availability.

At the core of the CRUSH algorithm is the CRUSH map. That map contains information about the storage nodes in the cluster. That map is the basis for the calculations the storage client needs to perform in order to decide which storage node to contact.

This CRUSH map is distributed across the cluster from a special server: the 'monitor' node. Regardless of the size of the Ceph storage cluster, you typically need just three (3) monitor nodes for the whole cluster. Those nodes are contacted by both the storage nodes and the storage clients.

So Ceph does have some kind of centralised 'registry' but it serves a totally different purpose. It only keeps track of the state of the cluster, a task that is way easier to scale than running a 'registry' for data storage/retrieval itself.

It's important to keep in mind that the Ceph monitor node does not store or process any metadata. It only keeps track of the CRUSH map for both clients and individual storage nodes. Data always flows directly from the storage node towards the client and vice versa.

Ceph Scalability

A storage client will contact the appropriate storage node directly to store or retrieve data. There are no components in between, except for the network, which you will need to size accordingly¹.
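As a rough sizing aid, the calculation below estimates how many network links a single storage node could fill with sequential traffic. The 150 MB/s per hard drive and the 36-drive chassis are the assumptions from the footnote; the function name is made up for this sketch and protocol overhead is ignored.

```shell
#!/bin/sh
# Estimate how many links of a given speed one storage node can fill.
# Arguments: number of drives, MB/s per drive, link speed in Mbit/s.
links_needed() {
    drives=$1
    mb_per_sec=$2
    link_mbit=$3
    total_mbit=$(( drives * mb_per_sec * 8 ))             # MB/s -> Mbit/s
    echo $(( (total_mbit + link_mbit - 1) / link_mbit ))  # round up to whole links
}

links_needed 36 150 10000   # 36 drives at 150 MB/s on 10Gbit links: prints 5
links_needed 36 75 10000    # the same node at half speed: prints 3
```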

Because there are no intermediate components or proxies that could potentially create a bottleneck, a Ceph cluster can really scale horizontally in both capacity and performance.

And while scaling storage and performance, data is protected by redundancy.

Ceph redundancy

Replication

In a nutshell, Ceph does 'network' RAID-1 (replication) or 'network' RAID-5/6 (erasure encoding). What do I mean by this? Imagine a RAID array but now also imagine that instead of the array consisting of hard drives, it consists of entire servers.

That's what Ceph does: it distributes the data across multiple storage nodes and assures that a copy of a piece of data is never stored on the same storage node as the original.

This is what happens if a client writes two blocks of data:

Notice how a copy of the data block is always replicated to other hardware.

Ceph goes beyond the capabilities of regular RAID. You can configure more than one replica. You are not confined to RAID-1 with just one backup copy of your data². The only downside of storing more replicas is the storage cost.

You may decide that data availability is so important that you may have to sacrifice space and absorb the cost. Because at scale, a simple RAID-1 replication scheme may not sufficiently cover the risk and impact of hardware failure anymore. What if two storage nodes in the cluster die?

This example or consideration has nothing to do with Ceph, it's a reality you face when you operate at scale.

RAID-1 or the Ceph equivalent 'replication' offers the best overall performance but as with 'regular' RAID-1, it is not very storage space efficient. Especially if you need more than one replica of the data to achieve the level of redundancy you need.

This is why we used RAID-5 and RAID-6 in the past as an alternative to RAID-1 or RAID-10. Parity RAID assures redundancy but with much less storage overhead at the cost of storage performance (mostly write performance). Ceph uses 'erasure encoding' to achieve a similar result.

Erasure Encoding

With Ceph you are not confined to the limits of RAID-5/RAID-6 with just one or two 'redundant disks' (in Ceph's case storage nodes). Ceph allows you to use Erasure Encoding, a technique that lets you tell Ceph this:

"I want you to chop up my data in 8 data segments and 4 parity segments"

These segments are then scattered across the storage nodes and this allows you to lose up to four entire hosts before you hit trouble. You will have only 33% storage overhead for redundancy instead of the 50% (or even more) you may face using replication, depending on how many copies you want.

This example does assume that you have at least 8 + 4 = 12 storage nodes. But any scheme will do: you could use 6 data segments + 2 parity segments (similar to RAID-6) with only 8 hosts. I think you get the idea.
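The overhead figures are easy to check. The sketch below computes the share of raw capacity that goes to redundancy for an erasure-coded pool (k data segments plus m parity segments) and for a replicated pool. The function names are made up for this example and integer division truncates the result.

```shell
#!/bin/sh
# Percentage of raw capacity spent on redundancy (integer, truncated).
ec_overhead_pct() {       # k data segments, m parity segments
    k=$1
    m=$2
    echo $(( m * 100 / (k + m) ))
}
replica_overhead_pct() {  # total number of copies, including the original
    copies=$1
    echo $(( (copies - 1) * 100 / copies ))
}

ec_overhead_pct 8 4        # 8 + 4 scheme: prints 33
ec_overhead_pct 6 2        # 6 + 2 scheme, similar to RAID-6: prints 25
replica_overhead_pct 2     # one extra copy: prints 50
replica_overhead_pct 3     # two extra copies: prints 66
```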

Ceph failure domains

Ceph is datacenter-aware. What do I mean by that? Well, the CRUSH map can represent your physical datacenter topology, consisting of racks, rows, rooms, floors, datacenters and so on. You can fully customise your topology.

This allows you to create very clear data storage policies that Ceph will use to assure the cluster can tolerate failures across certain boundaries.

An example of a topology:

If you want, you can lose a whole rack, or even a whole row of racks, and the cluster could still be fully operational, although with reduced performance and capacity.

That much redundancy may cost so much storage that you may not want to employ it for all of your data. That's no problem. You can create multiple storage pools that each have their own protection level and thus cost.

How do you use Ceph?

Ceph at its core is an object storage solution. Librados is the library you can include within your software project to access Ceph storage natively. There are Librados implementations for the following programming languages:
- C(++)
- Java
- Python
- PHP
Many people are looking for more traditional storage solutions, like block storage for storing virtual machines, a POSIX compliant shared file system or S3/OpenStack Swift compatible object storage.

Ceph provides all those features in addition to its native object storage format.

I myself am mostly interested in block storage (RADOS Block Device, RBD) with the purpose of storing virtual machines. As Linux has native support for RBD, it makes total sense to use Ceph as a storage backend for OpenStack or plain KVM.

With very recent versions of Ceph, native support for iSCSI has been added to expose block storage to non-native clients like VMware or Windows. For the record, I have no personal experience with this feature (yet).

The Object Storage Daemon (OSD)

In this section we zoom in a little more on the technical details of Ceph.

If you read about Ceph, you read a lot about the OSD or object storage daemon. This is a service (daemon) that runs on the storage node. The OSD is the actual workhorse of Ceph: it serves the data from the hard drive or ingests it and stores it on the drive. The OSD also assures storage redundancy, by replicating data to other OSDs based on the CRUSH map.

To be precise: for every hard drive or solid state drive in the storage node, an OSD will be active. Does your storage node have 24 hard drives? Then it runs 24 OSDs.

And when a drive goes down, the OSD will go down too and the monitor nodes will redistribute an updated CRUSH map so the clients are aware and know where to get the data. The OSDs also respond to this update: because redundancy is lost, they may start to replicate non-redundant data to make it redundant again (across fewer nodes).

When the drive is replaced, the cluster will 'self-heal'. This means that the new drive will be filled with data once again to make sure data is spread evenly across all drives within the cluster.

So maybe it's interesting to realise that storage clients effectively directly talk to the OSDs that in turn talk to the individual hard drives. There aren't many components between the client and the data itself.

Closing words

I hope that this blog post has helped you understand how Ceph works and why it is so interesting. If you have any questions or feedback please feel free to comment or email me.
1. If you have a ton of high-volume sequential data storage traffic, you should realise that a single host with a ton of drives can easily saturate 10Gbit or theoretically even 40Gbit. I'm assuming 150 MB/s per hard drive. With 36 hard drives you would face 5.4 GB/s. Even if you would only run at half that speed, you would need to bond multiple 10Gbit interfaces to sustain this load. Imagine the requirements for your core network. But it really depends on your workload. You will never reach this kind of throughput with a ton of random I/O unless you are using SSDs, for instance. ↩
2. Please note that in production setups, it's the default to have a total of 3 instances of a data block. So that means 'the original' plus two extra copies. See also this link. Thanks to sep76 from Reddit for pointing out that the default is 3 instances of your data. ↩
Tagged as : Ceph
