Articles in the Storage category

  1. Compiling Ceph on the Raspberry Pi 3B+ (Armhf) Using Clang/LLVM

    Sat 10 November 2018

    UPDATE 2019 / 2020


    There are official ARM64 binaries of Ceph that you can run on a 64-bit version of Ubuntu 18.04.

    Important: I consider this page obsolete. I will keep it up for transparency's sake.


    Introduction

    In this blog post I'll show you how to compile Ceph Luminous for the Raspberry Pi 3B+.

    If you follow the instructions below you can compile Ceph on Raspbian. A word of warning: we will compile Ceph on the Raspberry Pi itself, which takes a lot of time.

    Ubuntu has packages for Ceph on armhf but I was never able to get Ubuntu working properly on the Raspberry Pi 3B+. Maybe that's just me and I did something wrong. Using existing Ceph packages on Ubuntu would probably be the fastest way to get up and running on the Raspberry Pi if it works for you.

    This is my test Ceph cluster:

    [Image: the Raspberry Pi Ceph test cluster]

     3 x Raspberry Pi 3B+ as Ceph monitors. 
     4 x HP Microserver as OSD nodes.
     4 x 4 x 1 TB drives for storage (16 TB raw)
     3 x 1 x 250 GB SSD (750 GB raw)
     2 x 5-port Netgear switches for Ceph backend network (bonding)
    

    For the impatient

    If you just want the packages you can download this file and you'll get a set of .deb files which you need to install on your Raspberry Pi.

    SECURITY WARNING: these packages are created by me, an unknown, untrusted person on the internet. As a general rule you should not download and install these packages as they could be malicious for all you know. If you want to be safe, compile Ceph yourself.

    Skip to the section about installing the packages at the end for further installation instructions.

    The problem with compiling Ceph for armhf

    There are no armhf packages for Ceph because if you try to compile Ceph on armhf, the compiler (gcc) runs out of virtual memory: a 32-bit process only has about three gigabytes of address space to work with.

    The solution

    Daniel Glaser discovered that he could compile Ceph on armhf by using Clang/LLVM as the C++ compiler. Clang seems to use less memory and thus stays within the 3 GB virtual memory boundary. This is why he and I were able to compile Ceph.

    How to compile Ceph for armhf - preparation

    The challenge: one gigabyte of memory

    The Raspberry Pi 3B+ has only one gigabyte of memory but we need more. The only way to add memory is to use swap on disk, as far as I know.

    If you use storage as a substitute for RAM, you need fast storage, so I really recommend an external SSD connected through USB. You also need sufficient free space; I'd recommend 20+ GB.

    SD memory cards are not up to the task of being used as swap. You'll wear them out prematurely and performance is abysmal. You should really use an external SSD.

    Preparing the external SSD

    Attach the SSD drive to the Raspberry Pi with USB
    
    1. The SSD will probably show up as '/dev/sda'.
    2. mkfs.xfs -f /dev/sda ( this will erase all contents of the SSD ).
    3. mkdir /mnt/ssd
    4. mount /dev/sda /mnt/ssd

    Creating and activating swap

    1. cd /mnt/ssd
    2. dd if=/dev/zero of=swap.dd bs=1M count=5000
    3. mkswap /mnt/ssd/swap.dd
    4. swapon /mnt/ssd/swap.dd
    5. swapoff /var/swap

    By default, Raspbian configures a 100 MB swap file on /var/swap. To improve performance and protect the SD card from wearing out, don't forget this last step, which disables the swap file on the SD card.
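    If you want these changes to survive a reboot, you also have to make them permanent. A minimal sketch, assuming Raspbian manages the SD-card swap file with the dphys-swapfile service and that the SSD is still /dev/sda:

    # permanently disable the swap file on the SD card
    dphys-swapfile swapoff
    systemctl disable dphys-swapfile

    # mount the SSD and activate the swap file on it at boot
    echo '/dev/sda         /mnt/ssd  xfs   defaults  0 0' >> /etc/fstab
    echo '/mnt/ssd/swap.dd none      swap  sw        0 0' >> /etc/fstab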

    Extra software

    I would recommend installing 'htop' for real-time monitoring of CPU, memory and swap usage.

    1. apt-get install htop

    How to compile Ceph for armhf - building

    Installing an alternative C++ compiler (Clang/LLVM)

    As part of Daniel's instructions, you need to compile and install Clang/LLVM. I followed his instructions to the letter; I have not tested the Clang/LLVM packages made available through apt.

    Compiling Clang/LLVM takes a lot of time. It took 8 hours to compile LLVM/Clang on a Raspberry Pi 3B+ with make -j3 to limit memory usage.

    real    493m38.472s
    user    1223m39.063s
    sys 45m45.748s
    

    I'll reproduce the steps from Daniel here:

    apt update
    apt install -y build-essential ca-certificates vim git 
    apt install libcunit1-dev libcurl4-openssl-dev python-bcrypt python-tox python-coverage
    
    cd /mnt/ssd
    mkdir git && cd git
    git clone https://github.com/llvm-mirror/llvm.git
    cd llvm/tools
    git clone https://github.com/llvm-mirror/clang.git
    git clone https://github.com/llvm-mirror/lld.git
    cd /tmp
    mkdir llvm-build && cd llvm-build
    cmake -G "Unix Makefiles" -DCMAKE_BUILD_TYPE=Release -DLLVM_TARGETS_TO_BUILD=ARM /mnt/ssd/git/llvm/
    make -j3
    make install
    update-alternatives --install /usr/bin/cc cc /usr/local/bin/clang 100
    update-alternatives --install /usr/bin/c++ c++ /usr/local/bin/clang++ 100
    update-alternatives --install /usr/bin/cpp cpp /usr/local/bin/clang-cpp 100
    

    You may choose to build in some other directory, maybe on the SSD itself. I'm not sure if that makes a big difference. Be careful when using /tmp, as all contents are lost after a reboot.
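    Before moving on to Ceph, it's worth checking that the alternatives actually point at Clang. A quick sanity check could look like this:

    update-alternatives --display cc
    cc --version      # should report clang, not gcc
    c++ --version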

    Obtaining Ceph

    There are two options:

    1. Clone my Luminous fork containing the branch 'ceph-on-arm', which incorporates all the 'fixed' files that make Ceph build with Clang/LLVM.

    2. Clone the official Ceph repo and use the luminous branch. Next, edit all the relevant files and make the changes yourself. Here you can find a list of all the files and the changes made to them.

    I would recommend the first option; just clone Ceph like this:

    cd /mnt/ssd
    git clone https://github.com/louwrentius/ceph
    cd ceph
    git checkout ceph-on-arm
    git reset --hard
    git clean -dxf
    git submodule update --init --recursive
    

    Now we first need to install a lot of dependencies on the Raspberry Pi before we can build Ceph.

    run ./install-deps.sh
    

    This will take some time as a ton of packages will be installed. Once this is done we are ready to compile Ceph itself.

    Building Ceph

    So to understand what you are getting into: it took me about 12 hours to compile Ceph on a Raspberry Pi 3B+.

    real    717m31.457s
    user    1319m50.438s
    sys 58m7.549s
    

    This is the command to run:

    ./make-debs.sh
    

    If you want to monitor cpu and memory usage, you can use 'htop' to do so.

    If for some reason the compile process fails, you may have to restart compiling Ceph after you have made some adjustments:

    (you may have to adjust the folder name to match your ceph version)

    cd /tmp/release/Raspbian/WORKDIR/ceph-12.2.9-39-gd51dfb14f4
    < edit the relevant files here >
    dpkg-buildpackage -j3 -us -uc -nc
    

    Once this process is done, you will find a lot of .deb packages in your /tmp/release/Raspbian/WORKDIR folder.

    Warning: if you do use /tmp, the first thing to do is to copy all .deb files to a safe location, because if you reboot your Pi, you lose 12 hours of work.

    Assuming that you copied all .deb files to a folder like '/deb' you just created, this is how you install these packages:

    dpkg --install *.deb
    apt-get install --fix-missing
    apt --fix-broken install
    

    This is a bit ugly but it worked fine for me.

    You can now just copy over all the .deb files to other Raspberry Pis and install Ceph on them too.

    Now you are done and you can run Ceph on a Raspberry Pi 3B+.
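    As a quick sanity check on each Pi, you can verify that the packages are installed and that the binaries run, for example:

    dpkg -l | grep ceph      # list the installed Ceph packages
    ceph --version           # should print a Luminous (12.2.x) version string
    ceph-mon --version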

    Ceph monitors may wear out the SD card

    Important: Running a Ceph monitor node on a Raspberry Pi is not ideal. The core issue is that the Ceph monitor process writes data every few seconds to files within /var/lib/ceph, and this may wear out the SD card prematurely. The solution is to keep that directory on an external hard drive or SSD connected through USB, which is far more resilient to writes than an SD card.
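    A minimal sketch of how you could move the monitor data to an external drive; the device name (/dev/sda) and the ceph-mon.target systemd unit are assumptions based on a default systemd-based Luminous install:

    systemctl stop ceph-mon.target            # stop the monitor first
    mkfs.xfs -f /dev/sda                      # this erases the external drive!
    mkdir -p /mnt/ssd
    mount /dev/sda /mnt/ssd
    mkdir /mnt/ssd/ceph
    cp -a /var/lib/ceph/. /mnt/ssd/ceph/      # copy the existing monitor data
    echo '/dev/sda       /mnt/ssd       xfs   defaults  0 0' >> /etc/fstab
    echo '/mnt/ssd/ceph  /var/lib/ceph  none  bind      0 0' >> /etc/fstab
    mount /var/lib/ceph                       # bind mount the SSD over /var/lib/ceph
    systemctl start ceph-mon.target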

    Tagged as : Ceph
  2. Understanding Ceph: Open-Source Scalable Storage

    Sun 19 August 2018

    Introduction

    In this blog post I will try to explain why I believe Ceph is such an interesting storage solution. After you have finished reading this blog post, you should have a good high-level overview of Ceph.

    I've written this blog post purely because I'm a storage enthusiast and I find Ceph interesting technology.

    What is Ceph?

    Ceph is a software-defined storage solution that can scale both in performance and capacity. Ceph is used to build multi-petabyte storage clusters.

    For example, CERN has built a 65 petabyte Ceph storage cluster. I hope that number grabs your attention. I think it's amazing.

    The basic building block of a Ceph storage cluster is the storage node. These storage nodes are just commodity (COTS) servers containing a lot of hard drives and/or flash storage.

    [Image: a storage chassis]

    Example of a storage node

    Ceph is meant to scale. And you scale by adding additional storage nodes. You will need multiple servers to satisfy your capacity, performance and resiliency requirements. And as you expand the cluster with extra storage nodes, capacity, performance and resiliency (if needed) will all increase at the same time.

    It's that simple.

    You don't need to start with petabytes of storage. You can actually start very small, with just a few storage nodes and expand as your needs increase.

    I want to touch upon a technical detail because it illustrates the mindset surrounding Ceph. With Ceph, you don't even need a RAID controller anymore; a 'dumb' HBA is sufficient. This is possible because Ceph manages redundancy in software. A Ceph storage node at its core is more like a JBOD. The hardware is simple and 'dumb'; all the intelligence resides in software.

    This means that the risk of hardware vendor lock-in is largely mitigated. You are not tied to any particular proprietary hardware.

    What makes Ceph special?

    At the heart of the Ceph storage cluster is the CRUSH algorithm, developed by Sage Weil, the co-creator of Ceph.

    The CRUSH algorithm allows storage clients to calculate which storage node needs to be contacted for retrieving or storing data. The storage client can - on its own - determine what to do with data or where to get it.

    So to reiterate: given a particular state of the storage cluster, the client can calculate which storage node to contact for storage or retrieval of data.

    Why is this so special?

    Because there is no centralised 'registry' that keeps track of the location of data on the cluster (metadata). Such a centralised registry can become:

    • a performance bottleneck, preventing further expansion
    • a single-point-of-failure

    Ceph does away with this concept of a centralised registry for data storage and retrieval. This is why Ceph can scale in capacity and performance while assuring availability.

    At the core of the CRUSH algorithm is the CRUSH map. That map contains information about the storage nodes in the cluster. That map is the basis for the calculations the storage client needs to perform in order to decide which storage node to contact.

    This CRUSH map is distributed across the cluster from a special server: the 'monitor' node. Regardless of the size of the Ceph storage cluster, you typically need just three (3) monitor nodes for the whole cluster. Those nodes are contacted by both the storage nodes and the storage clients.

    [Image: high-level overview of a Ceph cluster]

    So Ceph does have some kind of centralised 'registry' but it serves a totally different purpose. It only keeps track of the state of the cluster, a task that is way easier to scale than running a 'registry' for data storage/retrieval itself.

    It's important to keep in mind that the Ceph monitor node does not store or process any metadata. It only keeps track of the CRUSH map for both clients and individual storage nodes. Data always flows directly from the storage node towards the client and vice versa.
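    You can actually see this client-side placement calculation at work on any Ceph cluster. A small illustration (the pool and object names are made up for the example):

    ceph osd pool create testpool 64      # create a pool with 64 placement groups
    ceph osd map testpool someobject      # shows the placement group and the set of OSDs
                                          # that would store the object 'someobject'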

    Ceph Scalability

    A storage client will contact the appropriate storage node directly to store or retrieve data. There are no components in between, except for the network, which you will need to size accordingly [1].

    Because there are no intermediate components or proxies that could potentially create a bottleneck, a Ceph cluster can really scale horizontally in both capacity and performance.

    And while scaling storage and performance, data is protected by redundancy.

    Ceph redundancy

    Replication

    In a nutshell, Ceph does 'network' RAID-1 (replication) or 'network' RAID-5/6 (erasure encoding). What do I mean by this? Imagine a RAID array, but now imagine that instead of the array consisting of hard drives, it consists of entire servers.

    That's what Ceph does: it distributes the data across multiple storage nodes and assures that a copy of a piece of data is never stored on the same storage node as the original.

    This is what happens if a client writes two blocks of data:

    [Image: data blocks being replicated across storage nodes]

    Notice how a copy of the data block is always replicated to other hardware.

    Ceph goes beyond the capabilities of regular RAID. You can configure more than one replica. You are not confined to RAID-1 with just one backup copy of your data [2]. The only downside of storing more replicas is the storage cost.
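    The number of replicas is simply a property of a storage pool. A sketch, with a hypothetical pool name:

    ceph osd pool create mypool 128          # replicated pool with 128 placement groups
    ceph osd pool set mypool size 3          # keep three copies of every object
    ceph osd pool set mypool min_size 2      # keep serving I/O as long as two copies remain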

    You may decide that data availability is so important that you may have to sacrifice space and absorb the cost. Because at scale, a simple RAID-1 replication scheme may not sufficiently cover the risk and impact of hardware failure anymore. What if two storage nodes in the cluster die?

    This example or consideration has nothing to do with Ceph; it's a reality you face when you operate at scale.

    RAID-1 or the Ceph equivalent 'replication' offers the best overall performance but as with 'regular' RAID-1, it is not very storage space efficient. Especially if you need more than one replica of the data to achieve the level of redundancy you need.

    This is why we used RAID-5 and RAID-6 in the past as an alternative to RAID-1 or RAID-10. Parity RAID assures redundancy but with much less storage overhead at the cost of storage performance (mostly write performance). Ceph uses 'erasure encoding' to achieve a similar result.

    Erasure Encoding

    With Ceph you are not confined to the limits of RAID-5/RAID-6 with just one or two 'redundant disks' (in Ceph's case storage nodes). Ceph allows you to use Erasure Encoding, a technique that lets you tell Ceph this:

    "I want you to chop up my data in 8 data segments and 4 parity segments"

    [Image: data chopped into data segments and parity segments (erasure coding)]

    These segments are then scattered across the storage nodes, and this allows you to lose up to four entire hosts before you hit trouble. You will have only 33% storage overhead (of raw capacity) for redundancy, instead of the 50% or more you may face using replication, depending on how many copies you want.

    This example does assume that you have at least 8 + 4 = 12 storage nodes. But any scheme will do; you could do 6 data segments + 2 parity segments (similar to RAID-6) with only 8 hosts. I think you catch the idea.
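    In Ceph terms, the '8 data + 4 parity' scheme is an erasure code profile that you attach to a pool. A sketch, with hypothetical names:

    ceph osd erasure-code-profile set ec-8-4 k=8 m=4 crush-failure-domain=host
    ceph osd pool create ecpool 128 128 erasure ec-8-4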

    Ceph failure domains

    Ceph is datacenter-aware. What do I mean by that? Well, the CRUSH map can represent your physical datacenter topology, consisting of racks, rows, rooms, floors, datacenters and so on. You can fully customise your topology.

    This allows you to create very clear data storage policies that Ceph will use to assure that the cluster can tolerate failures across certain boundaries.

    An example of a topology:

    [Image: example of a datacenter topology]

    If you want, you can lose a whole rack, or even a whole row of racks, and the cluster could still be fully operational, although with reduced performance and capacity.

    That much redundancy may cost so much storage that you may not want to employ it for all of your data. That's no problem. You can create multiple storage pools that each have their own protection level and thus cost.
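    The topology is expressed as buckets in the CRUSH map, and you can build it up by hand if you want. A sketch with made-up rack and host names:

    ceph osd crush add-bucket rack1 rack             # define two racks
    ceph osd crush add-bucket rack2 rack
    ceph osd crush move rack1 root=default           # hang them under the default root
    ceph osd crush move rack2 root=default
    ceph osd crush move node01 rack=rack1            # place storage nodes in their racks
    ceph osd crush move node02 rack=rack2
    # replicated rule that spreads copies across racks instead of hosts
    ceph osd crush rule create-replicated rack-rule default rack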

    How do you use Ceph?

    Ceph at its core is an object storage solution. Librados is the library you can include within your software project to access Ceph storage natively. There are Librados implementations for the following programming languages:

    • C(++)
    • Java
    • Python
    • PHP

    Many people are looking for more traditional storage solutions, like block storage for storing virtual machines, a POSIX compliant shared file system or S3/OpenStack Swift compatible object storage.

    Ceph provides all those features in addition to its native object storage format.

    I myself am mostly interested in block storage (RADOS Block Device, RBD) with the purpose of storing virtual machines. As Linux has native support for RBD, it makes total sense to use Ceph as a storage backend for OpenStack or plain KVM.
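    Creating and mapping an RBD image on a Linux client is straightforward. A sketch (the pool and image names are examples):

    ceph osd pool create rbd 128
    rbd pool init rbd                    # Luminous and later
    rbd create rbd/vm-disk01 --size 10G
    rbd map rbd/vm-disk01                # exposes the image as a /dev/rbdX block device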

    With very recent versions of Ceph, native support for iSCSI has been added to expose block storage to non-native clients like VMware or Windows. For the record, I have no personal experience with this feature (yet).

    The Object Storage Daemon (OSD)

    In this section we zoom in a little bit more into the technical details of Ceph.

    If you read about Ceph, you read a lot about the OSD or object storage daemon. This is a service (daemon) that runs on the storage node. The OSD is the actual workhorse of Ceph: it serves data from the hard drive, or ingests it and stores it on the drive. The OSD also assures storage redundancy by replicating data to other OSDs based on the CRUSH map.

    To be precise: for every hard drive or solid state drive in the storage node, an OSD will be active. Does your storage node have 24 hard drives? Then it runs 24 OSDs.

    And when a drive goes down, the OSD will go down too, and the monitor nodes will distribute an updated CRUSH map so the clients are aware and know where to get the data. The OSDs also respond to this update: because redundancy is lost, they may start to replicate the affected data to make it redundant again (across the remaining nodes).

    When the drive is replaced, the cluster will 'self-heal'. This means that the new drive will be filled with data once again to make sure data is spread evenly across all drives within the cluster.

    So maybe it's interesting to realise that storage clients effectively talk directly to the OSDs, which in turn talk to the individual hard drives. There aren't many components between the client and the data itself.
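    You can see the host-to-OSD layout of a running cluster at any time:

    ceph osd tree      # shows the CRUSH hierarchy: hosts and the OSDs (drives) inside them
    ceph -s            # overall cluster health and capacity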

    [Image: Ceph architecture diagram]

    Closing words

    I hope that this blog post has helped you understand how Ceph works and why it is so interesting. If you have any questions or feedback please feel free to comment or email me.


    1. If you have a ton of high-volume sequential data storage traffic, you should realise that a single host with a ton of drives can easily saturate 10 Gbit or theoretically even 40 Gbit. I'm assuming 150 MB/s per hard drive. With 36 hard drives you would face 5.4 GB/s. Even if you would only run at half that speed, you would need to bond multiple 10 Gbit interfaces to sustain this load. Imagine the requirements for your core network. But it really depends on your workload; you will never reach this kind of throughput with a ton of random I/O unless you are using SSDs, for instance. 

    2. Please note that in production setups, it's the default to have a total of 3 instances of a data block. So that means 'the original' plus two extra copies. See also this link. Thanks to sep76 from Reddit for pointing out that the default is 3 instances of your data. 

    Tagged as : Ceph
  3. Tracking Down a Faulty Storage Array Controller With ZFS

    Thu 15 December 2016

    One day, I lost two virtual machines on our DR environment after a storage vMotion.

    Further investigation uncovered that any storage vMotion of a virtual machine residing on our DR storage array would corrupt the virtual machine's disks.

    I could easily restore the affected virtual machines from backup and once that was done, continued my investigation.

    I needed a way to quickly verify whether a virtual hard drive of a virtual machine was corrupted after a storage vMotion, to understand what the pattern was.

    First, I created a virtual machine based on Linux and installed ZFS. Then, I attached a second disk of about 50 gigabytes and formatted this drive with ZFS. Once I had filled the drive to about 40 gigabytes using 'dd', I was ready to test.

    ZFS was chosen for testing purposes because it stores checksums of all blocks of data. This makes it very simple to quickly detect any data corruption. If the stored checksum doesn't match the checksum computed from the data, you have just detected corruption.

    Most other file systems don't store checksums and don't check for data corruption, so they just trust the storage layer. It may take a while before you find out that data is corrupted.

    I performed a storage vMotion of this secondary disk towards different datastores and then ran a 'zpool scrub' to track down any corruption. This worked better than expected: the scrub command would hang if the drive had been corrupted by the storage vMotion. The test virtual machine then required a reboot and a reformat of the secondary hard drive with ZFS, as the previous file system, including its data, had been corrupted.
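    For reference, the test setup amounted to something like this (the device name of the second virtual disk is an assumption):

    zpool create testpool /dev/sdb                            # the 50 GB test disk
    dd if=/dev/urandom of=/testpool/data.bin bs=1M count=40000
    # after every storage vMotion of the test disk:
    zpool scrub testpool
    zpool status -v testpool                                  # checksum errors here mean corruption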

    After performing storage vMotions of the drive in different directions, from different datastores to other datastores, a pattern slowly emerged.

    1. Storage vMotion corruption happened independent of the VMware ESXi host used.

    2. A storage vMotion never caused any issues when the disk was residing on our production storage array.

    3. The corruption only happened when the virtual machine was stored on particular datastores on our DR storage array.

    Now it got really 'interesting'. The thing is that our DR storage array has two separate storage controllers running in active-active mode. However, each LUN is always owned by a particular controller. Although the other controller can take over from the controller that 'owns' a LUN in case of a failure, the owner processes the I/O when everything is fine. Particular LUNs are thus handled by a particular controller.

    So first I made a table where I listed the controllers and the LUNs it had ownership over, like this:

                Controller a (owner)    Controller b (owner)
                LUN001                  LUN002
                LUN003                  LUN004
                LUN005                  LUN006
    

    Then I started to perform storage vMotions of the ZFS disk from one LUN to another. After performing several tests, the pattern became quite obvious.

                LUN001  ->  LUN002  =   BAD
                LUN001  ->  LUN004  =   BAD
                LUN004  ->  LUN003  =   BAD
                LUN003  ->  LUN005  =   GOOD
                LUN005  ->  LUN001  =   GOOD
    

    I continued to test some additional permutations but it became clear that only LUNs owned by controller b caused problems.

    With the evidence in hand, I managed to convince our vendor's support to replace storage controller b, and that indeed resolved the problem. Data corruption due to a storage vMotion never occurred again after the controller was replaced.

    There is no need to name and shame the vendor in this regard. The thing is that all equipment can fail, and what can happen will happen. What really counts is: are you prepared?

    Tagged as : ZFS
