Personally I have a weakness for big-ass storage. Say 'petabyte' and I'm interested. So I was thinking about how you would set up a large, scalable storage infrastructure. How should such a thing work?
Very simple: you should be able just to add hosts with some bad-ass huge RAID arrays attached to them. Maybe even not that huge, say 8 TB RAID 6 arrays or maybe bigger. You use these systems as building blocks to create a single and very large storage space. And then there is one additional requirement: as the number of these building blocks increases, you must be able to lose some of them without losing data or availability. You should be able to continue operations without one or two of those storage building blocks. Like RAID 5 or 6, but across server systems instead of hard drives.
The hard part is combining all this separate storage into one virtual environment. A solution to this problem is Lustre.
Lustre is a network clustering filesystem. What does that mean? You can use Lustre to create a scalable storage platform: a single filesystem that can grow to multiple petabytes. Lustre is deployed in production at large-scale sites, including some of the fastest and largest computer clusters. Lustre is thus something to take seriously.
Lustre stores all metadata about files on a separate Metadata Server (MDS). All actual file data is stored on Object Storage Targets (OSTs). These are just one or more big RAID arrays (or simple disks) attached to a machine. Clients do not access the OSTs directly, but through an Object Storage Server (OSS), the machine the OSTs are attached to. The data stored within a file can be striped over multiple OSTs for performance reasons. A sort of network RAID 0.
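To make the 'network RAID 0' idea a bit more concrete, here is a minimal sketch of round-robin striping. It is a simplified model and not Lustre's actual layout code: the function name `locate` and the parameters `stripe_size` and `stripe_count` are my own, chosen for illustration.

```python
def locate(offset, stripe_size, stripe_count):
    """Map a byte offset in a file to (ost_index, offset_within_object).

    Simplified round-robin model: stripe 0 goes to OST 0, stripe 1 to
    OST 1, and so on, wrapping around after stripe_count OSTs.
    """
    stripe_number = offset // stripe_size
    ost_index = stripe_number % stripe_count
    # Every full round over all OSTs adds one stripe to each OST's object.
    full_rounds = stripe_number // stripe_count
    object_offset = full_rounds * stripe_size + offset % stripe_size
    return ost_index, object_offset


MIB = 1024 * 1024

# A file striped over 4 OSTs in 1 MiB stripes (assumed values).
first = locate(0, MIB, 4)
second = locate(MIB, MIB, 4)
wrapped = locate(5 * MIB + 10, MIB, 4)
```

In this model consecutive 1 MiB chunks land on consecutive OSTs, so a large sequential read is served by all four at once. It also shows the downside: every large file has pieces on every OST in its stripe set.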
Lustre not only scales up to petabytes of storage, it also allows parallel file transfer performance in excess of 100 GB/s. How you like them apples? That is just wicked sick.
Just take a look at this diagram about how Lustre operates:
I'm not going into the details about Lustre. I want to discuss a shortcoming that may pose a serious risk of data loss: if you lose a single OST with any attached storage, you will lose all data stored on that OST.
Lustre cannot cope with the loss of a single OST! Even if you buy fully redundant hardware, with double RAID controllers, ECC memory, double PSUs, etc., even then, if the motherboard gets fried, you will lose data. Surely not everything, but let's say 'just' 8 TB maybe?
I guess the risk is assumed to be low, given how widely Lustre is deployed, and deployed by people who actually use it and have way more experience and knowledge than me about this whole stuff. So maybe I'm pointing out risks that are just very small. But I have seen server systems fail as badly as described. I don't think the risk, especially at this scale, is that small.
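A bit of back-of-the-envelope arithmetic shows why scale matters here. This is only a sketch under loud assumptions: the 2% yearly chance of an unrecoverable failure per building block is a number I made up for illustration, not a measurement, and failures are assumed to be independent. The function name `p_any_failure` is mine as well.

```python
def p_any_failure(n_blocks, p_single):
    """Chance that at least one of n_blocks independent blocks fails."""
    return 1.0 - (1.0 - p_single) ** n_blocks


# 1 PB built from 8 TB building blocks needs 125 of them.
blocks = 1000 // 8

# ASSUMPTION: 2% yearly chance per block of a failure that takes its data
# with it. Purely illustrative, not a measured figure.
yearly_risk = p_any_failure(blocks, 0.02)
print(f"{blocks} blocks: {yearly_risk:.2f}")
```

Under these assumed numbers the chance of losing at least one block in a year comes out at roughly 0.92, even though each individual block looks quite reliable. The exact figure depends entirely on the assumed per-block rate, but the shape of the curve is the point.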
For Lustre to become truly awesome, it would have to implement some kind of network-based RAID 6 striping, so you could lose one or even two OSTs without any impact on availability, except maybe for performance. But it doesn't (yet).
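To illustrate what parity across storage building blocks would buy you, here is a minimal sketch using plain XOR parity. Note that this is the single-parity (RAID 5 style) case only: surviving two lost blocks, as RAID 6 does, needs a second, differently computed parity (typically Reed-Solomon), which is left out here. Nothing below is Lustre code; `xor_blocks` and the sample data are made up for the example.

```python
from functools import reduce
from operator import xor


def xor_blocks(blocks):
    """XOR equally sized byte blocks together, column by column."""
    return bytes(reduce(xor, column) for column in zip(*blocks))


# Three data blocks, each imagined to live on a different OST.
data = [b"AAAA", b"BBBB", b"CCCC"]

# The parity block would live on a fourth OST.
parity = xor_blocks(data)

# The OST holding data[1] dies. XOR-ing the survivors with the parity
# block yields the missing block again.
survivors = [data[0], data[2], parity]
recovered = xor_blocks(survivors)
```

The price is one extra block of capacity per stripe, plus the network traffic needed to rebuild a lost block from all the others.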
This implies that you have to make your OSTs super-reliable, which would be very expensive (and does not scale). Or you need some very high-capacity backup solution that can restore the lost data. But then you would have downtime.
So my question to you is: is there an actual scalable filesystem like Lustre that is capable of withstanding the failure of a single storage building block? If you have something to point out, please do.
BTW: please note that the loss of an OSS can be overcome because another OSS can take over the OSTs of a failed OSS.