1. Lustre and the Risk of Serious Data Loss

    July 03, 2010

    Personally I have a weakness for big-ass storage. Say 'petabyte' and I'm interested. So I was thinking about how you would set up a large, scalable storage infrastructure. How should such a thing work?

    Very simple: you should be able to just add hosts with some bad-ass huge RAID arrays attached to them. Maybe not even that huge, say 8 TB RAID 6 arrays, or maybe bigger. You use these systems as building blocks to create a single, very large storage space. And then there is one additional requirement: as the number of these building blocks increases, you must be able to lose some of them without losing data or availability. You should be able to continue operations with one or two of those storage building blocks down. Like RAID 5 or 6, but over server systems instead of hard drives.

    The hard part is in connecting all this separate storage to one virtual environment. A solution to this problem is Lustre.

    Lustre is a clustered network filesystem. What does that mean? You can use Lustre to create a scalable storage platform: a single filesystem that can grow to multiple petabytes. Lustre is deployed in production environments at large-scale sites involving some of the fastest and largest computer clusters. Lustre is thus something to take seriously.

    Lustre stores all metadata about files on a separate MetaData Server (MDS). All actual file data is stored on Object Storage Targets (OSTs): big RAID arrays (or plain disks) attached to server machines. The OSTs are not directly accessible by clients, but are served through an Object Storage Server (OSS). The data within a file can be striped over multiple OSTs for performance reasons. A sort of network RAID 0.
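    Striping is controlled from the client side with the lfs tool, per file or per directory. As a sketch (the mount point, directory name, and values below are hypothetical examples, not taken from a real deployment):

    ```shell
    # Stripe new files created in this directory over 4 OSTs,
    # in 1 MiB chunks (example values).
    lfs setstripe -c 4 -S 1M /mnt/lustre/bigdata

    # Show the resulting striping layout.
    lfs getstripe /mnt/lustre/bigdata
    ```

    A higher stripe count spreads reads and writes over more OSTs, which is where the aggregate bandwidth comes from.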

    Lustre does not only allow scaling up to petabytes of storage, it also allows parallel file transfer performance in excess of 100 GB/s. How do you like them apples? That is just wicked sick.

    Just take a look at this diagram about how Lustre operates:

    [diagram: Lustre architecture schema]

    I'm not going into the details of Lustre. I want to discuss a shortcoming that may pose a serious risk of data loss: if you lose a single OST with its attached storage, you will lose all data stored on that OST.

    Lustre cannot cope with the loss of a single OST! Even if you buy fully redundant hardware, with dual RAID controllers, ECC memory, dual PSUs, etc., even then, if the motherboard gets fried, you will lose data. Certainly not everything, but let's say 'just' 8 TB, maybe?

    I guess the risk is assumed to be low, given the wide-scale deployment of Lustre, deployed by people who actually use it and have way more experience and knowledge about this whole stuff than I do. So maybe I'm pointing out risks that are just very small. But I have seen server systems fail this badly. I don't think the risk, especially at this scale, is that small.

    I am certainly not the first to point out this risk.

    The solution for Lustre to become truly awesome is to implement some kind of network-based RAID 6 striping, so you could lose one or even two OSTs without any impact on availability, except maybe on performance. But it doesn't do that (yet).

    This implies that you have to make your OSTs super-reliable, which would be very expensive (it does not scale). Or have some very high-capacity backup solution from which you could restore the data. But you would have downtime.

    So my question to you is: is there a scalable filesystem like Lustre that actually is capable of withstanding the failure of a single storage building block? If you have something to point out, please do.

    BTW: please note that the loss of an OSS can be overcome, because another OSS can take over the OSTs of a failed OSS.

    Tagged as : lustre ost failure data loss
  2. Recovering a Lost Partition Using Gpart

    December 22, 2009

    Even today, people do not understand how important it is for the safety of their data to make backups. I was asked to perform some data recovery on the hard drive of an old computer, which still contained important documents and photos.

    The first thing I did was to make a disk image with ddrescue. I always work with the image and not with the original drive, to prevent any risk of accidentally messing things up for good.

    Example:

    ddrescue -r 2 -v /dev/sdf /storage/image/diskofperson.dd /storage/image/diskofperson.log
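    Note that ddrescue takes positional arguments (infile, outfile, and an optional logfile), not dd-style if=/of= options. A common refinement, which I did not use here, is a two-pass approach (same example paths as above; the logfile name is my own choice):

    ```shell
    # Pass 1: copy everything that reads cleanly, skipping the slow
    # scraping of bad areas (-n), so good data is secured first.
    ddrescue -n -v /dev/sdf /storage/image/diskofperson.dd /storage/image/diskofperson.log

    # Pass 2: go back for the bad areas, retrying each up to 2 times (-r 2).
    # The logfile lets ddrescue resume exactly where it left off.
    ddrescue -r 2 -v /dev/sdf /storage/image/diskofperson.dd /storage/image/diskofperson.log
    ```

    The logfile is what makes the process restartable: you can interrupt ddrescue at any time without losing progress.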

    Next, I tried running parted on this file, but got this error:

    Welcome to GNU Parted! Type 'help' to view a list of commands.

    (parted) p

    Error: /storage/image/diskofperson.dd: unrecognised disk label

    (parted) quit

    Also fdisk -l didn't work:

    Disk /storage/image/diskofperson.dd doesn't contain a valid partition table

    It seemed that the partition table was gone. I tried the utility testdisk to recover this partition, to no avail. Why this tool didn't work is beyond me.

    I found a very old utility called 'gpart' that just scans a disk for existing partitions. I just wanted to know the starting offset of the relevant partition.

    So I ran:

    gpart -g /storage/image/diskofperson.dd

    And I got nothing useful, although a partition was found:

    Begin scan...

    Possible partition(DOS FAT), size(57255mb), offset(0mb)

    End scan.

    So I ran the command again with more verbosity:

    gpart -gv /storage/image/diskofperson.dd

    ...

    Begin scan...

    Possible partition(DOS FAT), size(57255mb), offset(0mb)

    type: 011(0x0B)(DOS or Windows 95 with 32 bit FAT)

    size: 57255mb #s(117258372) s(63-117258434)

    chs: (1023/255/0)-(1023/255/0)d (0/0/0)-(0/0/0)r

    hex: 00 FF C0 FF 0B FF C0 FF 3F 00 00 00 84 38 FD 06

    End scan.

    ...

    This time I got something useful. The s(63-117258434) part shows the starting sector, which is 63. A sector is 512 bytes, so the exact starting byte offset of the partition is 63 × 512 = 32256.
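    The offset calculation can be done straight in the shell, which avoids arithmetic slips when the starting sector is less round than 63:

    ```shell
    # Starting sector reported by gpart, times the sector size in bytes.
    echo $((63 * 512))
    # prints 32256
    ```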

    So to mount this partition, just issue:

    mount -o loop,ro,offset=32256 /storage/image/diskofperson.dd /mnt/recovery

    And voilà, access to the filesystem has been obtained.

    /storage/image/diskofperson.dd on /mnt/recovery type vfat (ro,loop=/dev/loop0,offset=32256)
