10 Gb Ethernet is still quite expensive. You not only need to buy appropriate NICs, you must also upgrade your network hardware. You may even need to replace existing fiber optic cabling if it's not rated for 10 Gbit.

So I decided to stick with plain old copper-based 1 Gbit iSCSI for our backup SAN. After some research I went for the HP MSA P2000 G3 with dual 1 Gbit iSCSI controllers.

Each controller has 4 x 1 Gbit ports, so the box has a total of 8 Gigabit ports. This is ideal for redundancy, performance and cost. This relatively cheap SAN does support active/active mode, so both controllers can share the I/O load.

The problem with storage is that a single 1 Gbit channel is just not going to cut it when you need to perform bandwidth intensive tasks, such as moving VMs between datastores (within VMware).

Fortunately, iSCSI Multi Pathing allows you to do what is basically a RAID 0 over multiple network cards, combining their performance. So four 1 Gbit NICs can provide you with up to 4 Gbit/s of storage throughput, before protocol overhead.

The trick is not only to configure iSCSI Multi Pathing using regular tutorials, but also to enable the Round Robin setting on each datastore or each raw device mapping.
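The Round Robin policy can also be selected from the ESXi shell instead of the vSphere client. This is a minimal sketch, assuming ESXi 5.x; the helper name set_round_robin is made up for illustration, and the device ID is whatever `esxcli storage nmp device list` reports for your LUN:

```shell
# Hypothetical helper: switch one device to the Round Robin path selection policy.
set_round_robin() {
    # $1 = device identifier, e.g. a naa.* ID taken from
    #      `esxcli storage nmp device list`
    esxcli storage nmp device set --device "$1" --psp VMW_PSP_RR
}
```

Call it as `set_round_robin <device-id>` for every datastore device or raw device mapping that should use all paths.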

So I did all this and still got less than 1 Gbit/s of performance. Fortunately, there is only one little trick needed to get the performance you might expect.

I found this at multiple locations but the explanation on Justin's IT Blog is best.

By default, VMware sends 1000 I/O operations down one path (NIC) before switching (Round Robin) to the next one. This really hampers performance. You need to set this value to 1.

esxcli storage nmp psp roundrobin deviceconfig set -d $DEV --iops 1 --type iops

This configuration tweak is recommended by HP, see page 28 of the linked PDF.

Once I configured all iSCSI paths to this setting, I got 350 MB/s of sequential write performance from a single VM to the datastore. That's decent enough for me.
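To put that number in perspective, here is a quick back-of-the-envelope calculation. It only uses the figures from this post (four 1 Gbit links, 350 MB/s measured) and ignores Ethernet, TCP/IP and iSCSI overhead, so the real ceiling is somewhat lower than the raw line rate:

```shell
links=4            # number of 1 Gbit NICs used for iSCSI
link_mbit=1000     # raw line rate per link in Mbit/s
measured=350       # sequential write speed measured from the VM, in MB/s

single_mbyte=$(( link_mbit / 8 ))         # one link: 125 MB/s
raw_mbyte=$(( links * link_mbit / 8 ))    # four links: 500 MB/s
pct=$(( measured * 100 / raw_mbyte ))     # share of the raw line rate: 70

echo "One link: ${single_mbyte} MB/s, all links: ${raw_mbyte} MB/s"
echo "Measured: ${measured} MB/s (${pct}% of the raw line rate)"
```

Since 350 MB/s is well above what two links can deliver (250 MB/s), the traffic really is being spread over more than two paths.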

How do you do this? It's a simple one-liner that sets the iops value to 1, but I'm so lazy that I don't want to copy/paste device names and run the command by hand each time.

I used a simple CLI script (VMware 5) to configure this setting for all devices. SSH to the host and then run this script:

# Set the Round Robin IOPS limit to 1 on every naa.* device known to the host.
for x in $(esxcli storage nmp device list | grep '^naa')
do
    echo "Configuring Round Robin iops value for device $x"
    esxcli storage nmp psp roundrobin deviceconfig set -d "$x" --iops 1 --type iops
done

This is not the exact script I used and I still have to verify this code, but basically it just configures this value for all storage devices. Devices that don't support this setting will raise an error message that can be ignored (if the VMware host also has some local SAS or SATA storage, this is expected).

The next step is to check if this setting is permanent and survives a host reboot.
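One way to check this after a reboot is to print the Round Robin configuration of every device again and look at the reported IOPS limit. This is a sketch for ESXi 5.x; the helper name show_rr_config is made up for illustration:

```shell
# Hypothetical helper: print the Round Robin settings of every naa.* device.
show_rr_config() {
    for dev in $(esxcli storage nmp device list | grep '^naa')
    do
        echo "== $dev"
        # Devices that do not use Round Robin will report an error here.
        esxcli storage nmp psp roundrobin deviceconfig get -d "$dev"
    done
}
```

Run `show_rr_config` before and after the reboot and compare the output.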

Anyway, I verified the performance using a Linux VM and just writing a simple test file:

dd if=/dev/zero of=/storage/test.bin bs=1M count=30000
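A caveat with this test is that Linux may still hold part of the file in its write cache when dd reports its result. With a 30 GB file the effect is probably small, but a variant that forces the data to disk before dd prints its throughput could look like this. It assumes GNU dd (for conv=fdatasync); TARGET and COUNT are placeholder variables with deliberately small, harmless defaults:

```shell
# Placeholders: point TARGET at the datastore-backed mount (the post used
# /storage/test.bin) and raise COUNT (the post used 30000) for a real test.
TARGET="${TARGET:-/tmp/dd-test.bin}"
COUNT="${COUNT:-16}"    # number of 1 MiB blocks to write

# conv=fdatasync makes dd flush the file data to disk before it exits,
# so the reported throughput includes the time needed for that flush.
dd if=/dev/zero of="$TARGET" bs=1M count="$COUNT" conv=fdatasync
```

For example: `TARGET=/storage/test.bin COUNT=30000 sh ./ddtest.sh`, where ddtest.sh is whatever file you saved the snippet in.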

To see Multi Pathing + Round Robin in action, run esxtop at the CLI and then press n for the network view. You will notice that with four network cards, VMware will use all four channels available.

All this is to say that plain old 1 Gbit iSCSI can still be fast. But I believe that 10 Gbit Ethernet probably provides better latency. Whether that's really an issue for your environment is something I can't tell.

Changing the IOPS parameter to 1 IOPS also seems to improve random I/O performance, according to the table in Justin's post.

Still, although 1 Gbit iSCSI is cheap, it may be more difficult to get the performance levels you need. If you have time, but little money, it may be the way to go. However, if time is not on your side and money isn't the biggest problem, I would definitely investigate the price difference with going for Fibre Channel or 10 Gbit iSCSI.

I use FIO to perform storage I/O performance benchmarks. FIO provides a script called "fio_generate_plots" which generates PNG or JPG charts from the data generated by FIO. The charts are created with GNUplot.

The "fio_generate_plots" script didn't make me very happy as it didn't generate the kind of graphs I wanted. Furthermore, the script mostly consists of copy/pasted blocks of the same code, slightly altered for the different benchmark types. I understand that the focus lies on FIO itself, not on some script to generate fancy graphs, so don't get me wrong, but the script could be improved.

I used this script as the basis for a significantly reworked version, putting the code in a function that can be called with different parameters for the different benchmark types.

The result of this new script is something like this:

[Image: benchmark]

You can download this new script here. This script requires GNUplot 4.4 or higher.

Update 2013/05/26

I've submitted the script as a patch to the maintainers of FIO and it has been committed to the source tree. I'm not sure how this will work out but I assume that this script will be part of newer FIO releases.

Linode has released an update about the security incident first reported on April 12, 2013.

The Linode Manager is the environment where you control your virtual private servers and where you pay for services. This is the environment that got compromised.

Linode uses Adobe's ColdFusion as a platform for their Linode Manager application. It seems that the ColdFusion software was affected by two significant, previously unknown vulnerabilities that allowed attackers to compromise the entire Linode VPS management environment.

As the attackers had control over the virtual private servers hosted on the platform, they decided to compromise the VPS used by Nmap. Yes, the famous port scanner.

Fyodor's remark about the incident:

I guess we've seen the dark side of cloud hosting.

That's the thing. Cloud hosting is just an extra layer, an extra attack surface, that may provide an attacker with the opportunity to compromise your server and thus your data.

Even the author of Nmap, a person fairly conscious about security and aware of the risk of cloud-hosting, still took the risk to save a few bucks and some time setting something up himself.

If you are a Linode customer and consider becoming a former customer by fleeing to another cheap cloud VPS provider, are you really sure you are solving your problems?

When using cloud services, you pay less and you outsource the chores that come with hosting on a dedicated private server.

You also lose control over security.

Cloud hosting is just storing your data on 'Other People's Hard Drives'. So the security of your stuff depends on those 'other people'. But did you ask those 'other people' for any information about how they think to address risks like zero-days or other security threats? Or did you just consider their pricing, give them your credit card and get on with your life?

If you left Linode for another cloud VPS provider, what assures you that they will do better? How do you know that they aren't compromised already right now? At this moment? You feel paranoid already?

We all want cheap hosting, but are you also willing to pay the price when the cloud platform is compromised?

I bought a Linode VPS for private usage just after the report that Linode had reset the Linode management console passwords of all existing users.

Resetting passwords is not something you do when under a simple attack such as a DDoS attack. Such a measure is only taken if you suspect or have proof of a serious security breach. I should have known.

There are strong rumours that Linode has actually been hacked. Although I signed up for a Linode VPS after the attack, I still checked my credit card for any suspicious withdrawals.

Linode is as of this writing very silent about the topic, which only fuels my, and everyone else's, suspicion that something bad has happened.

Whatever happened, even if it isn't as bad as it seems, an incident such as this should make you evaluate your choices about hosting your apps and data on cloud services.

I don't care that much about rumours that credit card information may have been compromised, although that is quite damning in itself. What I do care about is the security of the data stored in the virtual private servers hosted on their platform.

I like this phrase: "There is no cloud, only Other People's Hard Drives".

Everybody uses cloud services, so we all put our data in the hands of some other third party and we just hope that they properly secured their environment.

The cynical truth is that even so, a case can be made that for many companies, data stored in the cloud or on a VPS is a lot safer than within their own company IT environment. But an incident like this may prove otherwise.

And if you believe that data on a VPS is more secure than within your own IT environment, I believe that you have more pressing problems. The thing is that it doesn't tell you anything about the security of those cloud solutions. It only tells you something about the perceived security of your own IT environment.

The cloud infrastructure is just another layer between the metal and your services, and it can thus be attacked. It increases the attack surface. It increases the risk of a compromise. The cloud doesn't make your environment more secure, on the contrary.

So anyway, who performs regular security audits of Linode (or insert your current cloud hosting provider), and what is the quality of the processes that should assure security at all times?

Questions. Questions.

This incident again shows that you should clearly think about what kind of security your company or customer data warrants. Is outsourcing security of your data acceptable?

Maybe, if security is an important factor, those cheap VPS hosts aren't that cheap after all. You may be better off creating your own private cloud on (rented or owned) dedicated servers and putting a little bit more effort into it.

Building your own environment on your own equipment is more expensive than just a simple VPS, but you are much more in control regarding security.

There is a fundamental difference between a read operation and a write operation. Storage can lie about completing a write operation, but it can never lie about completing a read operation. Therefore reads and writes have different characteristics. This is what I've learned.

About writes

So what does this mean? Well, if you write data to disk, the I/O subsystem only has to acknowledge that it has written the data to the actual medium. Basically, the application says "please write this data to disk" and the I/O subsystem answers "done, feel free to give me another block of data!".

But the application cannot be sure that the I/O subsystem actually wrote that data to disk. More likely, the application can be sure the I/O subsystem lied.

Compared to RAM, non-volatile storage like hard drives is slow. Orders of magnitude slower. And the worst-case scenario, which is often also the real-life scenario, is that both read and write patterns are random as perceived from the storage I/O subsystem.

So you have this mechanical device with rotating platters and a moving arm, governed by Newton's laws of physics, trying to compete with CPUs and memory that are so small that they are affected by quantum mechanical effects. No way that device is going to be able to keep up with that.

So the I/O subsystem cheats. Hard drives are relatively great at reading and writing blocks of data sequentially; it's the random access patterns that wreak havoc on performance. So the trick is to lie to the application and collect a bunch of writes in a cache, in memory.

So, meanwhile, the I/O subsystem looks at the data to be written to disk, and reorders the write operations, so that it becomes as 'serialised' as possible. It tries to take into account all the latencies involved in moving the arm, timing that with the rotation of the platter and that kind of stuff.

A 7200 RPM hard drive can do only about 75 IOPS with random access patterns, but that is the worst case of worst-case scenarios. Real-life usage scenarios often allow for some optimisation.
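Where does a figure like 75 IOPS come from? A rough estimate is one second divided by the time a single random I/O takes: the average seek time plus the average rotational latency, which is half a revolution. The sketch below does that calculation; the seek times (9 ms for a 7200 RPM drive, 3.5 ms for a 15000 RPM drive) are ballpark assumptions, not measurements, and the function name is made up for illustration:

```shell
# Hypothetical helper: estimate random IOPS of a single hard drive.
estimate_iops() {
    # $1 = spindle speed in RPM, $2 = assumed average seek time in ms
    awk -v rpm="$1" -v seek="$2" 'BEGIN {
        rot = 60000 / rpm / 2              # average rotational latency in ms
        print int(1000 / (rot + seek))     # I/O operations per second
    }'
}

estimate_iops 7200 9       # prints 75
estimate_iops 15000 3.5    # prints 181
```

Transfer time and controller overhead are ignored here, so treat the result as an order-of-magnitude figure only.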

I used FIO to perform some random I/O performance benchmarks on different hard drive types and RAID configurations. It turned out that read performance was in line with the 75 IOPS, but writes were in the thousands of IOPS, which is not a realistic figure. The operating system (Linux) employed heavy caching of the writes, lying to FIO about the actual IOPS being written to disk.
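If you want FIO to show what the disks themselves can do, you can ask it to bypass the Linux page cache with direct I/O. The job below is only an illustration and not the job I ran for these benchmarks; the function name, job name, block size, file size and runtime are arbitrary choices:

```shell
# Hypothetical helper: random 4k writes with the page cache bypassed.
run_randwrite() {
    # $1 = test file on the storage you want to measure
    fio --name=randwrite-direct --filename="$1" --rw=randwrite --bs=4k \
        --size=1g --direct=1 --ioengine=libaio --iodepth=1 \
        --runtime=60 --time_based
}
```

Run it as `run_randwrite <path-to-test-file>`. Note that direct I/O only skips the operating system cache; a RAID controller or the drive itself may still cache writes.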

Thousands of IOPS sounds great, but you can only lie until your write cache is full. There comes a time when you have to actually deliver and write this data to disk. This is where you see large drops in performance, to almost zero IOPS.

Most of the time, this behaviour is overall beneficial to application performance, as long as the application usage patterns are often short bursts of data, that need to be written to disk. With more steady streams of data being written to disk in a random order, this might influence application responsiveness. The application might become periodically unresponsive as data is flushed from the cache to disk.

This write-caching behaviour is often desired, because by reordering and optimising the order of the write requests, the actual overall obtained random I/O write performance is often significantly higher than could be achieved by the disk subsystem itself.

If the disk subsystem is not just a single disk, but a RAID array, comprised of multiple drives, write-caching is often even more important to keep performance acceptable, especially for RAID arrays with parity, such as RAID 5 and RAID 6.
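To get a feeling for why parity RAID suffers so much without a write cache, you can use the commonly quoted rule-of-thumb write penalties: every small random write costs about 2 disk I/Os on RAID 10, 4 on RAID 5 (read data, read parity, write data, write parity) and 6 on RAID 6. The sketch below applies those penalties to an assumed array of eight drives doing 75 IOPS each; the function name and the array size are made up for illustration:

```shell
# Hypothetical helper: rough random-write IOPS of an array without caching.
raid_write_iops() {
    # $1 = number of drives, $2 = random IOPS per drive, $3 = write penalty
    echo $(( $1 * $2 / $3 ))
}

raid_write_iops 8 75 2    # RAID 10: prints 300
raid_write_iops 8 75 4    # RAID 5:  prints 150
raid_write_iops 8 75 6    # RAID 6:  prints 100
```

The same eight drives can deliver roughly 8 x 75 = 600 random read IOPS, so uncached random writes on RAID 6 end up at about a sixth of that.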

Write-back caching may help increase performance significantly, but it may come at a cost. As the I/O subsystem lies about data being written to disk, that data may get lost if the system crashes or loses power. There is a risk of data loss or data corruption. Only use write-back caching on equipment that is supported by battery backup units and a UPS. Due to the risks associated with write-back caching, there might be use cases where it might be advised not to enable it to retain data consistency.

About reads

The I/O subsystem can't lie to the application about reads. If the application asks the I/O subsystem "can I have the contents of file X", the I/O subsystem can't just say "well, yes, sure". It actually has to deliver that data. So where any arbitrary write can easily be cached and written to disk in a more optimised way, reads are harder. There is no easy way out: the I/O subsystem must deliver.

Where any arbitrary write can be cached, only a limited number of reads can be cached. Cache memory is relatively small compared to the storage of the disk subsystem. The I/O subsystem must be smart about which data needs to be cached.

More complex storage solutions keep track of 'hot spots' and keep that data cached. As a side note, such caching constructions can now also be found in consumer grade equipment: Apple's Fusion Drive uses the SSD as a cache and stores the data that is less frequently accessed on the HDD.

But in the end, regarding reads, chances are higher that data must be retrieved that is not stored in cache (cache miss) and thus the drives must do actual work. Fortunately, that work is not as 'expensive' as writes for RAID 5 or RAID 6 arrays.

Furthermore, reads can also be 'grouped' and serialised (increased queue depth) to set up a more sequential read access pattern for the disk subsystem and achieve better performance. But this comes at the cost of latency, and thus responsiveness. That may or may not be a problem depending on the type of application.

Some remarks

If possible, it's better to avoid having to access the storage subsystem in the first place. Try and throw RAM at the problem. Buy systems with sufficient RAM, so that the entire database fits in memory. A few years ago this was unthinkable, but 128 GB of RAM can be had for less than two thousand dollars.

If RAM isn't an option (dataset is too large) still try and put in as much RAM as possible. Also, try and see if server grade Solid State Drives (SSDs) are an option (always RAID 1 at least for redundancy!), although their cost may be an obstacle.

The gateway of last resort is the old trusted hard drive. If random I/O is really an issue, take a look at 15000 RPM or at least 10000 RPM SAS drives and a good RAID controller with loads of cache memory. In general, more drives or more 'spindles' equals more I/O performance.

You might encounter a situation where you want to add drives to increase I/O performance, not for the storage. More importantly, you may choose not to use that extra storage as it may decrease performance. If you put more data on a disk, the head must cover larger areas of the disk platter, increasing latency.

There are use cases where drives are intentionally under-partitioned to (artificially) increase the performance of the drives.
