In our last installment, I found that the combination of Windows Home Server as a NAS and Vista SP1 as a client could achieve read performance of almost 80 MB/s. But that impressive speed was reached only with filesizes that were smaller than the amount of RAM in both the client and NAS systems. Once caches were exceeded, performance fell to slightly under 60 MB/s, limited by the speed of a single SATA drive; WHS' key weakness.
I then made a detour into the low end to see how much performance could be squeezed from Intel's Atom mini-ITX board. I used both FreeNAS and Ubuntu Sever as OSes and found the latter able to get me up to read speeds of just shy of 60 MB/s with a RAID 0 array.
So I had high hopes of achieving significantly better performance when I installed the Ubuntu Server, mdadm and Webmin combination on the former WHS test bed. But try as I might, despite days of testing, I'm still far from the 100 MB/s read / write goal.
The problem isn't due to any limitation on the machine running iozone. It's a Core 2 Duo machine with a PCIe gigabit Ethernet connection that has been tested to read and write data at over 100 MB/s. And iozone doesn't access the system hard drive during testing, so that's not what is holding me back.
So I asked iozone's creator and filesystem expert Don Capps for any wisdom that he could shed on the subject. His response was illuminating, as usual:
RAID 0 can improve performance under some circumstances, specifically, sequential reads, with an operating system that does plenty of read-ahead. For this case, the read-ahead permits drives to overlap their seeks and pre-stage the data into the operating system's cache.
RAID 0 can also improve sequential write performance, with an application and an operating system that permits plenty of write-behind. In this case, the application is doing simple sequential writes and the operating system is completing the writes when the data is in its cache. The operating system can then issue multiple writes to the array and again overlap the seeks. The operating system may also combine small writes into larger writes.
The controller and disks may also support command tag queuing. This would permit multiple requests to be issued to the controller or disks and then the controller, or disks, could re-order the operations such that it reduces seek latencies.
These mechanisms can also occur in other RAID levels, but may be hampered by slowness in the XOR engine, controller or bus limitations, stripe wrap, parity disk nodalization, command tag queue depth, array cache algorithms or several other dozen factors...
Many of these clever optimizations may also have less effect as a function of file size. For example, write-behind effect might decrease as the file size exceeds the size of RAM in the operating system. Once this happens, the operating system's cache is full, and the array can not drain the cache as fast as the writer is trying to fill it.
In this case, performance deteriorates to a one-in / one-out series, or may deteriorate to a "sawtooth" function, if the operating system detects full cache and starts flushing in multi-block chunks. This permits the seek overlap to re-engage, but the full condition rapidly returns (over and over) as the writer can remain faster than the drives can receive the data.
Read-ahead effects can also fade. If one application is performing simple sequential reads, then read-ahead can overlap the disk seeks. But as more applications come online, the array can become saturated with read-ahead requests, causing high seek activity.
Finally, as the number of requesting applications rise, the aggregate data set can exceed the operating system's cache. When this happens, read-aheads complete into the operating system's cache. But before the application can use the read-ahead blocks, the cache might be reused by another application. In this case, the read-ahead performed by the disks was never used by the application, but it did create extra work for the disks. Once this threshold has been reached, the aggregate throughput for the array can be lower than for a single disk!
So it's important to understand that application workload can have a very significant impact on RAID performance. It's easy to experience a performance win in some circumstances. But it's also not so easy to prevent side effects or to keep the performance gain from fading away.
Don also provided some insight into how other RAID modes can affect filesystem performance.
RAID 1 is a "mirrored" set of disks, with no parity. A reader can go to disk 1 or disk 2 and it is up to the RAID controller to load balance the traffic. A writer must place the same data on both disks.
Write performance is the same or lower than that of a single drive. However, read performance might be improved, depending on how well the controller overlaps seeks.
Performance can be limited by disk 1, or disk 2 or disk 3. Parity is generated by doing an XOR of block 1 and block 2. This permits the reconstruction of either block 1 or block 2 by doing an XOR with the remaining good disk and the parity disk.
However, the XOR operation takes time and CPU resources. It can easily become a bottleneck unless there is a hardware assist for the XOR. Performance is also greatly improved by having multiple parity engines and NVRAM to permit asynchronous writes of the data blocks and the parity blocks.
RAID 10 (or RAID 1+0) is a nested-RAID array that combines striping and mirroring. It can provide improved data integrity (via mirroring), and improved read performance. But it requires additional disk space (four drives, minimum) and more sophistication in the RAID controller.
A reader can read block 1 from either disk 1 or disk 2, which permits load balancing. The striping permits more than 2 disks to participate in the mirroring.