A curious reader recently posed this question, which I've edited to remove references to specific products and companies:
I was hoping you can answer a question for me. As you know, TLER is important for RAID sets to prevent dropouts due to prolonged error recovery attempts.
However, if you look at hardware compatibility lists for [NASes], they include both enterprise and desktop (non-TLER-enabled) drives. What is also not clear is if the RAID implementation in NAS devices also support TLER.
My question then is, what is your take on this? Do ["prosumer" RAID NASes] support TLER? Is it worth spending the extra $$$ for TLER drives in these devices if not supported?
I pinged QNAP, Synology, NETGEAR and Buffalo to see what they had to say and I'll get to their responses shortly. But for those of you asking "What the heck is TLER?", let's first answer that question.
TLER or Time-Limited Error Recovery is Western Digital's method of improving drive error handling in RAID applications. Seagate has something similar—Error Recovery Control (ERC)—as do Samsung and Hitachi—Command Completion Time Limit (CCTL).
The aim of all of these techniques is to prevent drives from prematurely falling out of RAID arrays and forcing rebuilds, or worse, RAID volume loss.
All modern drives include automatic error correction, such as the ability to handle write errors and work around bad blocks. During error recovery, however, the drive doesn't talk to its controller until it has corrected the encountered error and can return the data or finish the write. This is illustrated in Figure 1, taken from a Samsung CCTL white paper.
Figure 1: Normal and error drive responses
When this happens in non-RAID situations, where there is no redundant data to recover from, the OS or device has no option but to wait for the drive to report back. If the drive takes a few seconds (or even dozens of seconds) to do this, you just see a short system hang or slowdown. If the drive never completes the read / write, you get a system hang and have to reboot, hopefully with most of your data still there.
In RAID applications, however, you do have redundant data, which the RAID controller is managing. So if a drive takes too long handling a bad block, at some point the RAID controller will stop waiting and complete the read / write using the other drives in the array. The key point here is what it does with the drive that didn't respond in time.
Figure 2: RAID fail due to untimely drive response
Figure 2, also from the Samsung white paper, assumes that the RAID controller waits only 8 seconds (or 7 for CCTL) before it marks the drive as bad, enters degraded (or "parity") mode and flags that a drive needs replacing. In reality, this isn't the case for NASes, most of which use software RAID.
The responses I received from Synology, QNAP, NETGEAR and Buffalo all indicated that their NAS RAID controllers don't depend on or even listen to TLER, CCTL, ERC or any other similar error recovery signal from their drives. Instead, their software RAID controllers have their own criteria for drive timeouts, retries and when a drive is finally marked bad.
These software RAID controllers are generally more patient: they wait significantly longer for a drive response and execute more retries before finally giving up and marking a drive dead. While this may degrade performance slightly when dealing with drives with bad blocks, it's intended to reduce the occurrences of drives dropping out of RAID volumes and the subsequent long, risky rebuilds.
I say "risky" because rebuilds take many hours, and sometimes days, for the large 4TB+ volumes possible with even today's medium-range NASes. And every second that an array is rebuilding is a chance for one more error that could kill the entire volume. (So do yourself a favor and save disk-intensive activities like video re-encoding, heavy database use, etc. for after your RAID volume has rebuilt.)
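If it helps to see the timeout interplay spelled out, here is a minimal Python sketch of the logic described above. It is a toy model, not anyone's actual firmware: the names (Drive, controller_verdict), the 20-second recovery time and the 60-second "patient" timeout are all made up for illustration, since none of the vendors gave me actual numbers. The 8-second controller timeout comes from the Samsung example, and treating the 7-second figure as the drive's own recovery limit is my reading of that figure.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Drive:
    name: str
    # Seconds the drive will spend on error recovery before giving up.
    # None models a desktop drive with no TLER / ERC / CCTL limit.
    recovery_limit_s: Optional[float]


def drive_response(drive: Drive, recovery_needed_s: float):
    """Return (seconds until the drive answers, whether it answers with an error)."""
    limit = drive.recovery_limit_s
    if limit is not None and recovery_needed_s > limit:
        return limit, True  # drive gives up at its limit and reports the error
    return recovery_needed_s, False  # drive keeps going until it has recovered


def controller_verdict(drive: Drive, recovery_needed_s: float,
                       controller_timeout_s: float) -> str:
    """What a (hypothetical) RAID controller does with this drive."""
    response_s, reported_error = drive_response(drive, recovery_needed_s)
    if response_s > controller_timeout_s:
        return "dropped from array"
    if reported_error:
        return "error reported, data rebuilt from redundancy"
    return "recovered by drive"


desktop = Drive("desktop", recovery_limit_s=None)
limited = Drive("TLER-style", recovery_limit_s=7.0)  # assumed limit

# 8 s = impatient controller from the Samsung example; 60 s = assumed patient one.
for timeout_s in (8.0, 60.0):
    for drive in (desktop, limited):
        verdict = controller_verdict(drive, recovery_needed_s=20.0,
                                     controller_timeout_s=timeout_s)
        print(f"timeout {timeout_s:>4.0f} s, {drive.name:<10}: {verdict}")
```

In this toy model, the drive without a recovery limit only gets dropped when the controller is impatient. Once the controller is willing to wait longer than the drive needs, both kinds of drive stay in the array, which is essentially what the NAS vendors told me.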
So is there any benefit to using TLER / CCTL / ERC drives? Maybe. These features usually come on "Enterprise" grade drives (WD Caviar RE series, Seagate Barracuda ES, ES.2, Samsung Spinpoint F1), which are built to take the constant, hard use of business environments. So investing in these more expensive drives is probably a smart move if your NAS is under constant heavy use. But it will be the more robust drive construction and not TLER / CCTL / ERC that will make your RAID NAS more reliable.
For further reading:
- Wikipedia: Time-Limited Error Recovery
- WD Knowledge Base: What is the difference between Desktop edition and RAID (Enterprise) edition hard drives?
- WD Info Sheet (pdf): Time-Limited Error Recovery (TLER) Information Sheet
- Samsung White Paper: Command Completion Time Limit (CCTL) RAID Error Recovery