Data Recovery Tales: Prepare The Right Way For RAID Failure

Tim Higgins

Those of us who have had to recover data from a failed RAID volume often ask what type of array to choose next. That, however, is the wrong question. Array type, contrary to popular belief, is not that important. What to do next revolves more around your backup strategy and the tactics used to carry it out in your data storage plan.

In data storage, backup strategy involves four decisions (sketched in code after this list):

  • Which data should be backed up (and which should not)
  • How often data is backed up
  • Whether version control is required
  • How fast the data is needed in case of failure
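
To make this concrete, here is a minimal sketch that captures those four decisions as plain data. The class and field names are illustrative, not taken from any particular backup product:

    from dataclasses import dataclass

    @dataclass
    class BackupPolicy:
        include: list[str]         # which data should be backed up
        exclude: list[str]         # and which should not
        interval_hours: float      # how often data is backed up
        keep_versions: int         # how much version history is required
        max_restore_hours: float   # how fast the data is needed after a failure

    # A family photo archive: irreplaceable, rarely modified.
    photos = BackupPolicy(["/home/photos"], [], 24, 10, 72)

    # Volatile test data and application builds: recreatable, deliberately skipped.
    builds = BackupPolicy([], ["/var/builds"], 0, 0, 0)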

Tactics involve how a chosen backup strategy is implemented. For example, will it be online backup or just an external hard drive? Both the strategy and the tactics can come back to bite you if they are not chosen thoughtfully.

Backup Strategy

Choice of backup strategy depends on the data: its importance, volume, modification frequency and other characteristics. The strategy for storing a family photo archive is fundamentally different from that for volatile files such as application builds.

Common sense might dictate that everything should be backed up. But there are cases when having no backup is a valid decision. For example, we do not back up several RAID volumes storing a test set of files and folders. This data is of little value and can be recreated should the need arise.

Once backup is in place, RAID level does not matter because RAID is not backup! Instead, version control becomes a key concern. This is because RAID automatically and immediately propagates any changes in data, including errors, to all disks in the array. Without multiple versions of key data stored, your backup could be corrupted and useless.

Fail-stop

Any system that uses automatic replication of data, whether RAID or file-by-file replication, needs a fail-stop mechanism. A fail-stop system shuts down or switches to an inactive state when a failure occurs, preventing propagation of further errors to other systems / devices.
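
As an illustration, here is a minimal sketch of a fail-stop guard in a replication path, assuming a checksum was recorded when the block was originally written. On a mismatch it halts instead of copying the bad block to every mirror:

    import hashlib

    def replicate(block: bytes, sha256_hex: str, mirrors: list[bytearray]) -> None:
        # Fail-stop: verify the block before propagating it, so a single
        # corrupt read cannot overwrite every good copy downstream.
        if hashlib.sha256(block).hexdigest() != sha256_hex:
            raise SystemExit("checksum mismatch: halting instead of propagating")
        for mirror in mirrors:
            mirror[:] = block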

A bearing failure in one of the array's hard drives, for example, is a fail-stop event. Once the drive stops spinning, it is no longer able to do anything else. The RAID controller marks the drive offline and starts serving its data through parity reconstruction.
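
A minimal sketch of why that works, using RAID 5-style XOR parity on toy two-byte blocks. Parity is the XOR of the data blocks, so any single missing block is the XOR of everything that survives:

    # Three data disks plus one parity disk, two bytes per block.
    data = [b"\x10\x20", b"\x0a\x0b", b"\x55\x66"]
    parity = bytes(a ^ b ^ c for a, b, c in zip(*data))

    # Disk 1 fail-stops: it returns nothing at all, never wrong data.
    surviving = [data[0], data[2], parity]
    rebuilt = bytes(a ^ b ^ c for a, b, c in zip(*surviving))
    assert rebuilt == data[1]   # the lost disk's contents are recovered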

On the other hand, a controller cache memory failure is not fail-stop. The cache delivers wrong data, but the controller does not know that it is wrong. Wrong data propagates everywhere, corrupting all the redundant copies of data. The same applies to human error, because the storage system does not understand user intent, blindly following commands instead.

Even if your system is organized so that data is copied on schedule to another computer, which may even be located in another country, it is still effectively RAID 1 with a delay between copies, not a backup. An example is this relatively recent case, where wrong data was "backed up", overwriting the correct data.

Version Depth

Version depth, also known as retention time, is the amount of time that a copy of data is kept. Version depth must be longer than the time between real uses of the data. For example, if the most recent use of a file was in 2011, a copy from that time must still be in the backup. Then, if a user goes to open that file in 2013 and finds it damaged, there won't be a problem; the 2011 copy will be there.

A common error scenario is when the "Save" button is used on a master file (a document template, for example) when "Save As" should have been used to create a new version. If the user fails to notice the problem immediately, the error will probably only be spotted when the original file is needed. To repeat: version depth / retention time must be long enough to still hold the original version, as in the sketch below.
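
Here is a minimal sketch of versioned copies with a retention window; the function name and the two-year window are illustrative, not a recommendation:

    import shutil, time
    from pathlib import Path

    RETENTION_DAYS = 2 * 365   # version depth: must outlive the gap
                               # between real uses of the file

    def backup_with_versions(src: Path, dest_dir: Path) -> None:
        # Keep every copy under a timestamped name, then prune only
        # the versions that have aged out of the retention window.
        dest_dir.mkdir(parents=True, exist_ok=True)
        stamp = time.strftime("%Y%m%d-%H%M%S")
        shutil.copy2(src, dest_dir / f"{src.name}.{stamp}")
        cutoff = time.time() - RETENTION_DAYS * 86400
        for old in dest_dir.glob(f"{src.name}.*"):
            if old.stat().st_mtime < cutoff:
                old.unlink()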

Scrubbing

Life is such that the longer you go without doing (or checking) something, the lower the chance that you will be able to do it the next time you need to. High-uptime systems are one good example: as uptime increases, the chance of a successful restart decreases.

The same is true for data storage systems. The longer you do not actively work with the data, for example do not open files and do not check data correctness, the lower the chance that the data is still intact. Moreover, to avoid backing up data that is already corrupted, you should periodically verify that the backup holds a copy that is still readable and usable, as in the sketch below.
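
A minimal scrubbing sketch, assuming a manifest of SHA-256 hashes was recorded when the backup was made (the manifest format and names are illustrative):

    import hashlib, json
    from pathlib import Path

    def scrub(backup_dir: Path, manifest: Path) -> list[str]:
        # Re-read every backed-up file and compare it against the hash
        # recorded at backup time; anything that differs is silent rot.
        expected = json.loads(manifest.read_text())   # {"name": "hex digest"}
        damaged = []
        for name, digest in expected.items():
            f = backup_dir / name
            if not f.is_file() or hashlib.sha256(f.read_bytes()).hexdigest() != digest:
                damaged.append(name)
        return damaged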

Tactics

Choice of tactics is determined by the particular requirements for data storage. Generally, there are simple home solutions like an external hard drive to which data is copied periodically. The most common problem with this method is forgetfulness. Online backup services, such as Backblaze or CrashPlan, are a viable alternative. These are fully automatic and often come with version control built-in.
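
If you do stick with an external drive, the forgetfulness problem can be reduced by scheduling the copy. A minimal sketch, assuming the drive mounts at /mnt/backup (both paths are illustrative):

    import shutil, sys
    from pathlib import Path

    SOURCE = Path.home() / "Documents"
    DRIVE = Path("/mnt/backup")   # assumed mount point of the external drive

    # Run daily from cron or Task Scheduler. Refusing to run when the drive
    # is absent is itself a small fail-stop: no copy is better than a copy
    # to the wrong place.
    if not DRIVE.is_dir():
        sys.exit("Backup drive not mounted; skipping this run.")
    shutil.copytree(SOURCE, DRIVE / "Documents", dirs_exist_ok=True)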

Once you have a suitable backup plan in place, you can safely use RAID to reduce downtime. If you have experienced a RAID failure, it is important to find out why it failed. More often than not, changing RAID level or switching controllers does nothing to prevent a recurrence of the original issue. The most common causes of RAID failure are operator error, errors during disk replacement and errors when working with RAID management software. RAID level has little effect in these cases.


Elena Pakhomova does marketing and development for data recovery software company ReclaiMe.com.
