Why ZFS?: Pools & Deduplication

Editor’s Note: Infortrend is the only manufacturer we know of using ZFS in a commercially-produced NAS. So we asked them if they would like to make the case for their selection of ZFS, especially since our testing showed the performance penalty that this filesystem is known for.

We realize that this is providing Infortrend a “bully pulpit” that could be viewed as self-promotion. So we welcome opposing fact-based opinions in the Forums or even as its own article(s).

Introduction

This is part two of a series that discusses ZFS and its applications for network attached storage servers. Previously, we did a quick overview of ZFS – explaining its main strengths such as ensuring total data integrity and also its weaknesses, such as CPU & RAM requirements. In this article, we will continue discussing two specific features of ZFS – Storage Pools and Data Deduplication.

Storage Pools

Traditional file systems are built around single storage devices, with volume managers, partitioning and provisioning used to manage storage space across multiple devices. ZFS fundamentally changes this by using storage pools, which are made up of the hard drives, partitions, files and other storage devices that are connected to the system.

Within a storage pool (also called a "zpool") there are "vdevs" that consist of files, hard drive partitions or entire hard drives (the last being the recommended option). These vdevs can be configured in different ways before being added to the zpool, such as without redundancy or as mirrors, depending on needs. By adding vdevs, zpools can be expanded at any time. Much like adding RAM to a computer, additional vdevs become available automatically without further configuration.
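To make the pool-and-vdev relationship concrete, here is a toy Python model of how usable capacity grows as vdevs are added. It is only a sketch: the `Vdev` and `Pool` classes, the layout names and the 4 TB disk sizes are our own illustrative assumptions, not part of ZFS or any ZFS tool, and real pools lose some capacity to metadata and other overhead.

```python
from dataclasses import dataclass, field


@dataclass
class Vdev:
    """Hypothetical stand-in for a vdev built from equal-sized disks."""
    layout: str     # "stripe" (no redundancy) or "mirror"
    disk_tb: float  # size of each disk in TB
    disks: int      # number of disks in the vdev

    def usable_tb(self) -> float:
        # A mirror keeps the same data on every disk, so it yields
        # one disk's worth of space; a stripe yields all of it.
        if self.layout == "mirror":
            return self.disk_tb
        return self.disk_tb * self.disks


@dataclass
class Pool:
    """Hypothetical stand-in for a zpool: just a collection of vdevs."""
    vdevs: list = field(default_factory=list)

    def add_vdev(self, vdev: Vdev) -> None:
        # New space is available as soon as the vdev joins the pool.
        self.vdevs.append(vdev)

    def usable_tb(self) -> float:
        return sum(v.usable_tb() for v in self.vdevs)


pool = Pool()
pool.add_vdev(Vdev("mirror", 4.0, 2))
print(pool.usable_tb())  # 4.0

pool.add_vdev(Vdev("mirror", 4.0, 2))
print(pool.usable_tb())  # 8.0
```

The point of the model is that the pool has no fixed size of its own. Its capacity is simply whatever its vdevs contribute at the moment.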

The benefits of storage pools are that they maximize storage space, speed and availability, while removing the complexities involved in volume managers. Storage pools can also contain hot spares that take over for failing disks. However, individual vdevs should have redundancy, because if a single vdev is lost, the entire zpool is lost with it.
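The reason for putting redundancy inside each vdev can be shown with a small Python sketch. The function names, layout labels and disk names below are hypothetical, and the model is deliberately simplified: a mirror survives as long as one of its disks does, a non-redundant vdev fails if any of its disks fails, and the pool survives only if every vdev does.

```python
def vdev_survives(layout, disks, failed):
    """Return True if the vdev still holds its data after `failed` disks die."""
    alive = [d for d in disks if d not in failed]
    if layout == "mirror":
        # Any one surviving copy is enough.
        return len(alive) >= 1
    # No redundancy: every disk must survive.
    return len(alive) == len(disks)


def pool_survives(vdevs, failed):
    """The pool is only as strong as its weakest vdev."""
    return all(vdev_survives(layout, disks, failed) for layout, disks in vdevs)


mirrored = [("mirror", ["a", "b"]), ("mirror", ["c", "d"])]
mixed = [("mirror", ["a", "b"]), ("single", ["c"])]

print(pool_survives(mirrored, {"a"}))       # True: the mirror still has disk b
print(pool_survives(mirrored, {"a", "b"}))  # False: one whole vdev is gone
print(pool_survives(mixed, {"c"}))          # False: the unprotected vdev failed
```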

Deduplication

Data deduplication is a safe and efficient way to optimize storage capacity, making it one of the key features of ZFS. Before writing data, ZFS generates a checksum of the data. If a block with a matching checksum already exists in the pool, ZFS references the existing block instead of storing a second copy. Avoiding duplicate data saves space and can improve system performance.

ZFS’s deduplication is an inline process – occurring when the data is written and not as a potentially timewasting post-process. ZFS’s innate data integrity measures also greatly reduce the likelihood that non-duplicate data will be corrupted. Moreover, data deduplication scales with the total size of the ZFS pool.
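The idea behind inline, checksum-based deduplication can be sketched in a few lines of Python. This is not how ZFS is implemented. The `DedupStore` class is a hypothetical illustration, it works on whole byte strings rather than on-disk blocks, and SHA-256 from Python's standard `hashlib` module simply stands in for whichever checksum the pool is configured to use.

```python
import hashlib


class DedupStore:
    """Toy inline-deduplicating store keyed by content checksum."""

    def __init__(self):
        self.blocks = {}    # checksum -> the single stored copy
        self.refcount = {}  # checksum -> number of writes sharing that copy

    def write(self, data: bytes) -> str:
        # The checksum is computed at write time, before anything is stored.
        key = hashlib.sha256(data).hexdigest()
        if key in self.blocks:
            # Duplicate: record another reference, store nothing new.
            self.refcount[key] += 1
        else:
            self.blocks[key] = data
            self.refcount[key] = 1
        return key

    def stored_bytes(self) -> int:
        return sum(len(block) for block in self.blocks.values())


store = DedupStore()
attachment = b"quarterly-report" * 1024  # the same file mailed to three people

for _ in range(3):
    store.write(attachment)
store.write(b"unique data")

print(len(store.blocks))     # 2 distinct blocks kept for 4 writes
print(store.stored_bytes())  # far less than 3 full copies of the attachment
```

The lookup table that maps checksums to stored blocks is also what drives the memory cost of deduplication: the more unique data a pool holds, the larger that table becomes.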

When used in an application where there is high potential for duplicated data, such as file sharing or email servers, the potential for both saving storage space and improving performance can be quite significant.

The downside to data deduplication is the same as that of ZFS itself: CPU and memory requirements. For deduplication to run efficiently, some experts recommend between 1 and 2 GB of RAM for every 1 TB of storage. One way to keep deduplication running at high speed is to use SSDs to hold the ZFS intent log and the second-level Adaptive Replacement Cache (L2ARC), as previously discussed in Part 1.
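As a quick worked example of that rule of thumb, the Python snippet below turns a pool size into a RAM range. The function name and the 16 TB pool are our own assumptions, and the 1 to 2 GB per TB figures are only the rough guideline quoted above, not a sizing guarantee.

```python
def dedup_ram_range_gb(pool_tb, low_gb_per_tb=1.0, high_gb_per_tb=2.0):
    """Return the (low, high) RAM estimate in GB for a pool of `pool_tb` TB."""
    return pool_tb * low_gb_per_tb, pool_tb * high_gb_per_tb


low, high = dedup_ram_range_gb(16)
print(f"A 16 TB pool suggests roughly {low:.0f} to {high:.0f} GB of RAM")
```

For a 16 TB pool this works out to somewhere between 16 and 32 GB, which is more memory than many small NAS units ship with.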

William Chen is Director of SMB Product Strategy for Infortrend, a Taiwan-based manufacturer of high-performance networked storage systems.
