About Pools

A pool is a group of storage devices combined to provide performance and redundancy via RAID levels. A pool's RAID level is defined when the pool is initially created and cannot be modified without destroying its data.

You must configure one or more data pools on a system to present storage to consumers via NFS or SMB. While there is no hard limit on the number of pools a system can have, usually fewer than four pools are configured on any given system.

BrickStor SP supports the following pool storage devices:

  • Mechanical hard drives

  • Solid state drives (SSD)

  • iSCSI raw volumes

  • Fibre Channel raw volumes

  • Virtual drives

From a systems administrator's point of view, a pool is a logical organization of independent drives that contains all information about the devices comprising it, including structure, filesystems, raw volumes, replication targets (if any), and so on. This information is encoded within the pool's metadata, which makes it possible to easily migrate pools between systems. This design also enables BrickStor's high availability capabilities, which can move pools, along with related network configuration, between nodes in the cluster.

Pool Types

This in-software implementation allows for various parity schemes as well as mirroring configurations. The following table explains the pool types that are available in BrickStor:

Table 1. Pool Types

  • disk (equivalent to RAID 0) – No parity; fast; total loss if a single device is lost. Useful for scratch-only data.

  • raidz1 (equivalent to RAID 50 / RAID 5+0) – Single parity; allows for the loss of a single disk device in each drive group (VDEV).

  • raidz2 (equivalent to RAID 60 / RAID 6+0) – Double parity; allows for the loss of two disk devices in each drive group (VDEV).

  • raidz3 (equivalent to RAID 6+) – Like raidz2, but with more parity protection: triple parity allows for the loss of three disk devices in each drive group (VDEV).

  • mirror (equivalent to RAID 10 / RAID 1+0) – A stripe of mirrors, where each mirror can contain two or more disk devices. Offers a balance of performance and availability with a capacity trade-off.
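The fault tolerance and capacity trade-offs in Table 1 can be summarized programmatically. The sketch below is illustrative only (not BrickStor code) and simplifies capacity to whole data-bearing drives, ignoring metadata and allocation overhead:

```python
# Illustrative sketch: fault tolerance and data-bearing drives per VDEV
# for each pool type in Table 1. Overhead and padding are ignored.

# parity drives per VDEV for each RAID-Z level ("mirror" is special-cased)
PARITY = {"disk": 0, "raidz1": 1, "raidz2": 2, "raidz3": 3}

def tolerated_failures(pool_type, disks_per_vdev):
    """Drive losses each VDEV can survive without data loss."""
    if pool_type == "mirror":
        return disks_per_vdev - 1   # all but one mirror member may fail
    return PARITY[pool_type]

def usable_disks(pool_type, disks_per_vdev):
    """Data-bearing drives per VDEV (capacity before overhead)."""
    if pool_type == "mirror":
        return 1                    # every member stores the same data
    return disks_per_vdev - PARITY[pool_type]

# A 5-disk RAID-Z1 VDEV survives 1 failure and stores ~4 disks of data.
print(tolerated_failures("raidz1", 5), usable_disks("raidz1", 5))
```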

Pool VDEV

VDEVs are the building blocks of a pool. A pool VDEV, also known as a stripe, is a virtual device that can be a single disk, two or more mirrored disks, or a group of disks with a parity scheme such as RAID-5. A VDEV abstracts away a unit of storage, which may or may not have redundancy.

Pools are groups of virtual devices, usually implemented with some data protection scheme such as RAID or mirroring, on top of which filesystems and raw block devices are provisioned. A typical hybrid pool is a mix of mechanical drives and solid-state drives. In such a pool, data is redundantly stored on large-capacity, slower, typically mechanical devices arranged into a parity scheme that satisfies data protection as well as capacity and IOPS requirements. High-bandwidth, low-latency solid-state drives are used for caching to accelerate reads and for handling synchronous writes, enabling a much better cost-to-performance ratio than traditional purely mechanical or purely solid-state configurations. BrickStor also supports all-flash pools, which continue to leverage RAM for caching and use solid-state disks instead of mechanical disks to provide consistently lower latency and higher IOPS.

Pool Hierarchy and Containers

Pools include special containers that are used for organizing datasets and volumes so that they always reside in the same location within the pool:

  1. Global – Contains all the datasets and other containers except for the tenant containers on a Pool

  2. Volume Container – Contains all virtual block devices which are special datasets exposed over iSCSI

  3. Replication – Top-level container for all incoming replication streams from other pools within the same BrickStor or from other BrickStors

  4. Meta – Contains all of the user behavior audit data and the snapshot index data

Adaptive Replacement Cache

Adaptive Replacement Cache (ARC) is a portion of memory in the controller dedicated to caching recently accessed data. The ARC caches both recently written data, on the assumption that it may be read soon after being written, and recently read data, on the assumption that it may be read again. Depending on its popularity, data may remain in the cache for a long time or be evicted in favor of other data, based on criteria that both the user and the system can optimize for.
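The core ARC idea of balancing recency against frequency can be sketched as below. This is a heavily simplified illustration, not the real algorithm: the actual ARC also keeps "ghost" lists of recently evicted entries and adaptively resizes the split between the two lists.

```python
from collections import OrderedDict

# Heavily simplified ARC-like cache: blocks seen once live in an MRU
# (recently used) list; blocks seen again are promoted to an MFU
# (frequently used) list, making them harder to evict.

class MiniARC:
    def __init__(self, capacity):
        self.capacity = capacity
        self.mru = OrderedDict()  # accessed once recently
        self.mfu = OrderedDict()  # accessed more than once

    def access(self, key, value=None):
        if key in self.mfu:                 # repeat hit: refresh position
            self.mfu.move_to_end(key)
        elif key in self.mru:               # second hit: promote to MFU
            self.mfu[key] = self.mru.pop(key)
        else:                               # miss: cache as recently used
            self.mru[key] = value
        while len(self.mru) + len(self.mfu) > self.capacity:
            victim = self.mru if self.mru else self.mfu
            victim.popitem(last=False)      # evict least recently used

cache = MiniARC(capacity=3)
for block in ["a", "b", "a", "c", "d"]:
    cache.access(block)
print(sorted(cache.mru), sorted(cache.mfu))  # "a" earned a spot in MFU
```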

Read Cache

The read cache is an optional SSD cache device that extends the amount of data cached for read operations. When data is evicted from the ARC, it may move to the L2ARC (based upon user configuration settings). Data read from the L2ARC is moved back into the ARC.

Write Cache

RackTop uses a journal methodology for its write cache, implemented in most systems as a mirrored SSD pair. The journal is both a software concept and a core physical component: a write-ahead log used to reduce latency on storage when synchronous writes are issued by clients. RackTop frequently refers to the journal as a ZIL, an intent log, or a log device. For synchronous writes, data is committed to this journal and periodically pushed to primary storage. The journal guarantees that data residing in cache is protected from loss on power failure before the cache is flushed to stable storage.

A log device is normally only written to and never read from. The log device, i.e. the journal, is present to protect the system from unexpected interruptions such as power loss, a system crash, or loss of storage connectivity. In the rare instances where recovery is necessary after a power loss or some other catastrophe, the journal is read in order to recreate a consistent state of the pool. This may require rolling back some transactions, but it restores the pool to a consistent state, unlike traditional storage systems where only best effort is promised. RackTop recommends mirroring journal devices to prevent the loss of a journal device, which has performance and potential availability impact. In all pools configured at the factory prior to system shipping, the journal, if present, is mirrored.
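The write-ahead-log behavior described above can be sketched conceptually. This is a toy in-memory model, not BrickStor code: a synchronous write is acknowledged once it reaches the journal, the main store is updated later, and the journal is replayed only during recovery.

```python
# Conceptual sketch of a journal (write-ahead log): acknowledge
# synchronous writes once journaled, flush to primary storage later,
# and replay the journal after a crash to restore consistency.

class JournaledStore:
    def __init__(self):
        self.journal = []   # stands in for the mirrored SSD log device
        self.store = {}     # stands in for the main pool storage

    def sync_write(self, key, value):
        self.journal.append((key, value))  # durable before acknowledging

    def flush(self):
        for key, value in self.journal:    # periodic push to primary storage
            self.store[key] = value
        self.journal.clear()               # log is never read in normal use

    def recover(self):
        self.flush()  # after a crash, replay the journal into the store

s = JournaledStore()
s.sync_write("block1", b"data")
# simulate a crash before flush: store is stale, but the journal has the write
assert "block1" not in s.store
s.recover()
assert s.store["block1"] == b"data"
```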

Resilvering

Resilvering is the process of rebuilding a disk within a VDEV after a drive has been replaced. BrickStor OS does not have an equivalent of the fsck repair tool common on Unix filesystems. Instead, the filesystem has a repair tool called "scan" which examines and repairs silent corruption and other problems. Scan can run while the volume is online; it checks everything, including both metadata and data. The process works from the top down and only writes the data that is needed to disk. If a disk was temporarily offline, it only has to rebuild the data that was missed while the device was offline.

RAID Performance

BrickStor uses mirrors and RAID-Z for disk level redundancy within VDEVs.

RAIDZ

RAID-Z VDEVs are a variant of RAID-5 and RAID-6:

  • You can choose the number of data disks and the number of parity disks. Today, the number of parity disks is limited to 3 (RAID-Z3).

  • Each data block that is handed over to ZFS is split up into its own stripe of multiple disk blocks at the disk level, across the RAID-Z VDEV. This is important to keep in mind: Each individual I/O operation at the file system level will be mapped to multiple, parallel and smaller I/O operations across members of the RAID-Z VDEV.

  • When writing to a RAID-Z VDEV, ZFS will use a best fit algorithm when the VDEV is less than 90% full.

  • Write transactions in ZFS are always atomic, even when using RAID-Z: each write operation is only finished once the uberblock has been successfully written to disk. This means there is no possibility of suffering from the traditional RAID-5 write hole, in which a power failure can leave a partially written (and therefore broken) RAID-5 set of blocks.

  • Due to the copy-on-write nature of ZFS, there’s no read-modify-write cycle for changing blocks on disk: ZFS writes are always full stripe writes to free blocks. This allows ZFS to choose blocks that are in sequence on the disk, essentially turning random writes into sequential writes, maximizing disk write capabilities.

Just like traditional RAID-5 and RAID-6, you can lose up to 1 disk or 2 disks respectively without losing any data using RAID-Z1 and RAID-Z2. And just like ZFS mirroring, for each block at the file system level, ZFS can try to reconstruct data out of partially working disks, as long as it can find a critical number of blocks to reconstruct the original RAID-Z group.
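The stripe-splitting behavior described above can be illustrated with simple arithmetic. This sketch is a simplification (real ZFS allocation rounds to device sector sizes and may add padding), showing how one filesystem-level block fans out into smaller parallel per-disk I/Os:

```python
import math

# Illustrative sketch: a filesystem block written to a RAID-Z VDEV is
# split into smaller per-disk I/Os across the data disks, plus parity.
# Sector rounding and padding are ignored for simplicity.

def raidz_layout(block_size, disks_per_vdev, parity):
    data_disks = disks_per_vdev - parity
    per_disk_io = math.ceil(block_size / data_disks)
    return {"data_disks": data_disks,
            "parity_disks": parity,
            "per_disk_io_bytes": per_disk_io}

# One 128 KiB block on a 5-disk RAID-Z1 VDEV: four 32 KiB data chunks
# plus one parity chunk, written in parallel.
print(raidz_layout(128 * 1024, disks_per_vdev=5, parity=1))
```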

This walkthrough primarily covers hardware-centric deployments and may not reflect behavior in virtual deployments.

Performance of RAIDZ

When the system writes to a pool, it writes to the VDEVs in a stripe. A VDEV in a RAID-Z configuration has the IOPS and performance characteristics of the single slowest disk in that VDEV (not a summation of the disks). This is because a read from disk requires a piece of data from every disk in the VDEV to complete. So, a pool with 3 RAID-Z1 VDEVs of 5 disks each has the raw IOPS performance of 3 disks. Caching may deliver better performance than this, but this is the most raw IOPS the pool can deliver from disk. The more VDEVs in the pool, the better the performance.
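The rule above reduces to simple arithmetic: pool IOPS scale with the number of VDEVs, not the number of disks. The per-disk IOPS figure below is a hypothetical value for illustration, not a measured or vendor-published number:

```python
# Back-of-the-envelope sketch: a RAID-Z VDEV delivers roughly the IOPS
# of a single disk, so raw pool IOPS scale with the VDEV count.

def raidz_pool_raw_iops(n_vdevs, per_disk_iops):
    return n_vdevs * per_disk_iops  # one disk's worth of IOPS per VDEV

# The example from the text: 3 RAID-Z1 VDEVs of 5 disks each, assuming
# a hypothetical ~150 IOPS per mechanical drive.
print(raidz_pool_raw_iops(3, 150))  # 450 raw IOPS from 15 disks
```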

Performance of Mirrors

When the VDEVs are configured as mirrors, the configuration of the pool is equivalent to RAID-10. A pool with mirrored VDEVs will always outperform other configurations. A read from disk only needs data from one disk in the mirror. As with RAID-Z, the more VDEVs, the better the performance. Resilver times with mirrored VDEVs are faster than with RAID-Z and have less of a performance impact on the overall system during resilvering. RackTop recommends mirrored VDEVs in environments with high random I/O, such as virtualization, because they provide the highest performance.
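The contrast between mirrors and RAID-Z can be sketched with the same back-of-the-envelope model: mirrors can serve each read from any one member, so read IOPS scale with total disks, while writes (and all RAID-Z I/O) scale with the VDEV count. The per-disk IOPS default is a hypothetical figure for illustration:

```python
# Simplified comparison of mirrored vs RAID-Z pools built from the
# same number of disks. Caching effects are ignored.

def pool_iops(layout, n_vdevs, disks_per_vdev, per_disk_iops=150):
    total_disks = n_vdevs * disks_per_vdev
    if layout == "mirror":
        return {"read": total_disks * per_disk_iops,   # any member serves a read
                "write": n_vdevs * per_disk_iops}      # writes hit every member
    return {"read": n_vdevs * per_disk_iops,           # RAID-Z: one disk's IOPS
            "write": n_vdevs * per_disk_iops}          # per VDEV either way

# Same 12 disks: six 2-way mirrors vs two RAID-Z2 VDEVs of 6 disks.
print(pool_iops("mirror", 6, 2))
print(pool_iops("raidz", 2, 6))
```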

Compression

Compression is performed inline and at the block level. It is transparent to all other layers of the storage system. Each block is compressed independently, and all-zero blocks are converted into file holes. To prevent "inflation" of already-compressed or incompressible blocks, BrickStor requires a compression savings of at least 12.5%; otherwise the block is written in uncompressed format. BrickStor supports compression via LZJB, GZIP (levels 1-9), ZLE, and LZ4. RackTop finds that LZ4 works very well, balancing speed and compression performance. It is common to realize a 1.3 to 1.6 compression ratio with highly compressible data, which not only optimizes storage density but also improves write performance due to the reduction in disk I/O. RackTop recommends always using compression because any CPU penalty is typically outweighed by the savings in storage and bandwidth to the disk.
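The per-block decision described above can be sketched as follows. This uses zlib purely as a stand-in (LZ4 is not in the Python standard library) and is an illustration of the 12.5% threshold logic, not the actual on-disk implementation:

```python
import os
import zlib

# Sketch of the inline-compression decision: all-zero blocks become
# file holes, and a block is stored compressed only if it shrinks by
# at least 12.5%; otherwise it is written raw to avoid inflation.

THRESHOLD = 0.125  # minimum savings required to store the compressed form

def store_block(block: bytes):
    if block == b"\x00" * len(block):
        return ("hole", b"")               # all-zero blocks become file holes
    compressed = zlib.compress(block)
    if len(compressed) <= len(block) * (1 - THRESHOLD):
        return ("compressed", compressed)
    return ("raw", block)                  # incompressible data stays raw

print(store_block(b"\x00" * 4096)[0])      # hole
print(store_block(b"abc" * 2000)[0])       # compressed
print(store_block(os.urandom(4096))[0])    # raw: random data will not compress
```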

Deduplication

Deduplication is performed inline and at the block level; like compression, deduplication is transparent to all other layers of the storage system. For deduplication to work as expected, the blocks written to the system must be aligned. Turning deduplication off does not reverse the deduplication of blocks already written to the system; that can only be accomplished by copying or moving the data. Deduplication negatively impacts system performance if data is not significantly duplicative, because an extra lookup must be done on every write to determine whether the block is a duplicate, and on every delete to determine whether it is the last reference. Additionally, the deduplication table must be stored in RAM, taking up space that could otherwise be used for metadata and caching. Should the deduplication table not fit entirely in RAM, system performance will degrade sharply because every read and write operation will require the system to reread the dedup table from disk.

Deduplication is only supported on All SSD Pools.
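The RAM cost of the deduplication table can be estimated with simple arithmetic. The ~320 bytes per unique block used below is a commonly cited ZFS rule of thumb, not a BrickStor-published figure; treat it as an assumption:

```python
# Rough sizing sketch for the in-RAM deduplication table. The bytes-
# per-block constant is an assumed ZFS rule of thumb, not exact.

DDT_BYTES_PER_BLOCK = 320  # assumed dedup-table entry size per unique block

def dedup_table_ram_bytes(data_bytes, record_size=128 * 1024):
    unique_blocks = data_bytes // record_size
    return unique_blocks * DDT_BYTES_PER_BLOCK

# 10 TiB of unique data at a 128 KiB record size needs ~25 GiB of RAM
# for the dedup table alone.
tib = 1024 ** 4
print(dedup_table_ram_bytes(10 * tib) / 1024 ** 3)
```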

Clones

ZFS clones create an active version of a snapshot. By creating a snapshot of a base VM and using clones of that same snapshot, you can have an unlimited number of copies of the same base virtual machine without taking up more storage capacity. The only increased storage footprint comes from the deltas, or differences, between clones. Additionally, since each VM references the same set of base data blocks, the system and user benefit from caching because all VMs utilize the same blocks of data.
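The space savings are simple to quantify: N clones of one base snapshot consume a single copy of the base plus each clone's individual changes, rather than N full copies. The sizes below are hypothetical figures for illustration:

```python
# Arithmetic sketch of clone space accounting: clones share the base
# snapshot's blocks and only their deltas consume new capacity.

def clone_footprint_gib(base_gib, deltas_gib):
    return base_gib + sum(deltas_gib)  # one base copy plus per-clone changes

base = 40                  # hypothetical 40 GiB base VM image
deltas = [2, 3, 1, 5]      # hypothetical changed data per clone
print(clone_footprint_gib(base, deltas))   # clones: 51 GiB for 4 VMs
print(base * len(deltas))                  # full copies: 160 GiB for 4 VMs
```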

Imbalance of VDEV Capacity

If you wish to grow the capacity of a pool by adding another VDEV, you should add a VDEV of equivalent size to the other VDEVs in the pool. If the other VDEVs are already past 90% capacity, they will still be slow, because existing data will not automatically balance or spread across all VDEVs after the additional capacity is added. To force a rebalance in a VMware environment, you can perform a vMotion or storage migration; because of the copy-on-write characteristics of ZFS, rewriting the data in this way rebalances the pool across all VDEVs.
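A toy simulation makes the imbalance behavior concrete. This sketch picks the emptiest VDEV for each new write, a simplification of real ZFS allocation (which weights allocations across VDEVs rather than picking a single winner); existing blocks stay put until they are rewritten:

```python
# Simplified sketch of why a newly added VDEV does not fix imbalance by
# itself: only new (or rewritten, via copy-on-write) data lands on it.

def allocate(vdev_free, write_size):
    """Send each write to the VDEV with the most free space (simplified)."""
    target = max(range(len(vdev_free)), key=lambda i: vdev_free[i])
    vdev_free[target] -= write_size
    return target

free = [10, 10, 100]       # two nearly full VDEVs plus one new empty VDEV
targets = [allocate(free, 1) for _ in range(30)]
print(targets.count(2))    # new writes land on the new VDEV; old data stays
```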

Hot spares

A pool can have one or more hot spares to replace a faulted storage device automatically. A spare drive must be configured to become a hot spare.

For bare-metal BrickStor SP installations, it is a best practice to have at least one hot spare for each pool. Additionally, it is good practice to size the number of hot spares based on the number of drives in the pool.

Virtual machines generally do not benefit from using hot spares and RAID. They primarily rely on the resiliency and performance of the underlying NAS/SAN solution.

Pool Management

The Hub Pool Management page provides a way to create new pools and administer existing ones.

To access Pool Management:

  1. Navigate to the Pools page.

    pools navigation

  2. Click the desired pool’s name in the list.

Pool Status

Pool health is reported as one of the following statuses:

  • HA TRANSITIONING - The pool is in the middle of HA cluster failover to the other node.

  • EXPORTED - The pool is currently exported. In HA cluster configuration the pool is exported on both cluster nodes.

  • IMPORTED BY OTHER HA NODE - The pool is currently imported on the other HA cluster head node.

  • ONLINE - The pool is healthy and operating normally.

  • ONLINE : Warnings Detected - The pool is functioning normally, but contains warnings.

  • ONLINE : Errors Detected - The pool is functioning normally, but contains errors.

  • DEGRADED - One or more pool storage devices have failed, but the data is still available due to the additional parity.

  • FAULTED - One or more pool storage devices are faulted with insufficient replicas to continue functioning. The data is not accessible when the pool is in a faulted state.

  • UNAVAIL - The pool could not be opened.

  • SCANNING - The pool Scan is in progress.

    Scanning and Resilvering are I/O-intensive operations and can impact system performance.
  • RESILVERING - The pool is in the process of rebuilding the new drive from the remaining drives in the group. This is normal after a faulty drive is replaced.

Disk Device Details

Drives can be viewed in detail and managed by accessing the drive information pane located on any given pool.

To view the drive information pane:

  1. Navigate to a pool.

  2. Click the information icon located under the Structure section.

    drive information icon

  3. View the details and actions that can take place on the selected disk device.

    disk details pane

Actions will vary depending on the type of disk device selected.