A deep dive into the OpenZFS 2.1 distributed RAID topology

dRAID vdevs resilver very quickly, using spare capacity rather than spare disks.

On Friday afternoon, the OpenZFS project released version 2.1.0 of our perennial favorite "it's complicated but it's worth it" filesystem. The new release is compatible with FreeBSD 12.2-RELEASE and later, and with Linux kernels 3.10 through 5.13. This release offers several general performance improvements as well as a few entirely new features, most of which target enterprise and other very advanced use cases.

Today we want to focus on arguably the biggest feature OpenZFS 2.1.0 adds: the dRAID vdev topology. dRAID has been in active development since at least 2015 and reached beta status when it was merged into OpenZFS master in November 2020. Since then, it has been heavily tested in several major OpenZFS development shops, which means today's release is "new" to production status, not "new" as in untested.

Distributed RAID (dRAID) overview

If you already thought ZFS topology was a complex topic, prepare to have your mind blown. Distributed RAID (dRAID) is an entirely new vdev topology that we first encountered in a presentation at the 2016 OpenZFS Developer Summit.

When creating a dRAID vdev, the administrator specifies the number of data, parity, and hotspare sectors per stripe. These numbers are independent of the number of physical disks in the vdev. We can see this in the following example taken from the dRAID Basic Concepts documentation:

    root@box:~# zpool create mypool draid2:4d:1s:11c wwn-0 wwn-1 wwn-2 .. wwn-A
    root@box:~# zpool status mypool

      pool: mypool
     state: ONLINE
    config:

            NAME                  STATE     READ WRITE CKSUM
            mypool                ONLINE       0     0     0
              draid2:4d:11c:1s-0  ONLINE       0     0     0
                wwn-0             ONLINE       0     0     0
                wwn-1             ONLINE       0     0     0
                wwn-2             ONLINE       0     0     0
                wwn-3             ONLINE       0     0     0
                wwn-4             ONLINE       0     0     0
                wwn-5             ONLINE       0     0     0
                wwn-6             ONLINE       0     0     0
                wwn-7             ONLINE       0     0     0
                wwn-8             ONLINE       0     0     0
                wwn-9             ONLINE       0     0     0
                wwn-A             ONLINE       0     0     0
            spares
              draid2-0-0          AVAIL

dRAID topology

In the example above, we have eleven disks: wwn-0 through wwn-A. We created a single dRAID vdev with 2 parity devices, 4 data devices, and 1 spare device per stripe - in condensed jargon, a draid2:4:1.

Even though we have eleven disks in total in the draid2:4:1, only six are used in each data stripe, and one in each physical stripe. In a world of perfect vacuums, frictionless surfaces, and spherical chickens, the on-disk layout of a draid2:4:1 would look something like this:
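The vdev specification in the example can be unpacked mechanically. The following is a minimal Python sketch, not part of OpenZFS: `DraidSpec`, `parse_draid`, and `data_fraction` are hypothetical names, the parser insists on explicit `d` and `c` fields instead of reproducing zpool's defaults, and the capacity figure is a rough estimate that ignores padding and metadata overhead.

```python
import re
from dataclasses import dataclass


@dataclass(frozen=True)
class DraidSpec:
    parity: int    # parity sectors per stripe
    data: int      # data sectors per stripe
    children: int  # physical disks in the vdev
    spares: int    # distributed spares

    def data_fraction(self):
        """Rough share of raw capacity left for data (ignores padding)."""
        usable = self.children - self.spares
        return usable / self.children * self.data / (self.data + self.parity)


def parse_draid(spec):
    """Parse strings such as 'draid2:4d:1s:11c'; field order does not matter."""
    head, *fields = spec.split(":")
    match = re.fullmatch(r"draid([123])?", head)
    if match is None:
        raise ValueError(f"not a dRAID spec: {spec!r}")
    parity = int(match.group(1) or 1)  # a bare 'draid' means single parity
    values = {"s": 0}
    for field in fields:
        fm = re.fullmatch(r"(\d+)([dcs])", field)
        if fm is None:
            raise ValueError(f"unrecognized field: {field!r}")
        values[fm.group(2)] = int(fm.group(1))
    if "d" not in values or "c" not in values:
        # Assumption of this sketch: zpool itself can fill these in.
        raise ValueError("this sketch needs explicit d and c fields")
    result = DraidSpec(parity, values["d"], values["c"], values["s"])
    if result.children < result.data + result.parity + result.spares:
        raise ValueError("too few disks for one stripe plus spares")
    return result
```

For the pool above, `parse_draid("draid2:4d:1s:11c")` yields 2 parity, 4 data, 11 children, and 1 spare, and `data_fraction()` comes out at roughly 0.61 of raw capacity under these simplifying assumptions.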

    0 1 2 3 4 5 6 7 8 9 A
    s P P D D D D P P D D
    D s D P P D D D D P P
    D D s D D P P D D D D
    P P D s D D D P P D D
    D D . . s . . . . . .
    . . . . . s . . . . .
    . . . . . . s . . . .
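The idealized layout can be generated with a few lines of Python. This is only a toy model of the simplified diagram, under the assumption that the spare cell walks diagonally and the remaining cells are filled with a continuous run of parity (P) and data (D) sectors. The function name `ideal_layout` is hypothetical, and real dRAID layouts are permuted rather than laid out this way.

```python
def ideal_layout(disks=11, parity=2, data=4, rows=4):
    """Return the rows of the simplified draid diagram as strings."""
    stripe = "P" * parity + "D" * data  # one logical stripe, e.g. PPDDDD
    out = []
    k = 0  # running index into the endless PPDDDD PPDDDD ... sequence
    for r in range(rows):
        row = []
        for c in range(disks):
            if c == r % disks:
                row.append("s")  # this row's distributed spare sector
            else:
                row.append(stripe[k % len(stripe)])
                k += 1
        out.append(" ".join(row))
    return out


if __name__ == "__main__":
    print("0 1 2 3 4 5 6 7 8 9 A")
    for line in ideal_layout():
        print(line)
```

Each row holds ten parity and data sectors plus one spare, so six-sector stripes regularly wrap from one row to the next.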

Effectively, dRAID takes the concept of "diagonal parity" RAID one step further. The first parity RAID topology wasn't RAID5 - it was RAID3, in which parity lived on a fixed drive rather than being distributed throughout the array. RAID5 did away with the fixed parity drive and distributed parity across all of the array's disks instead, which made random writes much faster than on RAID3, since writes no longer all bottlenecked on a fixed parity disk.

dRAID takes this concept - distributing parity across all disks, rather than lumping it all onto one or two fixed disks - and extends it to spares. If a disk fails in a dRAID vdev, the parity and data sectors that lived on the dead disk are copied to the reserved spare sector(s) for each affected stripe.

Let's take the simplified diagram above and see what happens if we fail a disk out of the array. The initial failure leaves holes in most of the data groups (in this simplified diagram, stripes):

    0 1 2   4 5 6 7 8 9 A
    s P P   D D D P P D D
    D s D   P D D D D P P
    D D s   D P P D D D D
    P P D   D D D P P D D
    . . .

But when we resilver, we do so onto the previously reserved spare capacity:

    0 1 2   4 5 6 7 8 9 A
    D P P   D D D P P D D
    D P D   P D D D D P P
    D D D   D P P D D D D
    P P D   D D D P P D D
    . . .
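The failure and the resilver onto spare capacity can be mimicked on the same toy layout. The sketch below is illustrative only: `ideal_layout`, `fail_disk`, and `resilver` are hypothetical helper names, and the lost sector is simply copied from the pre-failure layout, whereas a real resilver reconstructs it from the surviving data and parity.

```python
def ideal_layout(disks=11, parity=2, data=4, rows=4):
    """Toy layout: diagonal spare, remaining cells filled with PPDDDD runs."""
    stripe = "P" * parity + "D" * data
    layout, k = [], 0
    for r in range(rows):
        row = []
        for c in range(disks):
            if c == r % disks:
                row.append("s")
            else:
                row.append(stripe[k % len(stripe)])
                k += 1
        layout.append(row)
    return layout


def fail_disk(layout, failed):
    """Blank out one column, leaving a hole in every row."""
    return [[" " if c == failed else cell for c, cell in enumerate(row)]
            for row in layout]


def resilver(layout, failed):
    """Move each sector lost with the failed disk into its row's spare cell."""
    rebuilt = []
    for row in layout:
        lost = row[failed]  # stand-in for reconstruction from parity
        new_row = [" " if c == failed else cell for c, cell in enumerate(row)]
        if lost != "s":  # if the spare itself was lost, nothing needs moving
            new_row[new_row.index("s")] = lost
        rebuilt.append(new_row)
    return rebuilt
```

Calling `resilver(ideal_layout(), 3)` fills the spare cell in the first three rows and leaves the fourth untouched, because in that row the failed disk held only the spare sector.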

Please note that these diagrams are simplified. The full picture involves groups, slices, and rows that we aren't going to get into here. The logical layout is also randomly permuted so that things are distributed more evenly across the drives based on the offset. Those interested in the hairiest details are encouraged to look at the details described in the original code.

It's also worth noting that dRAID requires fixed stripe widths, unlike the dynamic widths of traditional RAIDz vdevs. If we are using 4kn disks, a draid2:4:1 vdev like the one shown above requires 24KiB on disk for each metadata block, whereas a traditional six-wide RAIDz2 vdev needs only 12KiB. This discrepancy gets worse as the values of d+p get higher - a draid2:8:1 requires a whopping 40KiB for the same metadata block!

For this reason, the special allocation vdev is very useful in pools with dRAID vdevs - when a pool with a draid2:8:1 and a three-wide special needs to store a 4KiB metadata block, it does so in only 12KiB on the special instead of 40KiB on the draid2:8:1.
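The allocation figures quoted here follow from simple arithmetic. The sketch below assumes 4KiB sectors and treats the three-wide special as a three-way mirror, which is an assumption on our part. The function names are hypothetical, and the RAIDz formula is a simplified rendering of how RAIDz sizes an allocation (data sectors, plus parity per row, rounded up to a multiple of parity + 1).

```python
import math

SECTOR = 4096  # bytes per sector on 4kn disks


def draid_asize(block, data, parity, sector=SECTOR):
    """dRAID writes whole fixed-width stripes of data + parity sectors."""
    stripes = math.ceil(block / (data * sector))
    return stripes * (data + parity) * sector


def raidz_asize(block, width, parity, sector=SECTOR):
    """RAIDz uses dynamic widths: only as many sectors as the block needs."""
    data_sectors = math.ceil(block / sector)
    parity_sectors = parity * math.ceil(data_sectors / (width - parity))
    total = data_sectors + parity_sectors
    total = math.ceil(total / (parity + 1)) * (parity + 1)  # padding rule
    return total * sector


def mirror_asize(block, copies, sector=SECTOR):
    """A mirror stores one full copy of the block per disk."""
    return copies * math.ceil(block / sector) * sector


if __name__ == "__main__":
    kib = 1024
    print(draid_asize(4 * kib, data=4, parity=2) // kib)   # draid2:4:1
    print(raidz_asize(4 * kib, width=6, parity=2) // kib)  # six-wide RAIDz2
    print(draid_asize(4 * kib, data=8, parity=2) // kib)   # draid2:8:1
    print(mirror_asize(4 * kib, copies=3) // kib)          # three-wide special
```

Under these assumptions the four calls reproduce the 24KiB, 12KiB, 40KiB, and 12KiB figures from the text.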

dRAID performance, fault tolerance, and recovery

This graph shows observed resilvering times for a 90-disk pool. The dark blue line at the top is the time to resilver onto a fixed hotspare disk; the colored lines beneath it show times to resilver onto distributed spare capacity. OpenZFS docs

For the most part, a dRAID vdev performs similarly to an equivalent group of traditional vdevs - for example, a draid1:2:0 on nine disks performs near-equivalently to a pool of three 3-wide RAIDz1 vdevs. Fault tolerance is also similar - you're guaranteed to survive a single failure with p=1, just as you are with the RAIDz1 vdevs.


Note that we said fault tolerance is similar, not identical. A traditional pool of three 3-wide RAIDz1 vdevs is only guaranteed to survive a single disk failure, but it will probably survive a second - as long as the second disk to fail isn't part of the same vdev as the first, everything is fine.

In a nine-disk draid1:2, a second disk failure will almost certainly kill the vdev (and the pool along with it) if that failure happens before resilvering completes. Since there are no fixed groups for individual stripes, a second disk failure is very likely to knock out additional sectors in already-degraded stripes, regardless of which disk fails second.

That somewhat reduced fault tolerance is compensated for by drastically faster resilver times. In the graph above, we can see that in a pool of ninety 16TB disks, resilvering onto a traditional, fixed spare takes about 30 hours no matter how we configure the dRAID vdev - but resilvering onto distributed spare capacity can take as little as one hour.

This is mostly because resilvering onto a distributed spare splits the write load among all of the surviving disks. When resilvering onto a traditional spare, the spare disk itself is the bottleneck - reads come from all disks in the vdev, but the writes must all be completed by the spare. When resilvering onto distributed spare capacity, both the read and the write workloads are divided among all surviving disks.

The distributed resilver can also be a sequential resilver rather than a healing resilver - meaning that ZFS can simply copy over all affected sectors without worrying about which blocks those sectors belong to. Healing resilvers, by contrast, must scan the entire block tree, which results in a random read workload rather than a sequential one.

When a physical replacement for the failed disk is added to the pool, that resilver operation will be healing, not sequential - and it will bottleneck on the write performance of the single replacement disk rather than of the entire vdev. But the time to complete that operation is far less crucial, because the vdev is not in a degraded state to begin with.
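The difference in second-failure behavior can be illustrated by counting. The Python sketch below is a toy model, not a description of the actual dRAID allocator: it enumerates every pair of failed disks for three fixed 3-wide RAIDz1 groups, and models draid1:2 on nine disks by shuffling the disks into fresh groups of three for every row. The function names, the per-row shuffle, the row count, and the seed are all assumptions made for illustration.

```python
import itertools
import random


def raidz1_survival(groups):
    """Fraction of two-disk failure pairs that hit two different vdevs."""
    vdev_of = {disk: i for i, group in enumerate(groups) for disk in group}
    pairs = list(itertools.combinations(sorted(vdev_of), 2))
    ok = sum(1 for a, b in pairs if vdev_of[a] != vdev_of[b])
    return ok / len(pairs)


def draid1_survival(n_disks=9, width=3, rows=200, seed=0):
    """Fraction of disk pairs that never share a stripe in the toy layout."""
    rng = random.Random(seed)
    shared = set()  # pairs of disks that appear together in some stripe
    for _ in range(rows):
        order = list(range(n_disks))
        rng.shuffle(order)  # toy stand-in for dRAID's permuted layout
        for i in range(0, n_disks, width):
            group = sorted(order[i:i + width])
            shared.update(itertools.combinations(group, 2))
    pairs = list(itertools.combinations(range(n_disks), 2))
    ok = sum(1 for pair in pairs if pair not in shared)
    return ok / len(pairs)


if __name__ == "__main__":
    fixed_groups = [(0, 1, 2), (3, 4, 5), (6, 7, 8)]
    print(raidz1_survival(fixed_groups))
    print(draid1_survival())
```

With fixed groups, 27 of the 36 possible pairs land in different vdevs, so the pool survives three times out of four. In the shuffled model, after a couple of hundred rows essentially every pair of disks has shared at least one single-parity stripe, so any second failure costs data.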
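A back-of-the-envelope calculation shows why the spare disk is the bottleneck. The sketch below assumes a sustained write rate of 150 MB/s per disk, which is a hypothetical figure rather than one taken from the OpenZFS measurements, and it ignores read load, parity reconstruction, and seek overhead. The distributed figure is therefore an idealized lower bound, not a prediction.

```python
def rebuild_hours(capacity_tb, write_mb_s, writers=1):
    """Hours needed to write one disk's worth of data across `writers` disks."""
    total_bytes = capacity_tb * 1e12
    bytes_per_second = write_mb_s * 1e6 * writers
    return total_bytes / bytes_per_second / 3600


if __name__ == "__main__":
    # Fixed hotspare: every rebuilt sector lands on one disk.
    print(round(rebuild_hours(16, 150), 1))
    # Distributed spare: writes spread over the 89 surviving disks.
    print(round(rebuild_hours(16, 150, writers=89), 2))
```

At the assumed rate, a single 16TB spare needs just under 30 hours of pure writing, in line with the roughly 30 hours reported for a fixed spare, while spreading the same writes over 89 disks brings the idealized figure well below an hour.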


Distributed RAID vdevs are mostly intended for large storage servers - OpenZFS draid design and testing revolved largely around 90-disk systems. At smaller scale, traditional vdevs and spares remain as useful as they ever were.

We especially caution storage newbies to be careful with draid - it is a significantly more complex layout than a pool with traditional vdevs. The fast resilvering is fantastic, but because of its necessarily fixed-length stripes, draid takes a hit in both compression levels and some performance scenarios.

As conventional disks continue to get larger without significant performance increases, draid and its fast resilvering may become desirable even on smaller systems - but it will take some time to figure out exactly where the sweet spot begins. In the meantime, please remember that RAID is not a backup - and that includes draid!
