Fake boost of redundancy in raidzN on a limited number of drives/LUNs

Discussion:

Jim Klimov

2013-09-27 15:20:03 UTC

Hello all,

Seeding another discussion here that I hope would be interesting
and enlightening :)

As a drop of background, I am planning to build a home-NAS based
on 4*4Tb HDDs, and after some scare in blogs about ever-growing
probabilities of encountering an error as the disk and dataset sizes
grow while BER remains the same, and the resilver/scrub times might
increase indefinitely, I got myself wondering about this plan below:

I want to maximize my storage, i.e. consider a raidz1 with 3*4Tb
data plus 1*4Tb parity (4 disk total is a chassis limitation).
However, as I am scared by the "FUD" (imagined or real) around the
possibility of uncorrectable errors, I want to have more redundancy.

So I can slice my disks, i.e. have 8*2Tb slices, and organize them
as raidz2, or even make a raidz3 from 12*1.3Tb slices (3 per disk).
This spends the same amount of space for redundancy, gives the same
amount of user-data, and allows higher resilience than raidz1 against
single-sector-per-block errors (while having the same low resilience
against full-single-disk failures and replacement).

Of course, the tradeoff would be higher randomity of IO's probably
out of control scope for ZFS/sd-driver queuing and other optimizations,
so on HDDs generally any sort of IO performance would likely suck.
I might expect that on a home-NAS used to store household multimedia
primarily, with SSD-based L2ARC and RAM ARC for anything cacheable
(i.e. VM images, if any) this might be or not be a fatal performance
killer... On the other hand, I have experience with many small systems
where components of an rpool mirror and of an up-to-4-disk data pool
acceptably happily live on the same four hardware HDDs (though these
are indeed parts of different pools and a disk doesn't internally
compete to serve pieces of the same IO request to one pool).

For the sake of completeness, I'd ask the list members for real or
theoretical expectations (if anyone has evaluated such scenarios) of
general performance, reliability and rebuild/resilver/scrub times?

However, I do expect the general answer "this will tank on HDDs",
so the real interesting question is whether such layouts might be
benefitial on all-SSD pools (no/negligible random-IO latency) built
from just a few SSDs? Is there a grain of benefit here?

On a side-note, did anyone actually encounter single-sector errors
on SSDs (manifested as ZFS checksum mismatches) without any other
major problems with the device itself? :)

Thanks in advance for a constructive discussion,
//Jim Klimov

Richard Elling

2013-09-27 19:59:06 UTC

Permalink

Post by Jim Klimov
Hello all,
Seeding another discussion here that I hope would be interesting
and enlightening :)
As a drop of background, I am planning to build a home-NAS based
on 4*4Tb HDDs, and after some scare in blogs about ever-growing
probabilities of encountering an error as the disk and dataset sizes
grow while BER remains the same, and the resilver/scrub times might
I want to maximize my storage, i.e. consider a raidz1 with 3*4Tb
data plus 1*4Tb parity (4 disk total is a chassis limitation).
However, as I am scared by the "FUD" (imagined or real) around the
possibility of uncorrectable errors, I want to have more redundancy.
So I can slice my disks, i.e. have 8*2Tb slices, and organize them
as raidz2, or even make a raidz3 from 12*1.3Tb slices (3 per disk).
This spends the same amount of space for redundancy, gives the same
amount of user-data, and allows higher resilience than raidz1 against
single-sector-per-block errors (while having the same low resilience
against full-single-disk failures and replacement).

Yes, you can do this. IMHO it is a total waste of time. You are better off treating the disk as a FRU, not a slice as a FRU. KISS.

-- richard

Post by Jim Klimov
Of course, the tradeoff would be higher randomity of IO's probably
out of control scope for ZFS/sd-driver queuing and other optimizations,
so on HDDs generally any sort of IO performance would likely suck.
I might expect that on a home-NAS used to store household multimedia
primarily, with SSD-based L2ARC and RAM ARC for anything cacheable
(i.e. VM images, if any) this might be or not be a fatal performance
killer... On the other hand, I have experience with many small systems
where components of an rpool mirror and of an up-to-4-disk data pool
acceptably happily live on the same four hardware HDDs (though these
are indeed parts of different pools and a disk doesn't internally
compete to serve pieces of the same IO request to one pool).
For the sake of completeness, I'd ask the list members for real or
theoretical expectations (if anyone has evaluated such scenarios) of
general performance, reliability and rebuild/resilver/scrub times?
However, I do expect the general answer "this will tank on HDDs",
so the real interesting question is whether such layouts might be
benefitial on all-SSD pools (no/negligible random-IO latency) built
from just a few SSDs? Is there a grain of benefit here?
On a side-note, did anyone actually encounter single-sector errors
on SSDs (manifested as ZFS checksum mismatches) without any other
major problems with the device itself? :)
Thanks in advance for a constructive discussion,
//Jim Klimov
-------------------------------------------
illumos-zfs
Archives: https://www.listbox.com/member/archive/182191/=now
RSS Feed: https://www.listbox.com/member/archive/rss/182191/22820713-4fad4b89
Modify Your Subscription: https://www.listbox.com/member/?&
Powered by Listbox: http://www.listbox.com

Sam Zaydel

2013-09-27 20:29:27 UTC

Permalink

Jim,

I think an interesting question that comes to my mind is whether you are
actually gaining anything other than complexity? Would this configuration
be even more likely to fail? If a disk fails, chances are, even if only one
part failed, you are going to be replacing a disk anyway, you don't replace
parts of a disk drive (any more). Would this conceptual scheme buy you
anything at all when a disk begins to fail? I mean if you have to replace
whole disk, you may be loosing all the slices on that disk, so your parity
scheme is not likely to do you any good. Or so it seems to me.

S.

On Fri, Sep 27, 2013 at 12:59 PM, Richard Elling

Post by Richard Elling

Yes, you can do this. IMHO it is a total waste of time. You are better off
treating the disk as a FRU, not a slice as a FRU. KISS.
-- richard

https://www.listbox.com/member/archive/rss/182191/22820713-4fad4b89

Post by Jim Klimov
Modify Your Subscription: https://www.listbox.com/member/?&
Powered by Listbox: http://www.listbox.com

-------------------------------------------
illumos-zfs
Archives: https://www.listbox.com/member/archive/182191/=now
https://www.listbox.com/member/archive/rss/182191/24342081-7731472e
https://www.listbox.com/member/?&
Powered by Listbox: http://www.listbox.com

--
Join the geek side, we have Ï!

Please feel free to connect with me on LinkedIn.
http://www.linkedin.com/in/samzaydel

-------------------------------------------
illumos-zfs
Archives: https://www.listbox.com/member/archive/182191/=now
RSS Feed: https://www.listbox.com/member/archive/rss/182191/23047029-187a0c8d
Modify Your Subscription: https://www.listbox.com/member/?member_id=23047029&id_secret=23047029-2e85923f
Powered by Listbox: http://www.listbox.com

Jim Klimov

2013-09-27 21:50:07 UTC

Permalink

Post by Sam Zaydel
Jim,
I think an interesting question that comes to my mind is whether you are
actually gaining anything other than complexity? Would this
configuration be even more likely to fail? If a disk fails, chances are,
even if only one part failed, you are going to be replacing a disk
anyway, you don't replace parts of a disk drive (any more). Would this
conceptual scheme buy you anything at all when a disk begins to fail? I
mean if you have to replace whole disk, you may be loosing all the
slices on that disk, so your parity scheme is not likely to do you any
good. Or so it seems to me.

Well, that's part of what I wanted to learn (or find out what people
think) when I asked the question.

I think there are a number of failure modes, basically two: failure of
the whole device (i.e. motor dead, chip burnt) or failure of its data
retention in some areas (scratched surface, on-wire noise making the
disk over-write random stuff at a random location which I think may
have happened on my older home-NAS some years ago, brownian movement
or outer-space radiation, etc.) with the latter not being part of the
whole device's oncoming death but rather statistic noise, a one-off
event.

In case of a full-disk replacement - yes, this scheme is an overly
complicated one with no benefits to redundancy and likely drawbacks
to performance.

But fake-RAIDing might in fact protect against one-time errors such as
a random-location overwrite or a crash-landing, because the disk head
mechanically can't be in many positions at once. Well, there are a
number of heads over several platters, but still - fixed places at a
fixed time. So in case of per-sector errors of whatever nature, such
a pool has more than one parity piece to secure my data - and more
time to actually find this error by eventual IO or scrubbing, while
data is not protected by losing the only parity piece (in raidz1).

Maybe (likely?) I am wrong in some of this logic; and would like to
be disproved if so :)

//Jim

Sam Zaydel

2013-09-27 22:03:21 UTC

Permalink

I suspect all the problems that you mentioned that are effectively some
random one off or, fairly randomly occurring non-one off even should be
covered by checksums, parity and perhaps copies=XX? So, I would think,
unless I am missing something else, this is an interesting idea but not
really one that is practical. Seems like an interesting experiment though.
Tough part is of course controlling for failure modes.

Post by Jim Klimov

Well, that's part of what I wanted to learn (or find out what people
think) when I asked the question.
I think there are a number of failure modes, basically two: failure of
the whole device (i.e. motor dead, chip burnt) or failure of its data
retention in some areas (scratched surface, on-wire noise making the
disk over-write random stuff at a random location which I think may
have happened on my older home-NAS some years ago, brownian movement
or outer-space radiation, etc.) with the latter not being part of the
whole device's oncoming death but rather statistic noise, a one-off
event.
In case of a full-disk replacement - yes, this scheme is an overly
complicated one with no benefits to redundancy and likely drawbacks
to performance.
But fake-RAIDing might in fact protect against one-time errors such as
a random-location overwrite or a crash-landing, because the disk head
mechanically can't be in many positions at once. Well, there are a
number of heads over several platters, but still - fixed places at a
fixed time. So in case of per-sector errors of whatever nature, such
a pool has more than one parity piece to secure my data - and more
time to actually find this error by eventual IO or scrubbing, while
data is not protected by losing the only parity piece (in raidz1).
Maybe (likely?) I am wrong in some of this logic; and would like to
be disproved if so :)
//Jim
------------------------------**-------------
illumos-zfs
Archives: https://www.listbox.com/**member/archive/182191/=now<https://www.listbox.com/member/archive/182191/=now>
RSS Feed: https://www.listbox.com/**member/archive/rss/182191/**
24342081-7731472e<https://www.listbox.com/member/archive/rss/182191/24342081-7731472e>
Modify Your Subscription: https://www.listbox.com/**
member/?&id_**secret=24342081-16f0b054<https://www.listbox.com/member/?&>
Powered by Listbox: http://www.listbox.com

Andrew Galloway

2013-09-27 22:54:49 UTC

Permalink

Post by Sam Zaydel
I suspect all the problems that you mentioned that are effectively some
random one off or, fairly randomly occurring non-one off even should be
covered by checksums, parity and perhaps copies=XX? So, I would think,
unless I am missing something else, this is an interesting idea but not
really one that is practical. Seems like an interesting experiment though.
Tough part is of course controlling for failure modes.

Post by Jim Klimov

Well, that's part of what I wanted to learn (or find out what people
think) when I asked the question.
I think there are a number of failure modes, basically two: failure of
the whole device (i.e. motor dead, chip burnt) or failure of its data
retention in some areas (scratched surface, on-wire noise making the
disk over-write random stuff at a random location which I think may
have happened on my older home-NAS some years ago, brownian movement
or outer-space radiation, etc.) with the latter not being part of the
whole device's oncoming death but rather statistic noise, a one-off
event.
In case of a full-disk replacement - yes, this scheme is an overly
complicated one with no benefits to redundancy and likely drawbacks
to performance.
But fake-RAIDing might in fact protect against one-time errors such as
a random-location overwrite or a crash-landing, because the disk head
mechanically can't be in many positions at once. Well, there are a
number of heads over several platters, but still - fixed places at a
fixed time. So in case of per-sector errors of whatever nature, such
a pool has more than one parity piece to secure my data - and more
time to actually find this error by eventual IO or scrubbing, while
data is not protected by losing the only parity piece (in raidz1).
Maybe (likely?) I am wrong in some of this logic; and would like to
be disproved if so :)
//Jim
------------------------------**-------------
illumos-zfs
Archives: https://www.listbox.com/**member/archive/182191/=now<https://www.listbox.com/member/archive/182191/=now>
RSS Feed: https://www.listbox.com/**member/archive/rss/182191/**
24342081-7731472e<https://www.listbox.com/member/archive/rss/182191/24342081-7731472e>
Modify Your Subscription: https://www.listbox.com/**member/?&id_**
secret=24342081-16f0b054 <https://www.listbox.com/member/?&>
Powered by Listbox: http://www.listbox.com

--
Join the geek side, we have Ï!
Please feel free to connect with me on LinkedIn.
http://www.linkedin.com/in/samzaydel
*illumos-zfs* | Archives<https://www.listbox.com/member/archive/182191/=now>
<https://www.listbox.com/member/archive/rss/182191/24484421-62d25f20> |
Modify<https://www.listbox.com/member/?&>Your Subscription
<http://www.listbox.com>

-------------------------------------------
illumos-zfs
Archives: https://www.listbox.com/member/archive/182191/=now
RSS Feed: https://www.listbox.com/member/archive/rss/182191/23047029-187a0c8d
Modify Your Subscription: https://www.listbox.com/member/?member_id=23047029&id_secret=23047029-2e85923f
Powered by Listbox: http://www.listbox.com

Gregg Wonderly

2013-09-28 02:38:44 UTC

Permalink

I've had a pair of machines, each with 4 pairs of 2TB drives in my house for the past 3 years. These machines use the same 8 device 2x4 SAS controllers that only do 3GB/s SATA-2. I put each half the paired drives on one side of the controllers so that if one controller dies, I still have access to a complete set of 4 disks as a pool. If the controller biffs all 4 drives on one side, I still have the other side to repair/resilver from. It has worked well for me. I use rsync to copy between the machines. That gives me 4 copies of 8TB of space to try and manage failure of drives, hardware etc. Sometimes, I take one of the machines on the road with me if I will be away from the house for a while so that I separate the two machines to try and maintain at least one working set of data.

At some point, I will trim this back to about 4TB of space. What I have on my machines is 5+ years of photos and HD video of my kids events in school/life, and it's pretty important to me. I, personally, don't like the thought of raidz for this kind of data, because I can lose a lot of data on a simultaneous, multi disk failure with raidz, whereas I feel that is less probable for mirrored pairs of smaller size.

I've had some issues in building these systems with not having enough airflow around the drives and losing many drives over the course of a fairly short period of time. That is the number one issue to pay attention to. Use fans, everywhere to keep the air moving!

I used this drive bay in building my boxes, http://www.newegg.com/Product/Product.aspx?Item=N82E16817994113, and the 3 in 2 version as well for a total of 4+4+3=11
drives in an 8 5-1/4" bay case. I put one drive inside the case for a total of 12 drives which gave me a mirrored root, 4x2 mirrored pool for my data, and 2 spare drive bays that I use for the mirrored array of 4 pairs.

Those bays have nice, large fans to keep air moving. I have large fans in the case. The controller I used was http://www.newegg.com/Product/Product.aspx?Item=N82E16816117157, which is no longer available. I have a pair of "spare" controllers on the shelf if needed. I don't need large drives/space at this point, so I won't be worrying about moving to 3TB or 4TB drives and having to get new controllers that will support that size drives.

There are several different 8-bay or even a few more internal cases. I would suggest that you pick one that supports 120mm fans so that you can have a nice large amount of air moving with a quiet environment.

Finally, I'd suggest looking at using ZFS on Linux or FreeBSD to have a better opportunity to use controllers that are cost effective for you. I'm on Illumos, but the next time I rearrange my space, I will be switching to Linux for the simple reason that Illumos and Solaris is just not going to be a sustainable platform for the long term from my perspective. No one is really actively working on anything except odd-ball bug fixes. Usability issues on Solaris/Illumos are all over the map for many different things, and these problems do not exist on Linux where the user still seems to be part of the focus for most distros.

Gregg

In my experience, "one-off" transient errors do not occur so much more frequently than more serious "disk is dead" or "common upward component is dead" that the complexity you'd be adding pays dividends with the protection from those fairly infrequent events it protects against. As you say, the more deadly and common problems of entire disk death are not made any better by this setup. Also worth noting that some of the errors that such a setup would increase defense against are the sort that don't generally lead to total data loss even in their worst forms, merely loss of a file or two.
Especially in a home environment, I would suggest you go for raidz1 -- and regular BACKUPS. I never advise (and refuse to support) raidz1 used in enterprise space, due to the different in load.. I'm a pretty heavy home user, and yet my home SAN rarely manages to blip usage graphs above flatline, and then only for short periods. Raidz1 or mirrors + regular backups of important files is plenty safe for any home use I can imagine. The risk of second disk death while resilvering a first is still there in a home environment, but the combination of less other type of load & the honest fact that you have to have backups somewhere else anyway combined with the simple fact that the majority of important home data is not typically stuff that has a super time sensitive 'time to recover' value (weeks is probably fine, I could lose my array and I wouldn't bother to restore most of it for quite awhile) is to me, at least, an acceptable risk.
- Andrew
I suspect all the problems that you mentioned that are effectively some random one off or, fairly randomly occurring non-one off even should be covered by checksums, parity and perhaps copies=XX? So, I would think, unless I am missing something else, this is an interesting idea but not really one that is practical. Seems like an interesting experiment though. Tough part is of course controlling for failure modes.
Jim,
I think an interesting question that comes to my mind is whether you are
actually gaining anything other than complexity? Would this
configuration be even more likely to fail? If a disk fails, chances are,
even if only one part failed, you are going to be replacing a disk
anyway, you don't replace parts of a disk drive (any more). Would this
conceptual scheme buy you anything at all when a disk begins to fail? I
mean if you have to replace whole disk, you may be loosing all the
slices on that disk, so your parity scheme is not likely to do you any
good. Or so it seems to me.
Well, that's part of what I wanted to learn (or find out what people
think) when I asked the question.
I think there are a number of failure modes, basically two: failure of
the whole device (i.e. motor dead, chip burnt) or failure of its data
retention in some areas (scratched surface, on-wire noise making the
disk over-write random stuff at a random location which I think may
have happened on my older home-NAS some years ago, brownian movement
or outer-space radiation, etc.) with the latter not being part of the
whole device's oncoming death but rather statistic noise, a one-off
event.
In case of a full-disk replacement - yes, this scheme is an overly
complicated one with no benefits to redundancy and likely drawbacks
to performance.
But fake-RAIDing might in fact protect against one-time errors such as
a random-location overwrite or a crash-landing, because the disk head
mechanically can't be in many positions at once. Well, there are a
number of heads over several platters, but still - fixed places at a
fixed time. So in case of per-sector errors of whatever nature, such
a pool has more than one parity piece to secure my data - and more
time to actually find this error by eventual IO or scrubbing, while
data is not protected by losing the only parity piece (in raidz1).
Maybe (likely?) I am wrong in some of this logic; and would like to
be disproved if so :)
//Jim
-------------------------------------------
illumos-zfs
Archives: https://www.listbox.com/member/archive/182191/=now
RSS Feed: https://www.listbox.com/member/archive/rss/182191/24342081-7731472e
Modify Your Subscription: https://www.listbox.com/member/?&
Powered by Listbox: http://www.listbox.com
--
Join the geek side, we have Ï!
Please feel free to connect with me on LinkedIn. http://www.linkedin.com/in/samzaydel
illumos-zfs | Archives | Modify Your Subscription
illumos-zfs | Archives | Modify Your Subscription

Paul B. Henson

2013-09-28 03:37:30 UTC

Permalink

Post by Gregg Wonderly
be switching to Linux for the simple reason that Illumos and Solaris
is just not going to be a sustainable platform for the long term from
my perspective. No one is really actively working on anything except
odd-ball bug fixes.

Hmm, I believe github would disagree with you on that point:

https://github.com/illumos/illumos-gate/commits/master

Gregg Wonderly

2013-09-28 14:40:32 UTC

Permalink

I am not seeing updates pushed in my direction I guess. I think that there may be stuff going on, but it just doesn't feel like it's moving out and dealing with the issues that make Illumos seem so problematic to me. There is a difference between committing code and creating value and distributing in a vibrant community.

Gregg

Post by Paul B. Henson

https://github.com/illumos/illumos-gate/commits/master
-------------------------------------------
illumos-zfs
Archives: https://www.listbox.com/member/archive/182191/=now
RSS Feed: https://www.listbox.com/member/archive/rss/182191/24100479-5c4b6e69
Modify Your Subscription: https://www.listbox.com/member/?&
Powered by Listbox: http://www.listbox.com

Garrett D'Amore

2013-10-08 15:30:15 UTC

Permalink

Post by Gregg Wonderly
I am not seeing updates pushed in my direction I guess. I think that there may be stuff going on, but it just doesn't feel like it's moving out and dealing with the issues that make Illumos seem so problematic to me. There is a difference between committing code and creating value and distributing in a vibrant community.

Think of illumos more like "linux kernel" and you will understand. The value is being driven by the distributions -- SmartOS, OmniOS, NexentaStor, etc. These are the "debian", "Ubuntu", and "RedHat" of the Linux community. And like Linux, there are lots of other distributions that I've not mentioned, and lots of people using illumos in products or settings that don't talk about it, but focus instead on just using it to solve their business problems.

The problem here is one of branding and brand recognition, and you're right, illumos has a lot to do in this particular regard.

- Garrett

Post by Gregg Wonderly
Gregg

Post by Paul B. Henson

-------------------------------------------
illumos-zfs
Archives: https://www.listbox.com/member/archive/182191/=now
RSS Feed: https://www.listbox.com/member/archive/rss/182191/22035932-85c5d227
Modify Your Subscription: https://www.listbox.com/member/?&
Powered by Listbox: http://www.listbox.com