Discussion:
vdev distribution
aurfalien
2013-08-06 01:53:12 UTC
Permalink
Hi all,

Is it important to align vdevs in a multi JBOD scenario?

What I have are 3 JBODs of 12 disks each.

Where 2 JBODs are connected to one LSI PCIe HBA card and the 3rd is connected to a second LSI PCIe HBA card.

All the HBAs are LSI SAS2308 6G cards and all the disks are Seagate Constellation ES.2 3TB drives, which are also 6G.

Unsure if the card/disk info matters here, but I thought I'd mention it.

I plan on creating vdevs of 6 disks each at RAIDZ2 and striping all 6 vdevs together.

I also plan on testing 12 disks per vdev at RAIDZ3 and striping all 3 vdevs together.

Nothing is concrete until I test my actual workload to determine which config works best.

Should I contain my vdevs to the disks in their respective JBODs, or is it OK to create vdevs of disks from various JBODs?

What pitfalls could the latter config produce?

- aurf
Richard Elling
2013-08-06 02:19:05 UTC
Permalink
Post by aurfalien
Is it important to align vdevs in a multi JBOD scenario?
What I have are 3 JBODs of 12 disks each.
Should I contain my vdevs to the disks in their respective JBODs, or is it OK to create vdevs of disks from various JBODs?
For a single pool configuration, it is fine to spread the disks in a raidz* set across the JBODs.
This delivers the best diversity, and with this many disks it is easy to build in JBOD failure resilience.
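As a rough sketch (pool and device names are hypothetical, substitute your own), one way to lay out the 6-disk raidz2 vdevs is to take two disks from each JBOD per vdev, so losing an entire JBOD costs any single raidz2 vdev only two disks:

  # two disks from each of the three JBODs in every raidz2 vdev
  zpool create tank \
      raidz2 c1t0d0 c1t1d0 c2t0d0 c2t1d0 c3t0d0 c3t1d0 \
      raidz2 c1t2d0 c1t3d0 c2t2d0 c2t3d0 c3t2d0 c3t3d0
  # ...repeat the pattern for the remaining four vdevs, two disks per JBOD each time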
Post by aurfalien
What pitfalls could the latter config produce?
It will work just fine.
-- richard

Sam Zaydel
2013-08-06 02:20:00 UTC
Permalink
Consider staying away from large vdevs with such large drives. Resilvering
when disks fail will take forever, especially if the pool is very busy.
You may be in a situation where weeks pass before a disk is fully
resilvered.
Richard Elling
2013-08-06 02:28:06 UTC
Permalink
Post by Sam Zaydel
Consider staying away from large vdevs with such large drives. Resilvering when disks fail will take forever, especially if the pool is very busy. You may be in a situation where weeks pass before a disk is fully resilvered.
Resilver time is mostly bound by the resilvering disk, which does not change as you
go to more disks per set. For modern ZFS, the resilver throttle can also impact the
resilvering time. But if you disable the throttle, the bottleneck will clearly be the ability
of the resilvering disk to write. It is not unexpected that as the number of disks in the
set increases, the work of the surviving disks is less. For example, a 4+1 raidz set
will see something like:
100% busy write on the resilvering disk
25% busy read on the surviving disks

Similarly, for a 6+2 raidz2 set:
100% busy write on the resilvering disk
12.5% busy read on the surviving disks

iostat -x will clearly show this.
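For example, during a resilver you could watch this with something like the following (illumos-style iostat; the pool name is hypothetical):

  iostat -xn 5       # %b shows per-disk busy, kr/s and kw/s the read/write throughput
  zpool status tank  # shows resilver progress for the pool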
-- richard
Sam Zaydel
2013-08-06 02:41:09 UTC
Permalink
I think the argument I would make is that more disks per vdev means fewer
vdevs, which in my experience results in more load compared to the same
number of disks across a greater number of vdevs. This, in my observation,
translates to longer resilver periods.

S.
Richard Elling
2013-08-06 02:49:47 UTC
Permalink
Post by Sam Zaydel
I think the argument I would make is that more disks per vdev means fewer vdevs, which in my experience results in more load compared to the same number of disks across a greater number of vdevs. This, in my observation, translates to longer resilver periods.
Yes, the probability of failure (and resilver) goes up with more devices in a set.
This is where the MTTDL models that consider resilvering time really help. But if you
don't vary the number of disks, the probability of failure remains the same :-(
-- richard
Timothy Coalson
2013-08-06 03:21:34 UTC
Permalink
Post by Richard Elling
It is not unexpected that as the number of disks in the set increases, the work of the surviving disks is less. For example, a 4+1 raidz set will see something like:
100% busy write on the resilvering disk
25% busy read on the surviving disks
Similarly, for a 6+2 raidz2 set:
100% busy write on the resilvering disk
12.5% busy read on the surviving disks
Could you explain this? For large blocks, it should have to reconstruct
them entirely, which would give similar numbers of reads per disk as writes
to the resilvering disk (for 6+2, read 6 from 7 possible disks, write 1 to
1 disk), yes? Is resilvering always dominated by small blocks that don't
get a full stripe?

Tim



Richard Elling
2013-08-06 04:29:55 UTC
Permalink
Post by Richard Elling
Similarly, for a 6+2 raidz2 set:
100% busy write on the resilvering disk
12.5% busy read on the surviving disks
should be 14.3%, see below...
Post by Timothy Coalson
Could you explain this? For large blocks, it should have to reconstruct them entirely, which would give similar numbers of reads per disk as writes to the resilvering disk (for 6+2, read 6 from 7 possible disks, write 1 to 1 disk), yes? Is resilvering always dominated by small blocks that don't get a full stripe?
Sure. The size of records has secondary or tertiary impact. The way to think about this is:
to recover a data or parity block, you need to read enough of the other data and parity blocks
to reconstruct it. So if the set is 6+2, then you read from 7 disks (parity rotates) to reconstruct
1 disk. If the media speed of the resilvering disk is 100 MB/sec (e.g. HDD), then you need to
read 14.3 MB/sec from each of the surviving disks to generate 100 MB/sec of writes to the
resilvering disk. The IOPS measurements will follow a similar ratio.
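As a quick back-of-the-envelope sketch of that ratio (assuming the resilvering disk writes at 100 MB/sec and the reads are spread evenly over the survivors):

  # per-survivor read rate = resilver write rate / (disks in set - 1)
  for n in 5 8 12; do
    awk -v n="$n" 'BEGIN { printf "%2d-wide set: %.1f MB/sec per surviving disk\n", n, 100/(n-1) }'
  done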

To see this in data (iostat) see slides 197-198 http://www.slideshare.net/relling/zfs-tutorial-lisa-2011
Notice how this resilver sees 845 read IOPS per disk. This is a real dataset on a large system.
-- richard
aurfalien
2013-08-06 04:41:34 UTC
Permalink
Possibly the key point here is that:

This info is freeee.

Love it!

Thanks Richard.

- aurf
Tim Cook
2013-08-06 04:42:00 UTC
Permalink
Post by Richard Elling
It is not unexpected that as the number of disks in the set increases, the work of the surviving disks is less.
To see this in data (iostat) see slides 197-198 http://www.slideshare.net/relling/zfs-tutorial-lisa-2011
Notice how this resilver sees 845 read IOPS per disk. This is a real dataset on a large system.
This seems counter to what every other manufacturer has to say on the
subject. The more drives in the raid group, the larger the parity calculation,
which should drive up the CPU cost of the rebuild and in turn the rebuild
time. We're not necessarily talking about a 40% hit, but 5-10% isn't unheard
of depending on the CPU in question if you're going from a 4+1 to something
like an 18+2.

I haven't tested this specifically on ZFS but I can't see any reason it
would be exempt where others are not. Parity calculation is parity
calculation. If you were offloading to a custom ASIC of some sort that
would obviously be a different story, but not applicable here.

--Tim



Richard Elling
2013-08-06 05:17:34 UTC
Permalink
Post by Tim Cook
This seems counter to what every other manufacturer has to say on the subject. The more drives in the raid group, the larger the parity calculation, which should drive up the CPU cost of the rebuild and in turn the rebuild time. We're not necessarily talking about a 40% hit, but 5-10% isn't unheard of depending on the CPU in question if you're going from a 4+1 to something like an 18+2.
XOR in modern CPUs runs at memory bandwidth rate. For raidz2 or RAID-6, the Reed-Solomon
code isn't that much more CPU-intensive. In both cases, much faster than HDD I/O by a few orders
of magnitude.
Post by Tim Cook
I haven't tested this specifically on ZFS but I can't see any reason it would be exempt where others are not. Parity calculation is parity calculation. If you were offloading to a custom ASIC of some sort that would obviously be a different story, but not applicable here.
Note: historically, "hardware RAID" systems had wimpy CPUs. You'll hear a lot of old talk about
having to do XOR in specialized hardware. Those days are long gone. Modern arrays like the
VNX5500 have (single) 4-core Xeons @ 2.1GHz with (surprisingly) 3 DIMM slots. Even that
modest CPU is plenty for parity reconstruction.
-- richard
Timothy Coalson
2013-08-06 05:40:35 UTC
Permalink
Post by Richard Elling
The way to think about this is: to recover a data or parity block, you need to read enough of the other data and parity blocks to reconstruct it. So if the set is 6+2, then you read from 7 disks (parity rotates) to reconstruct 1 disk. If the media speed of the resilvering disk is 100 MB/sec (e.g. HDD), then you need to read 14.3 MB/sec from each of the surviving disks to generate 100 MB/sec of writes to the resilvering disk. The IOPS measurements will follow a similar ratio.
Reading the parity and nothing else will not allow you to reconstruct
anything except in the trivial case where the number of data sectors is
less than or equal to the number of parity sectors (otherwise you would have
invented fast compression with a constant ratio under unity for data of
arbitrary entropy, since if you can reconstruct any missing sector, you
could reconstruct if all were missing, right?). It is by reading the
parity along with the data that is still available that you can figure out
the missing piece (reconstruction on a full 6+2 stripe with 1 failure
combines 6 pieces of info to get 1 piece out; to match 100 MB/s of writes,
you need 600 MB/s of aggregate reads, which can be spread over the 7 healthy disks).

I thought each raidzn block was laid out with all parity sectors on the
same n devices, and it only "rotated" by having different blocks with
different alignment, which would mean that all reads/writes during
reconstruction would be linear within each block - thus I expected the IOPS
required on the disks to roughly match, too.

Post by Richard Elling
To see this in data (iostat) see slides 197-198 http://www.slideshare.net/relling/zfs-tutorial-lisa-2011
Notice how this resilver sees 845 read IOPS per disk. This is a real dataset on a large system.
The per-device read iops are higher than the write iops, but the read busy
time is lower - assuming that was a steady state, I don't know what is
going on there, and I don't know whether that backs up your point or not.
However, the throughput is exactly as I would expect for reconstructing
large blocks - each device reading the same speed as the writing device.

Tim



Simon Casady
2013-08-06 15:28:44 UTC
Permalink
Richard, look at your slide again: about 5500K read per disk and 5500K
written to the disk being resilvered, as it should be. Total I/O read is 7x
the I/O written. The more disks in the vdev, the more total read bandwidth
is needed, while the write bandwidth stays the same. The likely limiting
factor is IOPS, although I suppose with a really large number of disks in a
vdev you could saturate the controller or bus or whatever you have between
the disks and the CPU.
Pawel Jakub Dawidek
2013-10-18 19:43:44 UTC
Permalink
Post by Richard Elling
To see this in data (iostat) see slides 197-198 http://www.slideshare.net/relling/zfs-tutorial-lisa-2011
Notice how this resilver sees 845 read IOPS per disk. This is a real dataset on a large system.
Very nice. We should definitely provide a link to it at open-zfs.org.

One question, though. On page 35, where you illustrate the RAIDZ layout,
where is the parity for D2:5?

BTW, I think the minimal dedupditto value is 100, so you can't set it to 50 (page 95).
Timothy Coalson
2013-10-22 22:15:01 UTC
Permalink
Post by Pawel Jakub Dawidek
One question, though. On page 35, where you illustrate the RAIDZ layout, where is the parity for D2:5?
As no one else has replied, I'll answer the easy one - in that example, the
parities for block 2 are on disk D, as P2:0 and P2:1. P2:1 is the one that
applies to D2:4 and D2:5, if I have things right, while P2:0 applies to
D2:0-D2:3.

Tim


