Richard, look at your slide again: about 5500k per disk read, 5500k
written to the disk being resilvered, as it should be. Total I/O read is
7x the I/O written. The more disks in the vdev, the more total read
bandwidth is needed, while the write bandwidth stays the same. The likely
limiting factors are the aggregate read bandwidth and CPU.
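For reference, a back-of-the-envelope check of those numbers in Python (a
rough sketch; the 7-survivor count is implied by the 7x figure and matches
the 6+2 example discussed below):

    # Rough check of the slide's iostat numbers (values taken from the text above).
    per_disk_read_kb = 5500       # ~KB/s read from each surviving disk
    resilver_write_kb = 5500      # ~KB/s written to the disk being resilvered
    surviving_disks = 7           # implied by "total read = 7x written"

    total_read_kb = per_disk_read_kb * surviving_disks
    print(total_read_kb)                        # 38500 KB/s aggregate read
    print(total_read_kb / resilver_write_kb)    # 7.0 -> total read is 7x the write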
Post by Timothy Coalson
Post by aurfalien
On Mon, Aug 5, 2013 at 9:28 PM, Richard Elling <...> wrote:
Post by Sam Zaydel
Consider staying away from large VDEVs with such large drives.
Resilvering when disks fail will take forever, especially if the pool is
very busy. You may be in a situation where weeks pass before a disk is
fully resilvered.
Resilver time is mostly bound by the resilvering disk, which does not
change as you go to more disks per set. For modern ZFS, the resilver
throttle can also impact the resilvering time. But if you disable the
throttle, the bottleneck will clearly be the ability of the resilvering
disk to write. It is not unexpected that as the number of disks in the
set increases, the work of the surviving disks is less. For example, a
4+1 raidz set sees:
  100% busy write on the resilvering disk
  25% busy read on the surviving disks
while a wider set, such as 6+2, sees:
  100% busy write on the resilvering disk
  12.5% busy read on the surviving disks
(should be 14.3%, see below...)
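As a quick sketch of the arithmetic behind those busy figures (assuming, as
described above, that the resilvering disk is 100% busy writing and the read
work is spread evenly over the survivors; the function name is mine):

    def surviving_disk_busy(surviving_disks):
        # Busy fraction of each surviving disk, assuming the resilvering disk
        # is 100% busy writing and the reads are spread evenly.
        return 1.0 / surviving_disks

    print(surviving_disk_busy(4))   # 0.25   -> 25% for a 4+1 raidz (4 survivors)
    print(surviving_disk_busy(7))   # ~0.143 -> 14.3% for 6+2 with one failure
    print(surviving_disk_busy(8))   # 0.125  -> 12.5% only with 8 survivors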
Could you explain this? For large blocks, it should have to
reconstruct them entirely, which would give similar numbers of reads per
disk as writes to the resilvering disk (for 6+2, read 6 from 7 possible
disks, write 1 to 1 disk), yes? Is resilvering always dominated by small
blocks that don't get a full stripe?
Sure. The size of records has secondary or tertiary impact. To recover a
data or parity block, you need to read enough of the other data and parity
blocks to reconstruct it. So if the set is 6+2, then you read from 7 disks
(parity rotates) to reconstruct 1 disk. If the media speed for the
resilvering disk is 100 MB/sec (e.g. HDD), then you need to read 14.3
MB/sec from each of the surviving disks to generate 100 MB/sec of writes
to the resilvering disk. The IOPS measurements will follow a similar ratio.
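Reproducing that 14.3 MB/sec figure as stated in the paragraph above (a
minimal sketch; the 100 MB/sec media speed is the example value given):

    media_write_mb_s = 100        # media speed of the resilvering HDD (example value)
    surviving_disks = 7           # 6+2 raidz2 with one disk being resilvered
    per_disk_read_mb_s = media_write_mb_s / surviving_disks
    print(round(per_disk_read_mb_s, 1))   # 14.3 MB/sec read from each surviving disk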
Reading the parity and nothing else will not allow you to reconstruct
anything except in the trivial case where the number of data sectors is
less than or equal to the number of parity sectors (otherwise you would
have invented fast compression of a constant ratio under unity for data of
arbitrary entropy, since if you can reconstruct any missing sector, you
could reconstruct them all if all were missing, right?). It is by reading
the parity along with the data that is still available that you can figure
out the missing piece (reconstruction on a full 6+2 stripe with 1 failure
combines 6 pieces of info to get 1 piece out; to match 100 MB/s of writes,
you need 600 MB/s of aggregate reads, which can be spread over the 7
healthy disks).
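To make the counting concrete, here is a small sketch using raidz's first
parity, which is a plain XOR (the second parity uses a different code and
is ignored here; the 512-byte sector size is only for illustration):

    import os, functools

    def xor_blocks(blocks):
        # Byte-wise XOR of equal-length byte strings.
        return bytes(functools.reduce(lambda a, b: a ^ b, col) for col in zip(*blocks))

    # A 6-data-sector stripe with its XOR parity (6+2 with the second parity ignored).
    data = [os.urandom(512) for _ in range(6)]
    p_parity = xor_blocks(data)

    # Lose one data sector: the parity alone cannot restore it; you need the
    # parity plus the 5 surviving data sectors -- 6 pieces read per piece rebuilt.
    lost = data[2]
    rebuilt = xor_blocks([p_parity] + data[:2] + data[3:])
    assert rebuilt == lost

    # So feeding a 100 MB/s resilver write takes 6x that in aggregate reads:
    write_mb_s = 100
    aggregate_read_mb_s = 6 * write_mb_s            # 600 MB/s total
    per_disk_read_mb_s = aggregate_read_mb_s / 7    # ~85.7 MB/s over 7 healthy disks
    print(aggregate_read_mb_s, round(per_disk_read_mb_s, 1))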
I thought each raidzn block was laid out with all parity sectors on the
same n devices, and it only "rotated" by having different blocks with
different alignment, which would mean that all reads/writes during
reconstruction would be linear within each block - thus I expected the IOPS
required on the disks to roughly match, too.
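To illustrate that "rotation by alignment" idea, here is a much-simplified
toy model (this is not the real vdev_raidz_map_alloc() logic; skip/padding
sectors and the exact offset math are glossed over, and the block sizes are
made up):

    def raidz_columns(start_child, data_sectors, ndisks, nparity):
        # Toy raidz map: a block's columns are assigned round-robin starting at
        # start_child, and the first nparity columns of the block hold its parity.
        ncols = min(ndisks, nparity + data_sectors)
        cols = [(start_child + c) % ndisks for c in range(ncols)]
        return cols[:nparity], cols[nparity:]   # (parity disks, data disks)

    ndisks, nparity = 8, 2                # a 6+2 raidz2 vdev
    offset = 0                            # allocation offset in sectors (toy accounting)
    for data_sectors in (3, 6, 2, 6):     # made-up block sizes
        parity_disks, data_disks = raidz_columns(offset % ndisks, data_sectors,
                                                 ndisks, nparity)
        print("parity on", parity_disks, "data on", data_disks)
        offset += data_sectors + nparity  # next block starts where this one ended

Within any one block the parity stays on a fixed pair of children, but the
starting child drifts from block to block, so over many blocks the parity
moves around the vdev.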
To see this in data (iostat), see slides 197-198:
Post by aurfalien
http://www.slideshare.net/relling/zfs-tutorial-lisa-2011
Notice how this resilver sees 845 read IOPS per disk. This is a real
dataset on a large system.
The per-device read IOPS are higher than the write IOPS, but the read busy
time is lower - assuming that was a steady state, I don't know what is
going on there, and I don't know whether that backs up your point or not.
However, the throughput is exactly as I would expect for reconstructing
large blocks - each device reading the same speed as the writing device.
Tim