Ray Van Dolson
2013-08-09 06:24:32 UTC
We have a system being used as a backup target. 239 2TB nearline SAS
disks, 16 RAIDZ3 vdevs of 15 disks each (vdev disks are distributed
across 15 JBODs). We don't have a log device for this setup, as the
writes tend to be sequential streams from our media agents and as such
one wasn't really needed (that was the thinking anyway).
Resilver tunables currently set are:
set zfs:zfs_resilver_delay = 3
set zfs:zfs_resilver_min_time_ms = 1000
.. which based on my understanding actually throttle resilver more
than the defaults.
We recently had a failed disk and the resilver process took about 313
hours (1.1TB read -- the single pool on this system has about 239TB
used and 143TB free). During that time, write speed became incredibly
slow to the point that backups were not completing in their time
windows. When the resilver finally finished things returned to normal.
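For a sense of scale, here is a rough back-of-the-envelope calculation of the average resilver rate implied by those figures. It is only a sketch: it assumes "1.1TB" means 1.1e12 bytes and that the resilver ran continuously for the whole 313 hours. The function name is made up for illustration. Under those assumptions the average works out to just under 1 MB/s.

```python
# Rough average resilver throughput from the figures quoted above.
# Assumptions (not confirmed by any tool output): "1.1TB" is 1.1e12 bytes,
# and the resilver ran without pause for the full 313 hours.
RESILVER_BYTES = 1.1e12
RESILVER_HOURS = 313


def average_rate_mb_per_s(total_bytes, hours):
    """Return the mean throughput in decimal megabytes per second."""
    seconds = hours * 3600
    return total_bytes / seconds / 1e6


rate = average_rate_mb_per_s(RESILVER_BYTES, RESILVER_HOURS)
print(f"average resilver rate: {rate:.2f} MB/s")
```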
Any best practices here we could be following that we're not? Did some
reading around tonight and am thinking that perhaps a log device might
have helped minimize impact here even though our writes tend to come in
large, lengthy streams.
Further throttling the resilver process might also have been an option,
but 313 hours is already a very long time for a disk rebuild.... maybe
we should have just paused all I/O activity (e.g. suspended backup
jobs), tuned the resilver knobs to go fast and just minimized the time
required to get back to a normal state.
This is on NexentaStor so we're also working through support, but
wanted to throw it out here for feedback as well.
Thanks,
Ray