Discussion:
Steep write penalty when resilvering
Ray Van Dolson
2013-08-09 06:24:32 UTC
We have a system being used as a backup target. 239 2TB nearline SAS
disks, 16 RAIDZ3 vdevs of 15 disks each (vdev disks are distributed
across 15 JBODs). We don't have a log device for this setup, as the
writes tend to be sequential streams from our media agents and as such
one didn't really seem necessary (that was the thinking, anyway).

Resilver tunables currently set are:

set zfs:zfs_resilver_delay = 3
set zfs:zfs_resilver_min_time_ms = 1000

.. which based on my understanding actually throttle resilver more
than the defaults.
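
(If it helps anyone reproduce this: on an illumos-based kernel I believe the
live values can be checked with mdb, e.g.

  echo zfs_resilver_delay/D | mdb -k
  echo zfs_resilver_min_time_ms/D | mdb -k

though I haven't confirmed the NexentaStor-blessed way of viewing these, so
treat that as a generic illumos sketch.)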

We recently had a failed disk and the resilver process took about 313
hours (1.1TB read -- the single pool on this system has about 239TB
used and 143TB free). During that time, write speed became incredibly
slow to the point that backups were not completing in their time
windows. When the resilver finally finished things returned to normal.

Any best practices here we could be following that we're not? Did some
reading around tonight and am thinking that perhaps a log device might
have helped minimize impact here even though our writes tend to come in
large, lengthy streams.

Further throttling the resilver process might also have been an option,
but 313 hours is already a very long time for a disk rebuild... maybe
we should have just paused all I/O activity (e.g. suspended backup
jobs), tuned the resilver knobs to go fast and just minimized the time
required to get back to a normal state.
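
In hindsight, and purely as a sketch I haven't tested on this box, my
understanding is those knobs can be flipped on the running kernel with mdb
rather than waiting on a reboot, along the lines of:

  echo zfs_resilver_delay/W0t0 | mdb -kw
  echo zfs_resilver_min_time_ms/W0t3000 | mdb -kw

where 0 drops the per-I/O resilver delay and 3000ms lets the resilver use
most of each txg sync; the exact values there are illustrative, not a
recommendation.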

This is on NexentaStor so we're also working through support, but
wanted to throw it out here for feedback as well.

Thanks,
Ray
Richard Elling
2013-08-11 20:40:40 UTC
Post by Ray Van Dolson
We have a system being used as a backup target. 239 2TB nearline SAS
disks, 16 RAIDZ3 vdevs of 15 disks each (vdev disks are distributed
across 15 JBODs). We don't have a log device for this setup, as the
writes tend to be sequential streams from our media agents and as such
one didn't really seem necessary (that was the thinking, anyway).
set zfs:zfs_resilver_delay = 3
set zfs:zfs_resilver_min_time_ms = 1000
.. which based on my understanding actually throttle resilver more
than the defaults.
We recently had a failed disk and the resilver process took about 313
hours (1.1TB read -- the single pool on this system has about 239TB
used and 143TB free). During that time, write speed became incredibly
slow to the point that backups were not completing in their time
windows. When the resilver finally finished things returned to normal.
Any best practices here we could be following that we're not? Did some
reading around tonight and am thinking that perhaps a log device might
have helped minimize impact here even though our writes tend to come in
large, lengthy streams.
It depends on whether the workload does sync writes or not. If not, then a
log device would be a waste of time.
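An easy way to check is to watch ZIL activity while the backups run --
zilstat will show it, or if you just want a raw count of synchronous
commits, a rough DTrace sketch is:

  dtrace -n 'fbt::zil_commit:entry { @ = count(); } tick-10s { exit(0); }'

If that stays near zero while the media agents are writing, a slog buys
you nothing.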
Post by Ray Van Dolson
Further throttling the resilver process might also have been an option,
but 313 hours is already a very long time for a disk rebuild... maybe
we should have just paused all I/O activity (e.g. suspended backup
jobs), tuned the resilver knobs to go fast and just minimized the time
required to get back to a normal state.
This is on NexentaStor so we're also working through support, but
wanted to throw it out here for feedback as well.
To understand how the workload and resilver interacted, you need to
analyze the iostat data. Fortunately, NexentaStor does collect some iostat
data in a sqlite database. Unfortunately, they don't really give you a way
to get at it for analysis. And, perhaps more unfortunately, the data can require
cleansing before a complete analysis can be made.

For HDDs, it is easy to overrun their ability to respond consistently in time
to I/O requests. The tunable to look for is zfs_vdev_max_pending and you
can tune this in NexentaStor UIs. The default 10 is too big for HDDs. The
iostat data will show if this is a problem.
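For example, something along these lines (the value 4 is illustrative,
not a recommendation -- check your iostat data first):

  set zfs:zfs_vdev_max_pending = 4            (/etc/system, takes effect at boot)
  echo zfs_vdev_max_pending/W0t4 | mdb -kw    (applies to the running kernel)

then watch the per-disk actv and asvc_t columns in 'iostat -xnz 10' to see
whether queue depths and service times come back down.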
-- richard

--

***@RichardElling.com
+1-760-896-4422
Ray Van Dolson
2013-08-13 12:14:25 UTC
Post by Ray Van Dolson
We have a system being used as a backup target. 239 2TB nearline SAS
disks, 16 RAIDZ3 vdevs of 15 disks each (vdev disks are distributed
across 15 JBODs). We don't have a log device for this setup, as the
writes tend to be sequential streams from our media agents and as such
one didn't really seem necessary (that was the thinking, anyway).
set zfs:zfs_resilver_delay = 3
set zfs:zfs_resilver_min_time_ms = 1000
.. which based on my understanding actually throttle resilver more
than the defaults.
We recently had a failed disk and the resilver process took about 313
hours (1.1TB read -- the single pool on this system has about 239TB
used and 143TB free). During that time, write speed became incredibly
slow to the point that backups were not completing in their time
windows. When the resilver finally finished things returned to normal.
Any best practices here we could be following that we're not? Did some
reading around tonight and am thinking that perhaps a log device might
have helped minimize impact here even though our writes tend to come in
large, lengthy streams.
It depends on whether the workload does sync writes or not. If not, then a
log device would be a waste of time.
Per your zilstat tool:

***@red-str-nxc1-p1-HB:/export/home/admin# ./zilstat -M -t 15 -p datapool
TIME                 N-MB N-MB/s N-Max-Rate B-MB B-MB/s B-Max-Rate  ops <=4kB 4-32kB >=32kB
2013 Aug 13 04:38:24    2      0          0    4      0          1  109    76      1     32
2013 Aug 13 04:38:39    0      0          0    0      0          0   24    20      0      4
2013 Aug 13 04:38:54    1      0          0    1      0          0   14     0      2     12
2013 Aug 13 04:39:09    0      0          0    0      0          0    9     1      1      7
2013 Aug 13 04:39:24    0      0          0    0      0          0    5     0      1      4
2013 Aug 13 04:39:39    0      0          0    0      0          0    7     0      0      7
2013 Aug 13 04:39:54    0      0          0    0      0          0    8     0      2      6
2013 Aug 13 04:40:09    1      0          0    1      0          0   18     7      2      9
2013 Aug 13 04:40:24    1      0          0    3      0          1   27     3      0     24
2013 Aug 13 04:40:39    1      0          0    2      0          0   22     0      0     22
2013 Aug 13 04:40:54    0      0          0    0      0          0    6     1      0      5
2013 Aug 13 04:41:09    2      0          1    3      0          1   24     0      0     24
2013 Aug 13 04:41:24    1      0          0    2      0          0   19     1      0     18
2013 Aug 13 04:41:39    2      0          1    3      0          1   24     0      0     24
2013 Aug 13 04:41:54    0      0          0    0      0          0    7     1      5      1
2013 Aug 13 04:42:09    2      0          1    3      0          1   27     0      0     27
2013 Aug 13 04:42:24    0      0          0    1      0          0   15     0      3     12
2013 Aug 13 04:42:39    1      0          1    2      0          1   17     0      0     17
2013 Aug 13 04:42:54    0      0          0    0      0          0    8     0      1      7
2013 Aug 13 04:43:09    2      0          1    3      0          1   25     0      1     24
2013 Aug 13 04:43:24    0      0          0    1      0          1   15     0      2     13
2013 Aug 13 04:43:39    3      0          1    4      0          2   35     0      0     35
2013 Aug 13 04:43:54    0      0          0    1      0          0   12     0      2     10

(This is while there are about 140MB/sec of writes going on).

Doesn't look like a log device would help us -- the ZIL is only seeing a
few MB per 15-second interval against ~140MB/sec of total writes, so
almost none of this workload is synchronous. At least if this is a
typical representation of our I/O workload.
Post by Ray Van Dolson
Further throttling the resilver process might also have been an option,
but 313 hours is already a very long time for a disk rebuild... maybe
we should have just paused all I/O activity (e.g. suspended backup
jobs), tuned the resilver knobs to go fast and just minimized the time
required to get back to a normal state.
This is on NexentaStor so we're also working through support, but
wanted to throw it out here for feedback as well.
To understand how the workload and resilver interacted, you need to
analyze the iostat data. Fortunately, NexentaStor does collect some iostat
data in a sqlite database. Unfortunately, they don't really give you a way
to get at it for analysis. And, perhaps more unfortunately, the data can
require cleansing before a complete analysis can be made.
For HDDs, it is easy to overrun their ability to respond consistently in time
to I/O requests. The tunable to look for is zfs_vdev_max_pending and you
can tune this in NexentaStor UIs. The default 10 is too big for HDDs. The
iostat data will show if this is a problem.
-- richard
We'll dig further into this.

Thanks,
Ray
