What causes a resilver to restart?

Karl Denninger via illumos-zfs
2014-09-15 03:12:51 UTC
I've had a question for a good long time on this, going back a couple of
years.

For small disks I've seen this a few times, but on large raidz and
raidz2 (particularly) volumes with big (e.g. 4TB) disks I see it a
*lot*. I used to think it was correlated with I/O load, but no longer
do. I've never seen it happen with a mirror either -- only with raidz
volumes. It has no correlation with the disk adapter involved or the
brand of disk.

Let's say we have a disk fail and replace it, or I intentionally do the
"rolling replace" deal to increase capacity of the pool. Ok, the system
starts to resilver. No problem. Except it will get 10, 20, 30% or
something into it and then restart from zero. Sometimes the first of
these will come at some impossibly-small percentage in (e.g. 2-3% complete.)

The second time it will go through what it did before MUCH faster,
almost as if it no longer has to seek around for the data to rebuild
data and interleaved parity (which seems rather odd to me); indeed it
will typically blast forward to the previous point at close to native
disk sequential I/O speed, and then it will go some further distance
forward and may do it again. Or it may complete.

At no time do I get an error posted -- no read, no write, no checksum
errors, nothing in the system logs, nothing on the console about I/O
problems, the system never hangs or otherwise misbehaves, there's no
indication of any sort of problem at all. But my understanding is that
a resilver should not restart by itself ever -- if something goes wrong
it should actually *fail* and error out, not start over. There's no
correlation with, for example, a snapshot being created or removed
either -- I do have a script that runs every 4 hours taking rolling
snapshots so users can recover files they accidentally delete without
yelling for an admin to pull them off a backup, but the times at which
the resilver restarts do not correlate with that cron job running.

I've yet to have a resilver outright *fail* (unless the replacement disk
is bad, of course) but the thought has always been in the back of my
mind.... what if it *never* completes without restarting?

Should I be getting a logged message on this, and if not, how do I
figure out why it happens? This obviously has a nasty impact on the
time required to do the disk replace, particularly for large vdevs...

The systems in question are all FreeBSD and I've seen this all the way
back several major revisions to the OS, so plenty of revs back of ZFS as
well.

Thanks in advance.
--
Karl Denninger
***@denninger.net
/The Market Ticker/
Dan Vatca via illumos-zfs
2014-09-15 08:08:42 UTC
I know that taking a snapshot during resilver will force the process to
restart. I do not know if it will restart immediately after taking the
snapshot, so you might want to check in your script that a resilver process
is running and postpone it.
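
For example, the rolling-snapshot cron job could simply skip (or
reschedule) its run whenever zpool status reports a resilver in
progress. A minimal sketch, assuming a pool simply named "pool" and a
placeholder snapshot naming scheme:

#!/bin/sh
# Skip the rolling snapshot while a resilver is running on this pool.
POOL=pool                                   # placeholder pool name
if zpool status "$POOL" | grep -q "resilver in progress"; then
        logger -t rolling-snap "resilver running on $POOL, skipping snapshot"
        exit 0
fi
zfs snapshot -r "$POOL@rolling-$(date +%Y%m%d-%H%M)"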


Richard Kojedzinszky via illumos-zfs
2014-09-15 09:49:17 UTC
I would argue with that. We have datasets which get new snapshots every
15 minutes, and a scrub/resilver on the whole pool takes around 2 days.
If snapshots restarted it, a complete resilver would never finish at
all, but it does indeed complete.
Post by Dan Vatca via illumos-zfs
I know that taking a snapshot during resilver will force the process
to restart. I do not know if it will restart immediately after taking
the snapshot, so you might want to check in your script that a
resilver process is running and postpone it.
--
Richard Kojedzinszky
Paul Kraus via illumos-zfs
2014-09-15 13:34:18 UTC
There was a time when taking a snapshot would cause a resilver to
restart. That was fixed a long time ago. Looking at zpool upgrade -v on
FreeBSD 10.0 I am guessing that it was fixed in zpool version 11; I know
that a resilver did NOT restart when taking a snapshot as of version 15.
I was running a large (at the time) storage server under Solaris 10 and
zpool 15. We ran into one of the bugs that triggered the fixes in zpool
26 :-(
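
(For reference, the on-disk version a given pool is actually running
can be checked with something like the following -- "pool" is just a
placeholder name, and pools created with feature flags will report "-"
for the version property:)

zpool get version pool      # on-disk version of this particular pool
zpool upgrade -v            # versions/features this ZFS build supports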
--
Paul Kraus ***@kraus-haus.org
Co-Chair Albacon 2014.5 http://www.albacon.org/2014/
Matthew Ahrens via illumos-zfs
2014-09-15 15:31:58 UTC
Post by Dan Vatca via illumos-zfs
I know that taking a snapshot during resilver will force the process to
restart. I do not know if it will restart immediately after taking the
snapshot, so you might want to check in your script that a resilver process
is running and postpone it.
This has not been true for at least 5 years. For more details, see my old
blog post:

https://blogs.oracle.com/ahrens/entry/new_scrub_code

--matt
Karl Denninger via illumos-zfs
2014-09-15 16:15:13 UTC
The restarts are related to memory pressure.

I have now figured out a way to force them on a test machine -- it
appears that somewhere in the resilver's memory allocation path, when it
is forced to wait for an allocation of memory, it restarts instead of
simply waiting.

I don't know what the codepath is that resilver uses in this regard; if
someone has a pointer to it I can try to chase it down.
--
Karl Denninger
***@denninger.net
/The Market Ticker/
George Wilson via illumos-zfs
2014-09-15 19:10:31 UTC
Can you provide some detail on how you came to that determination?
Normally a resilver restart may occur if you continue to try to do I/O
to the device that is "failing". I/Os to the failing device may succeed
for a while, but if any of them fail it will do a "probe" of the device,
which results in reopening the device. If that happens, the resilver is
restarted here:

vdev_open()
{
<snip>
        /*
         * If a leaf vdev has a DTL, and seems healthy, then kick off a
         * resilver.  But don't do this if we are doing a reopen for a scrub,
         * since this would just restart the scrub we are already doing.
         */
        if (vd->vdev_ops->vdev_op_leaf && !spa->spa_scrub_reopen &&
            vdev_resilver_needed(vd, NULL, NULL))
                spa_async_request(spa, SPA_ASYNC_RESILVER);
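
A quick way to see whether that path is being taken is to watch for the
device probe itself -- just a sketch, with vdev_probe() being the probe
routine referred to above:

dtrace -n 'vdev_probe:entry { printf("%Y", walltimestamp); stack(); }'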

Thanks,
George





Karl Denninger via illumos-zfs
2014-09-15 19:28:45 UTC
Post by George Wilson via illumos-zfs
Can you provide some detail on how you came to that determination?
Sure.

I have a test machine here with multiple revs on it that I can boot
quickly (beadm is a godsend), and on unmodified FreeBSD 10.1-BETA1
(which has neither my nor Steve's tweaks in it) it is pretty trivial to
drive the system into a condition where it "stalls" by applying the
following load profile to it:

1. Sync I/O to a pool (small program that loops on doing an lseek() to a
random location in a file and then writes a block of random data,
simulating the atomic actions taken by a dbms.) This does fairly-severe
violence to the total I/O throughput capacity of the pool.

2. A second program that grabs a bunch of wired memory and dirties it,
to provide RAM pressure (Postgres does this, basically, for its internal
memory cache.)

3. A large file copy from a PC, using a full-disk backup program
(specifically Acronis) with the target on the same pool over a Gig-E
network.

This will very rapidly fill the ARC cache and, with unmodified code,
starts producing system "stalls". That is the environment that I've
been working with in reproducing and fixing that problem.

Just to make it "more fun" I added another stressor -- I intentionally
faulted one of the four members of the raidz2 pool that I/O was going
to, and then replaced the disk, forcing a full resilver. I noted during
that time that the resilver restarted several times, although there was
no I/O error logged either in the zpool or on the console.

Moving the ARC's eviction threshold upward so as to make sure that
dirty_max cannot result in the system paging (that is, the free-RAM
level at which ARC eviction kicks in exceeds pagedaemon's minimum free
page count by at least dirty_max) completely stops the resilver restarts
(and the I/O stalls at the same time, incidentally.)

This, however, is not a strategy that FreeBSD can use generally, because
it results in the VM pager never waking up to scavenge cache and
inactive pages under normal conditions -- it will only do so under
extreme I/O stress when the dirty buffer count reaches near the
maximum. That only happens when I/O demand grossly exceeds what the
system is capable of sustaining. If, on a FreeBSD system, the ARC were
to leave a reserve of 10% of RAM (the default dirty_max value, for
example) between its in-use and paging thresholds, then over time the
ARC would be slammed to its minimum and inactive pages would accumulate
until that extreme condition was reached, since pagedaemon would never
otherwise wake up and scrub them.

This is very repeatable on my test system; I do not need to reboot to
change the arc eviction settings with the sysctls I have in the test
machine. Unfortunately what I don't have is anything in the log
telling me *why* the resilver restarted, since there are no errors
logged of any sort. This does not impact a scrub -- I can hammer the
living daylights out of the system under the same test conditions with a
scrub running and that just chugs along -- it gets very slow in its
progress as you'd expect, but it continues onward.

How does the system determine that a device is "failing"? Is being
under pressure to the extent that I/O to it is materially delayed (but
*no* errors are returned) sufficient? If so that's a problem because
under severe memory pressure that's exactly what happens, but the I/O
does in fact complete.
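
For reference, these are the sorts of knobs involved on the FreeBSD
side (sysctl names as on 10.x; they can differ between ZFS revisions,
so treat this as a sketch):

# Read-only inspection of the thresholds discussed above.
sysctl vfs.zfs.arc_max vfs.zfs.arc_min    # ARC sizing limits
sysctl vfs.zfs.dirty_data_max             # dirty-data (write throttle) cap
sysctl vm.v_free_min vm.v_free_target     # pagedaemon free-page thresholds (pages)
sysctl kstat.zfs.misc.arcstats.size       # current ARC size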
--
Karl Denninger
***@denninger.net
/The Market Ticker/
Karl Denninger via illumos-zfs
2014-09-15 19:38:24 UTC
To add to the above: Is a device considered "failing" if, during a
resilver, I/O stacks up to the point that an attempt to allocate memory
for an I/O buffer fails and has to be retried? (I assume the request
doesn't simply block until memory is available; if it does then it's
back to just being slow.)

I'm having trouble finding the path of execution that leads to the
restart.... I suppose I could put a trap in the code where the above is
kicked off; if it trips then I know the system thinks the device is
"failing", but that doesn't explain why it thinks so when there are no
errors.
--
Karl Denninger
***@denninger.net
/The Market Ticker/
Karl Denninger via illumos-zfs
2014-09-15 21:42:58 UTC
Interesting...

I put a kernel printf into the code you referenced above so if it
triggered I would get a line on the console....

        /*
         * If a leaf vdev has a DTL, and seems healthy, then kick off a
         * resilver.  But don't do this if we are doing a reopen for a scrub,
         * since this would just restart the scrub we are already doing.
         */
        if (vd->vdev_ops->vdev_op_leaf && !spa->spa_scrub_reopen &&
            vdev_resilver_needed(vd, NULL, NULL))
                spa_async_request(spa, SPA_ASYNC_RESILVER);
        printf("Restart/start on apparent failed/init I/O\n");

The pool right now has one device offline; the other three are up
(raidz2). I warmed up the cache, started the txg-syncing script from
Adam's blog here
(http://dtrace.org/blogs/ahl/2014/08/31/openzfs-tuning/) to see how much
dirty data was being allocated vs. the usual 10% of RAM available, and
then hammered the system as above. No reopens were logged, even though
I was able to run into the dirty_data_max limit.

Then I intentionally lowered dirty_data_max to throttle the I/O
"unnaturally" -- still no reopen was logged.

But as soon as I online'd the fourth disk in the pool (which needed
resilvering), I started getting reopens; they correlated with running
into dirty_data_max, and I could cause one "at will" by lowering the
value below the current amount of dirty data in use.

But here's the interesting thing -- the resilver didn't restart. At
least not right away (but I suspect a restart is in the offing, given
your description and the fact that the call was made to ask for it.)

This looks like pretty solid evidence that the resilver code thinks
deferred I/O due to tight memory = failing disk, triggering the reopen
and thus the restart.

Now the question is where in the code it's coming up with that
conclusion -- it certainly appears wrong, as there is no actual I/O
error taking place.
--
Karl Denninger
***@denninger.net
/The Market Ticker/
Karl Denninger via illumos-zfs
2014-09-16 00:28:40 UTC
That's not where the problem is coming from. That call is made
repeatedly from the async thread.

My thesis is blown as to the cause; back to start.
--
Karl Denninger
***@denninger.net
/The Market Ticker/
Matthew Ahrens via illumos-zfs
2014-09-16 00:43:08 UTC
Post by Karl Denninger via illumos-zfs
That's not where the problem is coming from. That call is made repeatedly
from the async thread.
My thesis is blown as to the cause; back to start.
Did you try using dtrace to see what the stack is when the async call is
dispatched? i.e. something like

dtrace -n 'spa_async_request:entry{stack()}'
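
A variant that also prints the time and the request flag can make it
easier to tie a particular request to the restart (a sketch; arg1 is
the flag argument to spa_async_request()):

dtrace -n 'spa_async_request:entry
    { printf("%Y flag=0x%x", walltimestamp, arg1); stack(); }'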

--matt



Steven Hartland via illumos-zfs
2014-09-16 01:01:21 UTC
Post by Matthew Ahrens via illumos-zfs
dtrace -n 'spa_async_request:entry{stack()}'
From what I can tell there's only a single way in there, and that's from
the unexpected_error case in zio_vdev_io_done().

Regards
Steve
Karl Denninger via illumos-zfs
2014-09-16 02:31:49 UTC
Well, that may be the only apparent way for it to happen, but it's not
what's causing the restarts. :-)

(I stuck a dtrace probe in there for error returns, forced a restart of
a resilver and..... no error popped up.)

That implies there's something else in the code that can cause a
resilver to restart -- something, it appears, that does not get tripped
by an error hitting the unexpected_error predicate in zio_vdev_io_done
(other than error 45 (EOPNOTSUPP), which gets posted on a fairly
consistent basis for all the devices in the system and appears to be
the result of a capabilities probe.)
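
One way to cut through that noise -- a sketch, assuming the zfs.ko on
the box was built with CTF so fbt can type the zio argument -- is to
log only the errors other than the routine EOPNOTSUPP returns:

dtrace -n 'fbt::zio_vdev_io_done:entry
    /args[0]->io_error != 0 && args[0]->io_error != 45/
    { printf("%Y io_error=%d", walltimestamp, args[0]->io_error); stack(); }'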
--
Karl Denninger
***@denninger.net
/The Market Ticker/
George Wilson via illumos-zfs
2014-09-16 03:26:34 UTC
Can you run 'dtrace -n dsl_resilver_restart:entry {stack();}'?

- George




Karl Denninger via illumos-zfs
2014-09-16 03:40:28 UTC
Post by George Wilson via illumos-zfs
Can you run 'dtrace -n dsl_resilver_restart:entry {stack();}'?
Sure -- I get this every 5-10 seconds.

dtrace: description 'dsl_resilver_restart:entry ' matched 1 probe
CPU ID FUNCTION:NAME
9 53343 dsl_resilver_restart:entry
zfs.ko`spa_async_thread+0x198
kernel`fork_exit+0x9a
kernel`0xffffffff80cedd5e

3 53343 dsl_resilver_restart:entry
zfs.ko`spa_async_thread+0x198
kernel`fork_exit+0x9a
kernel`0xffffffff80cedd5e

0 53343 dsl_resilver_restart:entry
zfs.ko`spa_async_thread+0x198
kernel`fork_exit+0x9a
kernel`0xffffffff80cedd5e

7 53343 dsl_resilver_restart:entry
zfs.ko`spa_async_thread+0x198
kernel`fork_exit+0x9a
kernel`0xffffffff80cedd5e

5 53343 dsl_resilver_restart:entry
zfs.ko`spa_async_thread+0x198
kernel`fork_exit+0x9a
kernel`0xffffffff80cedd5e

...

But it doesn't restart every 10 seconds, of course. I'll see if I can
catch the actual event with something different in the probe output.

Here's the pool in question:

***@Dbms-10:/ # zpool status pool
pool: pool
state: DEGRADED
status: One or more devices is currently being resilvered. The pool will
continue to function, possibly in a degraded state.
action: Wait for the resilver to complete.
scan: resilver in progress since Mon Sep 15 22:34:03 2014
28.6G scanned out of 706G at 97.8M/s, 1h58m to go
7.12G resilvered, 4.05% done
config:

        NAME                       STATE     READ WRITE CKSUM
        pool                       DEGRADED     0     0     0
          raidz2-0                 DEGRADED     0     0     0
            replacing-0            DEGRADED     0     0     0
              7403923050358881158  UNAVAIL      0     0     0  was /dev/label/pool0/old
              195715434397498954   OFFLINE      0     0     0  was /dev/label/pool0
              label/pool4          ONLINE       0     0     0  (resilvering)
            label/pool1            ONLINE       0     0     0
            label/pool2            ONLINE       0     0     0
            label/pool3            ONLINE       0     0     0

errors: No known data errors
--
Karl Denninger
***@denninger.net
/The Market Ticker/
Karl Denninger via illumos-zfs
2014-09-16 04:02:16 UTC
Just got a restart, no difference in the log entries (the particular one
that was associated with the restart cannot be identified; they're all
the same.)
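
If it would help to tell the individual calls apart, the probe can
print the time and the txg argument as well (a sketch, assuming the
dsl_resilver_restart(dsl_pool_t *, uint64_t txg) signature of that
era's code):

dtrace -n 'dsl_resilver_restart:entry
    { printf("%Y txg=%d", walltimestamp, arg1); stack(); }'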
--
Karl Denninger
***@denninger.net <mailto:***@denninger.net>
/The Market Ticker/



George Wilson via illumos-zfs
2014-09-16 13:28:33 UTC
Permalink
Post by Karl Denninger via illumos-zfs
Post by Karl Denninger via illumos-zfs
Post by George Wilson via illumos-zfs
Post by Karl Denninger via illumos-zfs
----- Original Message ----- From: "Matthew Ahrens via
Post by Matthew Ahrens via illumos-zfs
Post by Karl Denninger via illumos-zfs
That's not where the problem is coming from. That call is made repeatedly
from the async thread.
My thesis is blown as to the cause; back to start.
Did you try using dtrace to see what the stack is when the async call is
dispatched? i.e. something like
dtrace -n 'spa_async_request:entry{stack()}'
From what I can tell there's only a single way in there, and that's from
the unexpected_error case in zio_vdev_io_done().
Regards
Steve
Well, that may be the only apparent way for it to happen, but it's
not what's causing the restarts. :-)
(I stuck a dtrace probe in there for error returns, forced a
restart of a resilver and..... no error popped up.)
That implies there's something else in the code that can cause a
resilver to restart -- something, it appears, that does not get
tripped by a predicate error out of zio_vdev_io_done (other than
error 45, which gets posted out on a fairly consistent basis for
all the devices in the system and appears to be a result of a
capabilities probe.)
--
Karl Denninger
/The Market Ticker/
Can you run 'dtrace -n dsl_resilver_restart:entry {stack();}'?
- George
Sure -- I get this every 5-10 seconds.
dtrace: description 'dsl_resilver_restart:entry ' matched 1 probe
CPU ID FUNCTION:NAME
9 53343 dsl_resilver_restart:entry
zfs.ko`spa_async_thread+0x198
kernel`fork_exit+0x9a
kernel`0xffffffff80cedd5e
3 53343 dsl_resilver_restart:entry
zfs.ko`spa_async_thread+0x198
kernel`fork_exit+0x9a
kernel`0xffffffff80cedd5e
0 53343 dsl_resilver_restart:entry
zfs.ko`spa_async_thread+0x198
kernel`fork_exit+0x9a
kernel`0xffffffff80cedd5e
7 53343 dsl_resilver_restart:entry
zfs.ko`spa_async_thread+0x198
kernel`fork_exit+0x9a
kernel`0xffffffff80cedd5e
5 53343 dsl_resilver_restart:entry
zfs.ko`spa_async_thread+0x198
kernel`fork_exit+0x9a
kernel`0xffffffff80cedd5e
...
But it doesn't restart every 10 seconds, of course. I'll see if I
can catch the actual event with something different in the probe output.
  pool: pool
 state: DEGRADED
status: One or more devices is currently being resilvered. The pool will
        continue to function, possibly in a degraded state.
action: Wait for the resilver to complete.
  scan: resilver in progress since Mon Sep 15 22:34:03 2014
        28.6G scanned out of 706G at 97.8M/s, 1h58m to go
        7.12G resilvered, 4.05% done

        NAME                       STATE     READ WRITE CKSUM
        pool                       DEGRADED     0     0     0
          raidz2-0                 DEGRADED     0     0     0
            replacing-0            DEGRADED     0     0     0
              7403923050358881158  UNAVAIL      0     0     0  was /dev/label/pool0/old
              195715434397498954   OFFLINE      0     0     0  was /dev/label/pool0
              label/pool4          ONLINE       0     0     0  (resilvering)
            label/pool1            ONLINE       0     0     0
            label/pool2            ONLINE       0     0     0
            label/pool3            ONLINE       0     0     0

errors: No known data errors
Just got a restart, no difference in the log entries (the particular
one that was associated with the restart cannot be identified; they're
all the same.)
--
Karl Denninger
/The Market Ticker/
Here's what's happening:

- zio_vdev_io_done() gets an error
- vdev_probe() gets called and requests an async probe by calling
spa_async_request(spa, SPA_ASYNC_PROBE);
- async thread calls spa_async_probe() which calls vdev_reopen() and
then vdev_open()
- vdev_open() requests an async resilver restart by calling
spa_async_request(spa, SPA_ASYNC_RESILVER);
- async thread calls dsl_resilver_restart() which schedules a restart at
the start of the next txg
- this triggers dsl_scan_sync() to restart the scan the next time it runs:

	/*
	 * Check for scn_restart_txg before checking spa_load_state, so
	 * that we can restart an old-style scan while the pool is being
	 * imported (see dsl_scan_init).
	 */
	if (scn->scn_restart_txg != 0 &&
	    scn->scn_restart_txg <= tx->tx_txg) {
		pool_scan_func_t func = POOL_SCAN_SCRUB;
		dsl_scan_done(scn, B_FALSE, tx);
		if (vdev_resilver_needed(spa->spa_root_vdev, NULL, NULL))
			func = POOL_SCAN_RESILVER;
		zfs_dbgmsg("restarting scan func=%u txg=%llu",
		    func, tx->tx_txg);
		dsl_scan_setup_sync(&func, tx);
	}

In the end it's because we're doing I/O to a "failing" device and it's
periodically causing the resilver to restart.
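
One way to see which device is driving that chain, sketched under the
assumption that fbt can attach to vdev_probe() (it is a static function, so
it may be inlined away) and that CTF data for zfs.ko lets the struct vdev
cast resolve:

dtrace -n '
vdev_probe:entry
{
        /* path of the leaf vdev being probed, plus who asked for the probe */
        printf("probing %s", stringof(((struct vdev *)arg0)->vdev_path));
        stack();
}'

If this chain is what is firing, each hit here should be followed shortly by
one of the spa_async_thread() -> dsl_resilver_restart() stacks shown earlier.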

- George





Steven Hartland via illumos-zfs
2014-09-16 13:45:50 UTC
Permalink
----- Original Message -----
From: "George Wilson via illumos-zfs" <***@lists.illumos.org>
To: <***@lists.illumos.org>; "Karl Denninger" <***@denninger.net>
Sent: Tuesday, September 16, 2014 2:28 PM
Subject: Re: [zfs] What causes a resilver to restart?
Post by George Wilson via illumos-zfs
Post by Karl Denninger via illumos-zfs
Post by Karl Denninger via illumos-zfs
Post by George Wilson via illumos-zfs
Post by Karl Denninger via illumos-zfs
----- Original Message ----- From: "Matthew Ahrens via
Post by Matthew Ahrens via illumos-zfs
Post by Karl Denninger via illumos-zfs
That's not where the problem is coming from. That call is made repeatedly
from the async thread.
My thesis is blown as to the cause; back to start.
Did you try using dtrace to see what the stack is when the async call is
dispatched? i.e. something like
dtrace -n 'spa_async_request:entry{stack()}'
From what I can tell there's only a single way in there, and that's from
the unexpected_error case in zio_vdev_io_done().
Regards
Steve
Well, that may be the only apparent way for it to happen, but it's
not what's causing the restarts. :-)
(I stuck a dtrace probe in there for error returns, forced a
restart of a resilver and..... no error popped up.)
That implies there's something else in the code that can cause a
resilver to restart -- something, it appears, that does not get
tripped by a predicate error out of zio_vdev_io_done (other than
error 45, which gets posted out on a fairly consistent basis for
all the devices in the system and appears to be a result of a
capabilities probe.)
--
Karl Denninger
/The Market Ticker/
Can you run 'dtrace -n dsl_resilver_restart:entry {stack();}'?
- George
Sure -- I get this every 5-10 seconds.
dtrace: description 'dsl_resilver_restart:entry ' matched 1 probe
CPU ID FUNCTION:NAME
9 53343 dsl_resilver_restart:entry
zfs.ko`spa_async_thread+0x198
kernel`fork_exit+0x9a
kernel`0xffffffff80cedd5e
3 53343 dsl_resilver_restart:entry
zfs.ko`spa_async_thread+0x198
kernel`fork_exit+0x9a
kernel`0xffffffff80cedd5e
0 53343 dsl_resilver_restart:entry
zfs.ko`spa_async_thread+0x198
kernel`fork_exit+0x9a
kernel`0xffffffff80cedd5e
7 53343 dsl_resilver_restart:entry
zfs.ko`spa_async_thread+0x198
kernel`fork_exit+0x9a
kernel`0xffffffff80cedd5e
5 53343 dsl_resilver_restart:entry
zfs.ko`spa_async_thread+0x198
kernel`fork_exit+0x9a
kernel`0xffffffff80cedd5e
...
But it doesn't restart every 10 seconds, of course. I'll see if I
can catch the actual event with something different in the probe output.
  pool: pool
 state: DEGRADED
status: One or more devices is currently being resilvered. The pool will
        continue to function, possibly in a degraded state.
action: Wait for the resilver to complete.
  scan: resilver in progress since Mon Sep 15 22:34:03 2014
        28.6G scanned out of 706G at 97.8M/s, 1h58m to go
        7.12G resilvered, 4.05% done

        NAME                       STATE     READ WRITE CKSUM
        pool                       DEGRADED     0     0     0
          raidz2-0                 DEGRADED     0     0     0
            replacing-0            DEGRADED     0     0     0
              7403923050358881158  UNAVAIL      0     0     0  was /dev/label/pool0/old
              195715434397498954   OFFLINE      0     0     0  was /dev/label/pool0
              label/pool4          ONLINE       0     0     0  (resilvering)
            label/pool1            ONLINE       0     0     0
            label/pool2            ONLINE       0     0     0
            label/pool3            ONLINE       0     0     0

errors: No known data errors
Just got a restart, no difference in the log entries (the particular
one that was associated with the restart cannot be identified; they're
all the same.)
--
Karl Denninger
/The Market Ticker/
- zio_vdev_io_done() gets an error
- vdev_probe() gets called and requests an async probe by calling
spa_async_request(spa, SPA_ASYNC_PROBE);
- async thread calls spa_async_probe() which calls vdev_reopen() and
then vdev_open()
- vdev_open() requests an async resilver restart by calling
spa_async_request(spa, SPA_ASYNC_RESILVER);
- async thread calls dsl_resilver_restart() which schedules a restart at
the start of the next txg
	/*
	 * Check for scn_restart_txg before checking spa_load_state, so
	 * that we can restart an old-style scan while the pool is being
	 * imported (see dsl_scan_init).
	 */
	if (scn->scn_restart_txg != 0 &&
	    scn->scn_restart_txg <= tx->tx_txg) {
		pool_scan_func_t func = POOL_SCAN_SCRUB;
		dsl_scan_done(scn, B_FALSE, tx);
		if (vdev_resilver_needed(spa->spa_root_vdev, NULL, NULL))
			func = POOL_SCAN_RESILVER;
		zfs_dbgmsg("restarting scan func=%u txg=%llu",
		    func, tx->tx_txg);
		dsl_scan_setup_sync(&func, tx);
	}
In the end it's because we're doing I/O to a "failing" device and it's
periodically causing the resilver to restart.
I've yet to test, but given the information at hand I believe it could be a side effect
of TRIM requests returning EOPNOTSUPP (errno 45), which is triggering the unexpected_error case.

If so, something along the lines of the following will fix it:

Index: sys/cddl/contrib/opensolaris/uts/common/fs/zfs/zio.c
===================================================================
--- sys/cddl/contrib/opensolaris/uts/common/fs/zfs/zio.c	(revision 271518)
+++ sys/cddl/contrib/opensolaris/uts/common/fs/zfs/zio.c	(working copy)
@@ -2740,7 +2740,8 @@ zio_vdev_io_done(zio_t *zio)
 		if (zio->io_error) {
 			if (!vdev_accessible(vd, zio)) {
 				zio->io_error = SET_ERROR(ENXIO);
-			} else {
+			} else if (zio->io_type != ZIO_TYPE_FREE ||
+			    zio->io_error != EOPNOTSUPP) {
 				unexpected_error = B_TRUE;
 			}
 		}
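
A quick way to test that theory before patching would be to aggregate the
errors zio_vdev_io_done() actually sees, keyed by I/O type; a pile of entries
for the FREE type with errno 45 would point straight at rejected TRIMs.
Again only a sketch, assuming CTF data for zfs.ko so the struct zio cast
resolves (io_type and io_error are the same fields the patch tests):

dtrace -n '
zio_vdev_io_done:entry
/((struct zio *)arg0)->io_error != 0/
{
        /* key: I/O type (ZIO_TYPE_FREE is the TRIM path) and errno */
        @errs[((struct zio *)arg0)->io_type, ((struct zio *)arg0)->io_error] = count();
}'

If the only pairs that show up come from the FREE type with errno 45, the
guard above should stop the probe/reopen cycle.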

Regards
Steve
Karl Denninger via illumos-zfs
2014-09-16 15:59:15 UTC
Permalink
----- Original Message ----- From: "George Wilson via illumos-zfs"
Sent: Tuesday, September 16, 2014 2:28 PM
Subject: Re: [zfs] What causes a resilver to restart?
Post by George Wilson via illumos-zfs
Post by Karl Denninger via illumos-zfs
Post by Karl Denninger via illumos-zfs
Post by George Wilson via illumos-zfs
Post by Karl Denninger via illumos-zfs
----- Original Message ----- From: "Matthew Ahrens via
Post by Matthew Ahrens via illumos-zfs
Post by Karl Denninger via illumos-zfs
That's not where the problem is coming from. That call is made repeatedly
from the async thread.
My thesis is blown as to the cause; back to start.
Did you try using dtrace to see what the stack is when the async call is
dispatched? i.e. something like
dtrace -n 'spa_async_request:entry{stack()}'
From what I can tell there's only a single way in there, and that's from
the unexpected_error case in zio_vdev_io_done().
Regards
Steve
Well, that may be the only apparent way for it to happen, but
it's not what's causing the restarts. :-)
(I stuck a dtrace probe in there for error returns, forced a
restart of a resilver and..... no error popped up.)
That implies there's something else in the code that can cause a
resilver to restart -- something, it appears, that does not get
tripped by a predicate error out of zio_vdev_io_done (other than
error 45, which gets posted out on a fairly consistent basis for
all the devices in the system and appears to be a result of a
capabilities probe.)
--
Karl Denninger
/The Market Ticker/
Can you run 'dtrace -n dsl_resilver_restart:entry {stack();}'?
- George
Sure -- I get this every 5-10 seconds.
dtrace: description 'dsl_resilver_restart:entry ' matched 1 probe
CPU ID FUNCTION:NAME
9 53343 dsl_resilver_restart:entry
zfs.ko`spa_async_thread+0x198
kernel`fork_exit+0x9a
kernel`0xffffffff80cedd5e
3 53343 dsl_resilver_restart:entry
zfs.ko`spa_async_thread+0x198
kernel`fork_exit+0x9a
kernel`0xffffffff80cedd5e
0 53343 dsl_resilver_restart:entry
zfs.ko`spa_async_thread+0x198
kernel`fork_exit+0x9a
kernel`0xffffffff80cedd5e
7 53343 dsl_resilver_restart:entry
zfs.ko`spa_async_thread+0x198
kernel`fork_exit+0x9a
kernel`0xffffffff80cedd5e
5 53343 dsl_resilver_restart:entry
zfs.ko`spa_async_thread+0x198
kernel`fork_exit+0x9a
kernel`0xffffffff80cedd5e
...
But it doesn't restart every 10 seconds, of course. I'll see if I
can catch the actual event with something different in the probe output.
  pool: pool
 state: DEGRADED
status: One or more devices is currently being resilvered. The pool will
        continue to function, possibly in a degraded state.
action: Wait for the resilver to complete.
  scan: resilver in progress since Mon Sep 15 22:34:03 2014
        28.6G scanned out of 706G at 97.8M/s, 1h58m to go
        7.12G resilvered, 4.05% done

        NAME                       STATE     READ WRITE CKSUM
        pool                       DEGRADED     0     0     0
          raidz2-0                 DEGRADED     0     0     0
            replacing-0            DEGRADED     0     0     0
              7403923050358881158  UNAVAIL      0     0     0  was /dev/label/pool0/old
              195715434397498954   OFFLINE      0     0     0  was /dev/label/pool0
              label/pool4          ONLINE       0     0     0  (resilvering)
            label/pool1            ONLINE       0     0     0
            label/pool2            ONLINE       0     0     0
            label/pool3            ONLINE       0     0     0

errors: No known data errors
Just got a restart, no difference in the log entries (the particular
one that was associated with the restart cannot be identified;
they're all the same.)
--
Karl Denninger
/The Market Ticker/
- zio_vdev_io_done() gets an error
- vdev_probe() gets called and requests an async probe by calling
spa_async_request(spa, SPA_ASYNC_PROBE);
- async thread calls spa_async_probe() which calls vdev_reopen() and
then vdev_open()
- vdev_open() requests an async resilver restart by calling
spa_async_request(spa, SPA_ASYNC_RESILVER);
- async thread calls dsl_resilver_restart() which schedules a restart
at the start of the next txg
	/*
	 * Check for scn_restart_txg before checking spa_load_state, so
	 * that we can restart an old-style scan while the pool is being
	 * imported (see dsl_scan_init).
	 */
	if (scn->scn_restart_txg != 0 &&
	    scn->scn_restart_txg <= tx->tx_txg) {
		pool_scan_func_t func = POOL_SCAN_SCRUB;
		dsl_scan_done(scn, B_FALSE, tx);
		if (vdev_resilver_needed(spa->spa_root_vdev, NULL, NULL))
			func = POOL_SCAN_RESILVER;
		zfs_dbgmsg("restarting scan func=%u txg=%llu",
		    func, tx->tx_txg);
		dsl_scan_setup_sync(&func, tx);
	}
In the end it's because we're doing I/O to a "failing" device and
it's periodically causing the resilver to restart.
I've yet to test, but given the information at hand I believe it could be a side effect
of TRIM requests returning EOPNOTSUPP (errno 45), which is triggering the
unexpected_error case.
If so, something along the lines of the following will fix it:
Index: sys/cddl/contrib/opensolaris/uts/common/fs/zfs/zio.c
===================================================================
--- sys/cddl/contrib/opensolaris/uts/common/fs/zfs/zio.c	(revision 271518)
+++ sys/cddl/contrib/opensolaris/uts/common/fs/zfs/zio.c	(working copy)
@@ -2740,7 +2740,8 @@ zio_vdev_io_done(zio_t *zio)
 		if (zio->io_error) {
 			if (!vdev_accessible(vd, zio)) {
 				zio->io_error = SET_ERROR(ENXIO);
-			} else {
+			} else if (zio->io_type != ZIO_TYPE_FREE ||
+			    zio->io_error != EOPNOTSUPP) {
 				unexpected_error = B_TRUE;
 			}
 		}
Regards
Steve
It is coming from TRIM requests; the system flags the device as
TRIM-incapable on the first batch of them, but the error still falls through
the io_done test and as a result triggers a reopen -- which clears that flag.

I believe Steve's patch will work; what I wrote and put in does the same
thing:


--- zio.c	(revision 271435)
+++ zio.c	(working copy)
@@ -2741,7 +2741,17 @@
 			if (!vdev_accessible(vd, zio)) {
 				zio->io_error = SET_ERROR(ENXIO);
 			} else {
-				unexpected_error = B_TRUE;
+				/*
+				 * If an unsupported TRIM was attempted this is not a vdev-
+				 * restarting event. KD 2014-09-16
+				 */
+				DTRACE_PROBE2(vdev__io_done, uint64_t, zio->io_error, uint64_t, zio->io_type);
+				if (zio->io_error != EOPNOTSUPP ||
+				    zio->io_type != ZIO_TYPE_FREE) {
+					unexpected_error = B_TRUE;
+				} /* else {
+					zio->io_vd->vdev_notrim = B_TRUE;
+				} */
 			}
 		}
 	}
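
With a rebuilt zfs.ko loaded, the earlier one-liner is probably the simplest
regression check; apart from the initial start of the resilver it should stay
quiet, and the scan start time reported by zpool status should stop resetting:

# watch for further restart requests with the patched module loaded
dtrace -n 'dsl_resilver_restart:entry { printf("%Y", walltimestamp); stack(); }'

# in another terminal, confirm forward progress and a stable start time
zpool status pool | grep -A 2 scan: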
--
Karl Denninger
***@denninger.net <mailto:***@denninger.net>
/The Market Ticker/


