Discussion:
Resilver stuck at 99%
Liam Slusser via illumos-zfs
2014-06-10 09:16:25 UTC
I had a drive go bad two weeks ago. No big deal: I swapped in a
replacement, issued "zpool replace data driveA driveB", and away it went
resilvering. A few days later the resilver was almost done, at 99%, and
showed 31 minutes left. Great. A few days after that it hadn't gone any
further, and it has now been sitting at 99% for a week. Meanwhile the
estimated time left has been slowly creeping up from the 31 minutes I
first noticed.

OpenIndiana oi_151a8 on a Dell R720xd server.

# zpool status
  pool: data
 state: DEGRADED
status: One or more devices is currently being resilvered. The pool will
        continue to function, possibly in a degraded state.
action: Wait for the resilver to complete.
  scan: resilver in progress since Thu May 29 11:03:36 2014
        393T scanned out of 394T at 410M/s, 0h59m to go
        2.61T resilvered, 99.65% done
.
.
.
          raidz2-4                     DEGRADED     0     0     0
            c8t5000C500579C8A23d0      ONLINE       0     0     0
            c8t5000C500579CADB3d0      ONLINE       0     0     0
            c8t5000C500579C8EA3d0      ONLINE       0     0     0
            replacing-3                UNAVAIL      0     0     0
              c8t5000C500579C8757d0    UNAVAIL      0     0     0  cannot open
              c8t5000C50056FF1FB3d0    ONLINE       0     0     0  (resilvering)
            c8t5000C500579C57C7d0      ONLINE       0     0     0
            c8t5000C500579C9267d0      ONLINE       0     0     0
            c8t5000C500579CAEC7d0      ONLINE       0     0     0
            c8t5000C500579CAC17d0      ONLINE       0     0     0

There is nothing interesting in the logs. This is our backup ZFS server,
so it's constantly getting updated snapshots via zfs recv. Any idea where
to look?

thanks,
liam
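
For readers following along, the "to go" figure in the scan line of the status output appears to be roughly the data left to scan divided by the reported rate. The sketch below redoes that arithmetic with the rounded numbers from the post. It assumes the T and M suffixes are binary units (TiB and MiB/s), and eta_minutes is a hypothetical helper written for this illustration, not a ZFS command. Because zpool status rounds 393T and 394T to whole terabytes, the result only approximates the 0h59m shown.

```shell
#!/bin/sh
# eta_minutes TOTAL_TIB SCANNED_TIB RATE_MIB_PER_S
# Hypothetical helper: whole minutes remaining, assuming
# ETA = (total - scanned) / rate, with binary units.
eta_minutes() {
    awk -v total="$1" -v scanned="$2" -v rate="$3" 'BEGIN {
        remaining_mib = (total - scanned) * 1024 * 1024   # TiB -> MiB
        printf "%d\n", remaining_mib / rate / 60          # seconds -> minutes, truncated
    }'
}

# Rounded figures from the scan line: 393T of 394T at 410M/s.
eta_minutes 394 393 410   # prints 42
```

An estimate of this form rises whenever the amount left to scan grows or the rate falls, so a slowly increasing "to go" value does not by itself say which of the two is happening.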



-------------------------------------------
illumos-zfs
Archives: https://www.listbox.com/member/archive/182191/=now
RSS Feed: https://www.listbox.com/member/archive/rss/182191/23047029-187a0c8d
Modify Your Subscription: https://www.listbox.com/member/?member_id=23047029&id_secret=23047029-2e85923f
Powered by Listbox: http://www.listbox.com
Paul Kraus via illumos-zfs
2014-06-10 13:12:50 UTC
I ran into a similar problem years ago under Solaris 10. The issue was the number of snapshots (we had tens of thousands). The resilver would progress up to 98% or 99% done in a reasonable amount of time and then sit there for days, eventually completing. The explanation we got from Sun was that an aspect of the resilver that addressed snapshots was NOT taken into account by the code that reported progress.

--
Paul Kraus
***@kraus-haus.org
Ian Collins via illumos-zfs
2014-06-10 20:18:28 UTC
Post by Liam Slusser via illumos-zfs
There is nothing interesting in the logs. This is our backup ZFS
server, so it's constantly getting updated snapshots via zfs recv. Any
idea where to look?
I think you just answered your own question: "so it's constantly getting
updated". If the updates arrive faster than the resilver can process them...
--
Ian.
Liam Slusser via illumos-zfs
2014-06-10 20:42:59 UTC
Does a zfs recv job cause it to have to resilver the whole filesystem after
each job, or perhaps just resilver the snapshot we sent over? We send a
snapshot every few minutes...each snapshot is maybe a few gigabytes in size.

I can stop the replication for a day and see what happens.

thanks,
liam
Ian Collins via illumos-zfs
2014-06-11 21:10:06 UTC
Post by Liam Slusser via illumos-zfs
Does a zfs recv job cause it to have to resilver the whole filesystem
after each job, or perhaps just resilver the snapshot we sent over?
The latter.
Post by Liam Slusser via illumos-zfs
We send a snapshot every few minutes...each snapshot is maybe a few
gigabytes in size.
I can stop the replication for a day and see what happens.
That would be worth a try. I've done this in the past with a backup
server that was experiencing similar resilver issues.
--
Ian.
Jason Matthews via illumos-zfs
2014-06-11 21:54:26 UTC
If you haven't done it already:

echo zfs_resilver_delay/W0 | mdb -kw
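
The one-liner above writes 0 into the zfs_resilver_delay variable of the running kernel. A slightly more cautious variant, sketched below, prints the current value first so it can be written back later. It assumes an illumos host with mdb installed and root privileges; mdb_expr is a hypothetical helper that only builds the expression string, and on a machine without mdb the script just reports what it would have run. A value written this way lives in kernel memory only and does not survive a reboot.

```shell
#!/bin/sh
# Sketch only: intended for an illumos host, run as root.
# mdb_expr is a hypothetical helper, not part of mdb.
mdb_expr() {
    # $1 = kernel variable name, $2 = optional decimal value to write
    if [ -n "$2" ]; then
        printf '%s/W0t%d\n' "$1" "$2"   # /W writes a 4-byte value; 0t marks it as decimal
    else
        printf '%s/D\n' "$1"            # /D prints a 4-byte value as decimal
    fi
}

if command -v mdb >/dev/null 2>&1; then
    mdb_expr zfs_resilver_delay | mdb -k       # show the current value; note it down
    mdb_expr zfs_resilver_delay 0 | mdb -kw    # write 0 to the live kernel
else
    echo "mdb not found; would run: $(mdb_expr zfs_resilver_delay 0)" >&2
fi
```

To undo the change after the resilver completes, pass the value printed by the first command as the second argument to mdb_expr and pipe the result to mdb -kw.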

Liam Slusser via illumos-zfs
2014-06-11 22:42:11 UTC
Thanks all. I applied the zfs_resilver_delay setting and stopped my
replication updates, and we'll see how it goes. Hopefully that will fix
it. Thanks everybody for the help; I'll report back once it finishes.

thanks,
liam


Liam Slusser via illumos-zfs
2014-06-13 07:03:35 UTC
After stopping the replication it finished a few hours later. Thanks
everybody for the help and hints!

liam