Discussion:
problem importing pool
Mika Anderson
2013-11-17 21:37:53 UTC
I am having difficulty importing a specific pool.

The pool was originally running under OpenIndiana oi_151.1.7 under ESXi 5.1
and had been stable for several months. A few weeks ago, I was deleting a
couple thousand small files on a CIFS share from a Windows client when
OpenIndiana crashed. On reboot, the system would consistently crash unless
the drives were disconnected.

When the system first crashed, I posted a message on the
OpenIndiana-discuss list (
http://openindiana.org/pipermail/openindiana-discuss/2013-October/014089.html
). At that point, I thought I would be able to import the pool and have
everything back up and running. However, when I actually ran zpool import
-f, again the system crashed.

In further testing, I installed a few different ZFS capable systems
including FreeNAS and OmniOS. With each one, attempting to import the pool
would result in some type of crash. Most recently, I’ve been trying
OmniOS, a typical session is below.

***@omnios:~# zpool import

   pool: pool_cp
     id: 2315603491305675713
  state: ONLINE
 status: Some supported features are not enabled on the pool.
 action: The pool can be imported using its name or numeric identifier, though
         some features will not be available without an explicit 'zpool upgrade'.
 config:

        pool_cp      ONLINE
          c20t7d0    ONLINE
          c20t8d0    ONLINE
          c27t6d0    ONLINE
          c24t2d0    ONLINE
        logs
          c2t2d0     ONLINE    <== this is a vmdk file

   pool: pool_4k
     id: 3656244351407620617
  state: ONLINE
 status: The pool was last accessed by another system.
 action: The pool can be imported using its name or numeric identifier and
         the '-f' flag.
    see: http://illumos.org/msg/ZFS-8000-EY
 config:

        pool_4k                    ONLINE
          mirror-0                 ONLINE
            c7t50014EE0036CF68Ed0  ONLINE
            c7t50014EE0036D0018d0  ONLINE
          mirror-2                 ONLINE
            c20t2d0                ONLINE
            c20t3d0                ONLINE
          mirror-3                 ONLINE
            c20t4d0                ONLINE
            c20t5d0                ONLINE
          mirror-4                 ONLINE
            c7t50014EE207A0D2F9d0  ONLINE
            c7t50014EE0AE214005d0  ONLINE
          mirror-5                 ONLINE
            c20t13d0               ONLINE
            c20t14d0               ONLINE
        cache
          c7t5001517BB29D9C18d0               <== not showing online?
        logs
          c2t1d0                   ONLINE     <== this is a vmdk file

***@omnios:~#

***@omnios:~# zpool import -f pool_cp
***@omnios:/# cd /pool_cp
***@omnios:/pool_cp# ls
Bonnie.log  crashplan
***@omnios:/pool_cp# cd ~
***@omnios:~# zpool export pool_cp
***@omnios:~# zpool import -f pool_4k

****system crash****

I’m looking for some information on what I can try to get this pool
imported. If it would help, I could post a crash dump.

Thanks in advance for any help.

Best regards,

Mika



Jim Klimov
2013-11-17 21:56:24 UTC
Post by Mika Anderson
I am having difficulty importing a specific pool.
The pool was originally running under OpenIndiana oi_151.1.7 under ESXi
5.1 and had been stable for several months. A few weeks ago, I was
deleting a couple thousand small files on a CIFS share from a Windows
client when OpenIndiana crashed. On reboot, the system would
consistently crash unless the drives were disconnected.
Just in case: did the storage system use dedup? Deleting files on a deduped
pool can produce a lot of load (each freed block has to be looked up in the
DDT and its reference count decremented, and the block is only released once
the count reaches zero... some of this can require a lot of unswappable
kernel RAM).

My experience with such crashes was that after a couple of weeks of
rebooting the computer, it finally cleared the deletion queue.
The characteristic symptom was a freeze (after 3-5 hours of work) with a
rapid drop in available RAM and a spike in scanrate into the millions,
all within just a few seconds of what had looked like the "normal" heavy
load of pool-import processing.

What kept me reassured that something was actually happening between the
reboots was the ZDB debugger: it can report information about the pool's
metadata, including the size of the "Deferred Free" list. That list
consistently shrank between reboots, and once it was empty the system
finally worked well again.
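
Roughly what I kept re-running between reboots was something like this (from
memory, and the exact label of that list in the zdb output may differ between
releases, so treat it as a sketch only):

  # zdb -dd mypool | grep -i defer

watching the reported size shrink from one boot to the next.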

I did have some other bugs with my test pools that caused ZFS to panic,
but I can't really remember the details offhand. It might also help to
"zpool export" your problematic pool, or to remove the /etc/zfs/zpool.cache
file so the pool is not imported automatically, and then run ZDB on it with
different options (such as leak tracing) - maybe the underlying storage
(VMDKs?) lied and corrupted something badly. It might also help to try
importing the pool without its log and cache devices (with an alternate
root, "zpool import -R / ...", so the imported pool is not recorded in the
cachefile) and to discard some recent transactions - or it might not help :\
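
For example, something along these lines (sketching from memory, so please
double-check the option letters against your build's man pages before running
them):

  # zpool export pool_4k                            # only if it is currently imported
  # mv /etc/zfs/zpool.cache /etc/zfs/zpool.cache.bad
  # zpool import -f -R /a -m -o readonly=on pool_4k

where -m (if your build has it) lets the import proceed despite missing log
devices, and the alternate root keeps the pool out of the cachefile.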

HTH,
//Jim Klimov
Mika Anderson
2013-11-18 02:48:42 UTC
Thank you for your comments Jim. Please see below.
Post by Jim Klimov
Post by Mika Anderson
I am having difficulty importing a specific pool.
The pool was originally running under OpenIndiana oi_151.1.7 under ESXi
5.1 and had been stable for several months. A few weeks ago, I was
deleting a couple thousand small files on a CIFS share from a Windows
client when OpenIndiana crashed. On reboot, the system would
consistently crash unless the drives were disconnected.
Just in case: did the storage system use dedup? This could produce
a lot of load to release the files (find all deduped blocks in DDT
and reduce the counter, release blocks on reaching zero... something
here can require lots of unswappable kernel RAM).
I have not enabled dedup after reading many cautions on mailing lists and
forums.
Post by Jim Klimov
...
I did have some other bugs with my test pools that caused ZFS to
panic, but can't really remember any details quickly. Probably it
might also help to "zpool export" your problematic pool or remove
/etc/zfs/zpool.cache file to keep it from importing automatically
and run ZDB on the pool with different options (such as leak tracing) -
maybe the underlying storage (VMDKs?) lied and corrupted something
badly. It might help to try importing the pool (with alternate-root
"zpool import -R / ..." to disable logging the imported pool into
the cachefile) without its log and cache devices and discard some
recent transactions - or it might not help :\
HTH,
//Jim Klimov
In my original OI installation, I was unable to run zpool export because
the system would never successfully boot (unless I disconnected the
drives). I'm now working from a fresh OmniOS install and of course can't
export the pool because it's not active.

Can you use zdb on a pool which is not imported? If so, I'm not sure how.

Running the following command resulted in a similar immediate system panic:
***@omnios:/# zpool import -f -R /testimport pool_4k

I suspect there may be some corruption in the ZIL. I'd be willing to lose
some transactions if it would allow the pool to be imported.

Thanks again for any help.

Cheers,
Mika



Paul Kraus
2013-11-18 13:51:43 UTC
Post by Mika Anderson
Can you use zdb on a pool which is not imported? If so, I'm not sure how.
The '-e' option lets you run zdb against a non-imported zpool.
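For example (and from memory, you can add "-p <path>" if the devices live
somewhere other than /dev/dsk):

  # zdb -e pool_4k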

--
Paul Kraus
Deputy Technical Director, LoneStarCon 3
Sound Coordinator, Schenectady Light Opera Company




Mika Anderson
2013-11-18 16:48:15 UTC
Post by Mika Anderson
Can you use zdb on a pool which is not imported? If so, I'm not sure how.
The '-e' option lets you run zdb against a non-imported zpool.
Thanks for the tip Paul. Not sure how I missed this in the man page.

As per Jim's suggestion, I ran "zdb -e -L pool_4k" which returned several
thousand lines ending in:
assertion failed for thread 0xfffffd7fff162a40, thread-id 1: dn->dn_nlevels
<= 30 (0x21 <= 0x1e), file ../../../uts/common/fs/zfs/dnode.c, line 219
Abort (core dumped)

I've posted the entire thing here:
https://gist.github.com/ma1245/62890cb46c139d2b5c2d

Does anyone have insight on where I should go from here?

Cheers,
Mika



Jim Klimov
2013-11-18 17:59:22 UTC
Post by Paul Kraus
Post by Mika Anderson
Can you use zdb on a pool which is not imported? If so, I'm not sure how.
The '-e' option lets you run zdb against a non-imported zpool.
Thanks for the tip Paul. Not sure how I missed this in the man page.
As per Jim's suggestion, I ran "zdb -e -L pool_4k" which returned
dn->dn_nlevels <= 30 (0x21 <= 0x1e), file
../../../uts/common/fs/zfs/dnode.c, line 219
Abort (core dumped)
https://gist.github.com/ma1245/62890cb46c139d2b5c2d
Does anyone have insight on where I should go from here?
So far no new ideas, except: did you try the "hacks" for not aborting when
such errors are encountered? I think that would be "zdb -AAA" and/or some
kernel-side flags like "aok" (search the list archives for details).

I am not sure what that would give except for ZDB traversing somewhat
deeper into the (unreliable, probably) tree of metadata.
Also, is it possible to base your search on an earlier TXG number (-t)
to see if there is an intact metadata tree that you can revert to by
losing some transactions committed later?
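
To make those hacks concrete, roughly (again from memory, so verify the
names before relying on them):

  # echo 'aok/W 1' | mdb -kw            # kernel side: turn ASSERT failures into warnings
  # echo 'zfs_recover/W 1' | mdb -kw    # kernel side: allow some otherwise-fatal recovery paths
  # zdb -e -AAA -b pool_4k              # userland: zdb ignores its own assertions
  # zdb -e -AAA -t <earlier txg> -b pool_4k

The mdb tunables only matter for in-kernel imports; zdb itself only honours
its -A/-AA/-AAA flags.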

HTH,
//Jim
Mika Anderson
2013-11-19 03:55:32 UTC
Post by Jim Klimov
Post by Mika Anderson
As per Jim's suggestion, I ran "zdb -e -L pool_4k" which returned
dn->dn_nlevels <= 30 (0x21 <= 0x1e), file
../../../uts/common/fs/zfs/dnode.c, line 219
Abort (core dumped)
https://gist.github.com/ma1245/62890cb46c139d2b5c2d
Does anyone have insight on where I should go from here?
So far no ideas, except that did you try the "hacks" for not-aborting
on encountering such errors? I think it may be "zdb -AAA" and/or some
kernel-side flags like "aok" (search the internet archives for details).
I am not sure what that would give except for ZDB traversing somewhat
deeper into the (unreliable, probably) tree of metadata.
Also, is it possible to base your search on an earlier TXG number (-t)
to see if there is an intact metadata tree that you can revert to by
losing some transactions committed later?
Thanks for the suggestions Jim.

I’ve spent some time working with zdb and although I still don’t have a
great understanding, some patterns have emerged.

Running zdb with any of the switches -u (uberblock), -d (datasets), -i
(intent logs), -h (pool history) or -m (metaslabs) returns successfully. Example:

***@omnios:~# zdb -e -u pool_4k

Uberblock:
magic = 0000000000bab10c
version = 5000
txg = 5996179
guid_sum = 4654748689984317566
timestamp = 1382409648 UTC = Tue Oct 22 02:40:48 2013


But running zdb with either -b (block statistics) or -c (checksum metadata
blocks) fails with a similar error (note that there is more than 5TB in
this pool).
***@omnios:~# zdb -e -b pool_4k
Traversing all blocks to verify nothing leaked ...
295M completed ( 296MB/s) estimated time remaining: 7hr 04min 22sec
1.57G completed ( 813MB/s) estimated time remaining: 2hr 34min 53sec
<..snip..>
70.0G completed (4859MB/s) estimated time remaining: 0hr 25min 41sec
71.1G completed (4630MB/s) estimated time remaining: 0hr 26min 57sec
assertion failed for thread 0xfffffd7fff162a40, thread-id 1:
bp->blk_pad[0] == 0, file ../../../uts/common/fs/zfs/zio.c, line 2845
Abort (core dumped)

***@omnios:~# zdb -e -c pool_4k
Traversing all blocks to verify metadata checksums and verify nothing
leaked ...
8.91M completed ( 9MB/s) estimated time remaining: 230hr 30min 36sec

11.2M completed ( 5MB/s) estimated time remaining: 367hr 11min 40sec

<..snip..>
70.4G completed ( 748MB/s) estimated time remaining: 2hr 46min 44sec
71.8G completed ( 756MB/s) estimated time remaining: 2hr 45min 02sec
assertion failed for thread 0xfffffd7fff162a40, thread-id 1:
bp->blk_pad[0] == 0, file ../../../uts/common/fs/zfs/zio.c, line 2845
Abort (core dumped)

Running the same commands with -AAA (ignore assertions and enable panic
recovery) also fails
***@omnios:~# zdb -e -b -AAA pool_4k
Traversing all blocks to verify nothing leaked ...
294M completed ( 300MB/s) estimated time remaining: 6hr 58min 55sec
1.54G completed ( 806MB/s) estimated time remaining: 2hr 36min 14sec
<..snip..>
70.0G completed (4566MB/s) estimated time remaining: 0hr 27min 20sec
71.3G completed (4379MB/s) estimated time remaining: 0hr 28min 29sec
assertion failed for thread 0xfffffd7fff162a40, thread-id 1:
bp->blk_pad[0] == 0, file ../../../uts/common/fs/zfs/zio.c, line 2845
Abort (core dumped)

The next thing I tried was -F (attempt automatic rewind within a safe range
of transaction groups); this errored out within seconds:
***@omnios:~# zdb -e -b -AAA -F o pool_4k
assertion failed for thread 0xfffffd7ffbb0f240, thread-id 64: c <
SPA_MAXBLOCKSIZE >> SPA_MINBLOCKSHIFT, file
../../../uts/common/fs/zfs/zio.c, line 244
Abort (core dumped)

Then I tried -t (highest txg to use when searching for uberblocks) and used
the txg one less than the one listed by zdb -u. Basically the same error.
***@omnios:~# zdb -e -b -t 5996178 pool_4k
Traversing all blocks to verify nothing leaked ...
295M completed ( 298MB/s) estimated time remaining: 7hr 02min 42sec
1.80G completed ( 938MB/s) estimated time remaining: 2hr 14min 11sec
<..snip..>
70.0G completed (4552MB/s) estimated time remaining: 0hr 27min 25sec
71.8G completed (4392MB/s) estimated time remaining: 0hr 28min 24sec
assertion failed for thread 0xfffffd7fff162a40, thread-id 1:
bp->blk_pad[0] == 0, file ../../../uts/common/fs/zfs/zio.c, line 2845
Abort (core dumped)

Lastly I tried using the most recent txg shown by zdb -hh; this errored out
immediately:
***@omnios:~# zdb -e -b -t 5995690 pool_4k
Segmentation Fault (core dumped)

So at this point I’m sort of stuck again. I feel like I’m (slowly)
learning but clearly still have a long way to go. I’d still really like
to recover the pool, and I feel that it’s possible, but I’m not sure what
the next steps should be.

If anyone has any insight or can make some more suggestions on what to try,
I’d be very grateful. Also, if anyone has cautions about things which may
further damage the pool, those would be appreciated.

Cheers,
Mika



Mika Anderson
2013-12-30 23:06:13 UTC
Sorry to revive this thread. I now have some time to look at this issue
and this pool is still a problem.

As a recap, this issue started when deleting about 2500 small files from a
CIFS share over the network. About 2/3rds of the way through the delete,
OpenIndiana crashed and went into a reboot loop -- each time it rebooted,
it crashed. To troubleshoot the pool, I installed OmniOS and worked with
zpool and zdb to gather information. Any attempt to import the pool (even
with -o readonly=on) fails with a panic. Several zdb commands also cause a
panic.

Is anyone interested in looking at the crash dumps from a zpool import
command?

I created a panic on omnios-6de5e81 (2013.11.27) by running zpool import -f
pool_4k.
Below is the result of the fmdump command I was prompted to run upon
reboot. The attached crash.2 was created by following the directions here:
http://wiki.illumos.org/display/illumos/How+To+Report+Problems

***@OmniOS:/var/crash/unknown# fmdump -Vp -u 9fe80747-70f8-6cc0-d076-f86de1d80830
TIME                           UUID                                 SUNW-MSG-ID
Dec 30 2013 15:21:43.487897000 9fe80747-70f8-6cc0-d076-f86de1d80830 SUNOS-8000-KL

TIME                 CLASS                                         ENA
Dec 30 15:21:43.4854 ireport.os.sunos.panic.dump_available         0x0000000000000000
Dec 30 15:21:40.9688 ireport.os.sunos.panic.dump_pending_on_device 0x0000000000000000

nvlist version: 0
version = 0x0
class = list.suspect
uuid = 9fe80747-70f8-6cc0-d076-f86de1d80830
code = SUNOS-8000-KL
diag-time = 1388442103 486125
de = fmd:///module/software-diagnosis
fault-list-sz = 0x1
fault-list = (array of embedded nvlists)
(start fault-list[0])
nvlist version: 0
version = 0x0
class = defect.sunos.kernel.panic
certainty = 0x64
asru =
sw:///:path=/var/crash/unknown/.9fe80747-70f8-6cc0-d076-f86de1d80830
resource =
sw:///:path=/var/crash/unknown/.9fe80747-70f8-6cc0-d076-f86de1d80830
savecore-succcess = 1
dump-dir = /var/crash/unknown
dump-files = vmdump.2
os-instance-uuid = 9fe80747-70f8-6cc0-d076-f86de1d80830
panicstr = BAD TRAP: type=d (#gp General protection)
rp=ffffff001097bdd0 addr=0
panicstack = unix:real_mode_stop_cpu_stage2_end+9de3 () |
unix:trap+a30 () | unix:cmntrap+e6 () | unix:mutex_enter+b () |
zfs:zio_buf_alloc+25 () | zfs:arc_get_data_buf+1d0 () |
zfs:arc_buf_alloc+b5 () | zfs:arc_read+42b () | zfs:traverse_prefetcher+105
() | zfs:traverse_visitbp+271 () | zfs:traverse_dnode+8b () |
zfs:traverse_visitbp+536 () | zfs:traverse_visitbp+3fa () |
zfs:traverse_visitbp+3fa () | zfs:traverse_visitbp+3fa () |
zfs:traverse_visitbp+3fa () | zfs:traverse_visitbp+3fa () |
zfs:traverse_visitbp+3fa () | zfs:traverse_dnode+8b () |
zfs:traverse_visitbp+5fd () | zfs:traverse_prefetch_thread+79 () |
genunix:taskq_d_thread+b7 () | unix:thread_start+8 () |
crashtime = 1388442073
panic-time = Mon Dec 30 15:21:13 2013 MST
(end fault-list[0])

fault-status = 0x1
severity = Major
__ttl = 0x1
__tod = 0x52c1f1f7 0x1d14b7a8

***@OmniOS:/var/crash/unknown#


I'm still interested in recovering the pool if possible but am prepared to
rebuild it if necessary.

Thanks and happy holidays!

Mika
Post by Mika Anderson
Post by Jim Klimov
Post by Mika Anderson
As per Jim's suggestion, I ran "zdb -e -L pool_4k" which returned
dn->dn_nlevels <= 30 (0x21 <= 0x1e), file
../../../uts/common/fs/zfs/dnode.c, line 219
Abort (core dumped)
https://gist.github.com/ma1245/62890cb46c139d2b5c2d
Does anyone have insight on where I should go from here?
So far no ideas, except that did you try the "hacks" for not-aborting
on encountering such errors? I think it may be "zdb -AAA" and/or some
kernel-side flags like "aok" (search the internet archives for details).
I am not sure what that would give except for ZDB traversing somewhat
deeper into the (unreliable, probably) tree of metadata.
Also, is it possible to base your search on an earlier TXG number (-t)
to see if there is an intact metadata tree that you can revert to by
losing some transactions committed later?
Thanks for the suggestions Jim.
I’ve spent some time working with zdb and although I still don’t have a
great understanding, some patterns have emerged.
Running zdb with switches -u (uberblock), -d (datasets), -i (intent logs)
magic = 0000000000bab10c
version = 5000
txg = 5996179
guid_sum = 4654748689984317566
timestamp = 1382409648 UTC = Tue Oct 22 02:40:48 2013
But running zdb with either -b (block statistics) or -c (checksum metadata
blocks) both fail with similar errors (note that there is more than 5TB in
this pool).
Traversing all blocks to verify nothing leaked ...
295M completed ( 296MB/s) estimated time remaining: 7hr 04min 22sec
1.57G completed ( 813MB/s) estimated time remaining: 2hr 34min 53sec
<..snip..>
70.0G completed (4859MB/s) estimated time remaining: 0hr 25min 41sec
71.1G completed (4630MB/s) estimated time remaining: 0hr 26min 57sec
bp->blk_pad[0] == 0, file ../../../uts/common/fs/zfs/zio.c, line 2845
Abort (core dumped)
Traversing all blocks to verify metadata checksums and verify nothing
leaked ...
8.91M completed ( 9MB/s) estimated time remaining: 230hr 30min 36sec
11.2M completed ( 5MB/s) estimated time remaining: 367hr 11min 40sec
<..snip..>
70.4G completed ( 748MB/s) estimated time remaining: 2hr 46min 44sec
71.8G completed ( 756MB/s) estimated time remaining: 2hr 45min 02sec
bp->blk_pad[0] == 0, file ../../../uts/common/fs/zfs/zio.c, line 2845
Abort (core dumped)
Running the same commands with -AAA (ignore assertions and enable panic
recovery) also fails
Traversing all blocks to verify nothing leaked ...
294M completed ( 300MB/s) estimated time remaining: 6hr 58min 55sec
1.54G completed ( 806MB/s) estimated time remaining: 2hr 36min 14sec
<..snip..>
70.0G completed (4566MB/s) estimated time remaining: 0hr 27min 20sec
71.3G completed (4379MB/s) estimated time remaining: 0hr 28min 29sec
bp->blk_pad[0] == 0, file ../../../uts/common/fs/zfs/zio.c, line 2845
Abort (core dumped)
Next thing I tried was with -F (attempt automatic rewind within safe range
assertion failed for thread 0xfffffd7ffbb0f240, thread-id 64: c <
SPA_MAXBLOCKSIZE >> SPA_MINBLOCKSHIFT, file
../../../uts/common/fs/zfs/zio.c, line 244
Abort (core dumped)
Then I tried -t (highest txg to use when searching for uberblocks) and
took the one less than the txg listed by zdb -u. Basically the same error.
Traversing all blocks to verify nothing leaked ...
295M completed ( 298MB/s) estimated time remaining: 7hr 02min 42sec
1.80G completed ( 938MB/s) estimated time remaining: 2hr 14min 11sec
<..snip..>
70.0G completed (4552MB/s) estimated time remaining: 0hr 27min 25sec
71.8G completed (4392MB/s) estimated time remaining: 0hr 28min 24sec
bp->blk_pad[0] == 0, file ../../../uts/common/fs/zfs/zio.c, line 2845
Abort (core dumped)
Lastly I tried using the most recent txg that was shown when doing zdb
Segmentation Fault (core dumped)
So at this point I’m sort of stuck again. I feel like I’m (slowly)
learning but clearly still have a long ways to go. I’d still really like
to recover the pool, and I feel that it’s possible but I’m not sure what
the next steps should be.
If anyone has any insight or can make some more suggestions on what to
try, I’d be very gracious. Also, if anyone has cautions on things which
may further damage the pool, that would be appreciated.
Cheers,
Mika
surya
2013-12-31 08:34:02 UTC
What does the panic stack look like when you attempt to import?
An arc_get_data_buf()->zio_buf_alloc()->mutex_enter() panic doesn't seem
to indicate any on-disk issue - it looks more like an in-kernel race. I would
also try offlining all but one CPU and retrying the import.
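(On a guest with two vCPUs that should be as simple as "psradm -f 1" to take
CPU 1 offline, if I remember the syntax correctly.)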
-surya
Post by Mika Anderson
Sorry to revive this thread. I now have some time to look at this
issue and this pool is still a problem.
As a recap, this issue started when deleting about 2500 small files
from CIFS share over the network. About 2/3rds of the way through the
delete, OpenIndiana crashed and went into a reboot loop -- each time
it rebooted, it crashed. To troubleshoot the pool, I installed OmniOS
and worked with zpool and zdb gathering information. Any attempt to
mount the pool (even with -o readonly=on) fails with a panic. Several
zdb commands also cause a panic.
Is anyone interested in looking at the crash dumps from an zpool
import command?
I created a panic omnios-6de5e81 2013.11.27 by running zpool import -f
pool_4k
Below is the result of the fmdump command I was prompted to run upon
reboot. The attached crash.2 was created by following the directions
here: http://wiki.illumos.org/display/illumos/How+To+Report+Problems
9fe80747-70f8-6cc0-d076-f86de1d808
30
TIME UUID SUNW-MSG-ID
Dec 30 2013 15:21:43.487897000 9fe80747-70f8-6cc0-d076-f86de1d80830
SUNOS-8000-KL
TIME CLASS ENA
Dec 30 15:21:43.4854 ireport.os.sunos.panic.dump_available
0x0000000000000000
Dec 30 15:21:40.9688 ireport.os.sunos.panic.dump_pending_on_device
0x0000000000000000
nvlist version: 0
version = 0x0
class = list.suspect
uuid = 9fe80747-70f8-6cc0-d076-f86de1d80830
code = SUNOS-8000-KL
diag-time = 1388442103 486125
de = fmd:///module/software-diagnosis
fault-list-sz = 0x1
fault-list = (array of embedded nvlists)
(start fault-list[0])
nvlist version: 0
version = 0x0
class = defect.sunos.kernel.panic
certainty = 0x64
asru =
sw:///:path=/var/crash/unknown/.9fe80747-70f8-6cc0-d076-f86de1d80830
resource =
sw:///:path=/var/crash/unknown/.9fe80747-70f8-6cc0-d076-f86de1d80830
savecore-succcess = 1
dump-dir = /var/crash/unknown
dump-files = vmdump.2
os-instance-uuid = 9fe80747-70f8-6cc0-d076-f86de1d80830
panicstr = BAD TRAP: type=d (#gp General protection)
rp=ffffff001097bdd0 addr=0
panicstack = unix:real_mode_stop_cpu_stage2_end+9de3
() | unix:trap+a30 () | unix:cmntrap+e6 () | unix:mutex_enter+b () |
zfs:zio_buf_alloc+25 () | zfs:arc_get_data_buf+1d0 () |
zfs:arc_buf_alloc+b5 () | zfs:arc_read+42b () |
zfs:traverse_prefetcher+105 () | zfs:traverse_visitbp+271 () |
zfs:traverse_dnode+8b () | zfs:traverse_visitbp+536 () |
zfs:traverse_visitbp+3fa () | zfs:traverse_visitbp+3fa () |
zfs:traverse_visitbp+3fa () | zfs:traverse_visitbp+3fa () |
zfs:traverse_visitbp+3fa () | zfs:traverse_visitbp+3fa () |
zfs:traverse_dnode+8b () | zfs:traverse_visitbp+5fd () |
zfs:traverse_prefetch_thread+79 () | genunix:taskq_d_thread+b7 () |
unix:thread_start+8 () |
crashtime = 1388442073
panic-time = Mon Dec 30 15:21:13 2013 MST
(end fault-list[0])
fault-status = 0x1
severity = Major
__ttl = 0x1
__tod = 0x52c1f1f7 0x1d14b7a8
I'm still interested in recovering the pool if possible but am
prepared to rebuild it if necessary.
Thanks and happy holidays!
Mika
On Mon, Nov 18, 2013 at 8:55 PM, Mika Anderson
As per Jim's suggestion, I ran "zdb -e -L pool_4k" which returned
dn->dn_nlevels <= 30 (0x21 <= 0x1e), file
../../../uts/common/fs/zfs/dnode.c, line 219
Abort (core dumped)
https://gist.github.com/ma1245/62890cb46c139d2b5c2d
Does anyone have insight on where I should go from here?
So far no ideas, except that did you try the "hacks" for not-aborting
on encountering such errors? I think it may be "zdb -AAA" and/or some
kernel-side flags like "aok" (search the internet archives for details).
I am not sure what that would give except for ZDB traversing somewhat
deeper into the (unreliable, probably) tree of metadata.
Also, is it possible to base your search on an earlier TXG number (-t)
to see if there is an intact metadata tree that you can revert to by
losing some transactions committed later?
Thanks for the suggestions Jim.
I’ve spent some time working with zdb and although I still don’t
have a great understanding, some patterns have emerged.
Running zdb with switches -u (uberblock), -d (datasets), -i
(intent logs) -h (pool history) -m (metaslabs) all return
magic = 0000000000bab10c
version = 5000
txg = 5996179
guid_sum = 4654748689984317566
timestamp = 1382409648 UTC = Tue Oct 22 02:40:48 2013
But running zdb with either -b (block statistics) or -c (checksum
metadata blocks) both fail with similar errors (note that there is
more than 5TB in this pool).
Traversing all blocks to verify nothing leaked ...
295M completed ( 296MB/s) estimated time remaining: 7hr 04min 22sec
1.57G completed ( 813MB/s) estimated time remaining: 2hr 34min 53sec
<..snip..>
70.0G completed (4859MB/s) estimated time remaining: 0hr 25min 41sec
71.1G completed (4630MB/s) estimated time remaining: 0hr 26min
57sec assertion failed for thread 0xfffffd7fff162a40,
thread-id 1: bp->blk_pad[0] == 0, file
../../../uts/common/fs/zfs/zio.c, line 2845
Abort (core dumped)
Traversing all blocks to verify metadata checksums and verify
nothing leaked ...
8.91M completed ( 9MB/s) estimated time remaining: 230hr 30min 36sec
11.2M completed ( 5MB/s) estimated time remaining: 367hr 11min 40sec
<..snip..>
70.4G completed ( 748MB/s) estimated time remaining: 2hr 46min 44sec
71.8G completed ( 756MB/s) estimated time remaining: 2hr 45min
02sec assertion failed for thread 0xfffffd7fff162a40,
thread-id 1: bp->blk_pad[0] == 0, file
../../../uts/common/fs/zfs/zio.c, line 2845
Abort (core dumped)
Running the same commands with -AAA (ignore assertions and enable
panic recovery) also fails
Traversing all blocks to verify nothing leaked ...
294M completed ( 300MB/s) estimated time remaining: 6hr 58min 55sec
1.54G completed ( 806MB/s) estimated time remaining: 2hr 36min 14sec
<..snip..>
70.0G completed (4566MB/s) estimated time remaining: 0hr 27min 20sec
71.3G completed (4379MB/s) estimated time remaining: 0hr 28min
29sec assertion failed for thread 0xfffffd7fff162a40,
thread-id 1: bp->blk_pad[0] == 0, file
../../../uts/common/fs/zfs/zio.c, line 2845
Abort (core dumped)
Next thing I tried was with -F (attempt automatic rewind within
assertion failed for thread 0xfffffd7ffbb0f240, thread-id 64: c <
SPA_MAXBLOCKSIZE >> SPA_MINBLOCKSHIFT, file
../../../uts/common/fs/zfs/zio.c, line 244
Abort (core dumped)
Then I tried -t (highest txg to use when searching for uberblocks)
and took the one less than the txg listed by zdb -u. Basically
the same error.
Traversing all blocks to verify nothing leaked ...
295M completed ( 298MB/s) estimated time remaining: 7hr 02min 42sec
1.80G completed ( 938MB/s) estimated time remaining: 2hr 14min 11sec
<..snip..>
70.0G completed (4552MB/s) estimated time remaining: 0hr 27min 25sec
71.8G completed (4392MB/s) estimated time remaining: 0hr 28min
24sec assertion failed for thread 0xfffffd7fff162a40,
thread-id 1: bp->blk_pad[0] == 0, file
../../../uts/common/fs/zfs/zio.c, line 2845
Abort (core dumped)
Lastly I tried using the most recent txg that was shown when doing
Segmentation Fault (core dumped)
So at this point I’m sort of stuck again. I feel like I’m
(slowly) learning but clearly still have a long ways to go. I’d
still really like to recover the pool, and I feel that it’s
possible but I’m not sure what the next steps should be.
If anyone has any insight or can make some more suggestions on
what to try, I’d be very gracious. Also, if anyone has cautions
on things which may further damage the pool, that would be
appreciated.
Cheers,
Mika
George Wilson
2013-12-31 14:55:04 UTC
The stack below looks very strange since the top two frames don't seem
to be valid (mutex_enter never calls into vdev code).

ffffff0008a25c40 fffffffffbc2eb80                0   0  60           0
  PC: panicsys+0x109    TASKQ: system_taskq
stack pointer for thread ffffff0008a25c40: ffffff0008a247f0
  vdev_queue_class_to_issue+0xea(ffffff01d84998f8)
  vdev_queue_io_to_issue+0xa1(fffffffffb934de8)
  mutex_enter+0xb()
  zio_buf_alloc+0x25(371600)
  arc_get_data_buf+0x1d0(ffffff01d9cc8828)
  arc_buf_alloc+0xb5(ffffff01d5cdfb00, 371600, 0, 1)
  arc_read+0x42b(0, ffffff01d5cdfb00, ffffff01d9e57cc0, 0, 0, 2,
    ffffff00000000c0, ffffff0008a250fc, ffffff0008a25270)
  traverse_prefetcher+0x105(ffffff01d5cdfb00, 0, ffffff01d9e57cc0,
    ffffff0008a25270, ffffff01d9e57c00, ffffff00085ca620)
  traverse_visitbp+0x271(ffffff0008a25b00, ffffff01d9e57c00, ffffff01d9e57cc0, ffffff0008a25270)
  traverse_dnode+0x8b(ffffff0008a25b00, ffffff01d9e57c00, ab, 1b8e)
  traverse_visitbp+0x536(ffffff0008a25b00, ffffff01d75d4000, ffffff01d9d9ae00, ffffff0008a25440)
  traverse_visitbp+0x3fa(ffffff0008a25b00, ffffff01d75d4000, ffffff01d9d9c080, ffffff0008a25530)
  traverse_visitbp+0x3fa(ffffff0008a25b00, ffffff01d75d4000, ffffff01d9db4000, ffffff0008a25620)
  traverse_visitbp+0x3fa(ffffff0008a25b00, ffffff01d75d4000, ffffff01d9d88000, ffffff0008a25710)
  traverse_visitbp+0x3fa(ffffff0008a25b00, ffffff01d75d4000, ffffff01d9ce5000, ffffff0008a25800)
  traverse_visitbp+0x3fa(ffffff0008a25b00, ffffff01d75d4000, ffffff01d9ce9000, ffffff0008a258f0)
  traverse_visitbp+0x3fa(ffffff0008a25b00, ffffff01d75d4000, ffffff01d75d4040, ffffff0008a25990)
  traverse_dnode+0x8b(ffffff0008a25b00, ffffff01d75d4000, ab, 0)
  traverse_visitbp+0x5fd(ffffff0008a25b00, 0, ffffff01d93e6e80, ffffff0008a25b50)
  traverse_prefetch_thread+0x79(ffffff00085ca5b0)
  taskq_d_thread+0xb7(ffffff01cf24e860)
  thread_start+8()

If we ignore those frames, then I would guess that somehow
zio_buf_alloc() was called with a size > 128K, and the zio_buf_cache[c]
lookup walked off into some invalid piece of memory and passed that to
kmem_cache_alloc(). This would imply that a block on disk has an invalid
size but a correct checksum. It would be interesting to take a closer look
at this pool to see what types of corruption exist.
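
For reference, the allocation path is roughly the following (paraphrased from
memory, not the literal source):

    void *
    zio_buf_alloc(size_t size)
    {
            size_t c = (size - 1) >> SPA_MINBLOCKSHIFT;

            /*
             * Compiled out on a non-DEBUG kernel, so an oversized bp
             * indexes straight past the end of zio_buf_cache[].
             */
            ASSERT(c < SPA_MAXBLOCKSIZE >> SPA_MINBLOCKSHIFT);

            return (kmem_cache_alloc(zio_buf_cache[c], KM_PUSHPAGE));
    }

With the 0x371600 size seen in the stack, c lands far outside the array, which
would explain both the garbage "vdev" frames and the #gp inside mutex_enter().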

- George
Post by surya
What does the panic stack look like when you attempt ro import?
arc_get_data_buf()->zio_buf_alloc()->mutex_enter() panic doesn't seem
to indicate any onDisk issue - but more an in-kernel race. I would also try
to offline all but 1-cpu and retry the import.
-surya
Post by Mika Anderson
Sorry to revive this thread. I now have some time to look at this
issue and this pool is still a problem.
As a recap, this issue started when deleting about 2500 small files
from CIFS share over the network. About 2/3rds of the way through
the delete, OpenIndiana crashed and went into a reboot loop -- each
time it rebooted, it crashed. To troubleshoot the pool, I installed
OmniOS and worked with zpool and zdb gathering information. Any
attempt to mount the pool (even with -o readonly=on) fails with a
panic. Several zdb commands also cause a panic.
Is anyone interested in looking at the crash dumps from an zpool
import command?
I created a panic omnios-6de5e81 2013.11.27 by running zpool import
-f pool_4k
Below is the result of the fmdump command I was prompted to run upon
reboot. The attached crash.2 was created by following the directions
here: http://wiki.illumos.org/display/illumos/How+To+Report+Problems
9fe80747-70f8-6cc0-d076-f86de1d808
30
TIME UUID SUNW-MSG-ID
Dec 30 2013 15:21:43.487897000 9fe80747-70f8-6cc0-d076-f86de1d80830
SUNOS-8000-KL
TIME CLASS ENA
Dec 30 15:21:43.4854 ireport.os.sunos.panic.dump_available
0x0000000000000000
Dec 30 15:21:40.9688 ireport.os.sunos.panic.dump_pending_on_device
0x0000000000000000
nvlist version: 0
version = 0x0
class = list.suspect
uuid = 9fe80747-70f8-6cc0-d076-f86de1d80830
code = SUNOS-8000-KL
diag-time = 1388442103 486125
de = fmd:///module/software-diagnosis
fault-list-sz = 0x1
fault-list = (array of embedded nvlists)
(start fault-list[0])
nvlist version: 0
version = 0x0
class = defect.sunos.kernel.panic
certainty = 0x64
asru =
sw:///:path=/var/crash/unknown/.9fe80747-70f8-6cc0-d076-f86de1d80830
resource =
sw:///:path=/var/crash/unknown/.9fe80747-70f8-6cc0-d076-f86de1d80830
savecore-succcess = 1
dump-dir = /var/crash/unknown
dump-files = vmdump.2
os-instance-uuid = 9fe80747-70f8-6cc0-d076-f86de1d80830
panicstr = BAD TRAP: type=d (#gp General protection)
rp=ffffff001097bdd0 addr=0
panicstack = unix:real_mode_stop_cpu_stage2_end+9de3
() | unix:trap+a30 () | unix:cmntrap+e6 () | unix:mutex_enter+b () |
zfs:zio_buf_alloc+25 () | zfs:arc_get_data_buf+1d0 () |
zfs:arc_buf_alloc+b5 () | zfs:arc_read+42b () |
zfs:traverse_prefetcher+105 () | zfs:traverse_visitbp+271 () |
zfs:traverse_dnode+8b () | zfs:traverse_visitbp+536 () |
zfs:traverse_visitbp+3fa () | zfs:traverse_visitbp+3fa () |
zfs:traverse_visitbp+3fa () | zfs:traverse_visitbp+3fa () |
zfs:traverse_visitbp+3fa () | zfs:traverse_visitbp+3fa () |
zfs:traverse_dnode+8b () | zfs:traverse_visitbp+5fd () |
zfs:traverse_prefetch_thread+79 () | genunix:taskq_d_thread+b7 () |
unix:thread_start+8 () |
crashtime = 1388442073
panic-time = Mon Dec 30 15:21:13 2013 MST
(end fault-list[0])
fault-status = 0x1
severity = Major
__ttl = 0x1
__tod = 0x52c1f1f7 0x1d14b7a8
I'm still interested in recovering the pool if possible but am
prepared to rebuild it if necessary.
Thanks and happy holidays!
Mika
On Mon, Nov 18, 2013 at 8:55 PM, Mika Anderson
As per Jim's suggestion, I ran "zdb -e -L pool_4k" which returned
dn->dn_nlevels <= 30 (0x21 <= 0x1e), file
../../../uts/common/fs/zfs/dnode.c, line 219
Abort (core dumped)
https://gist.github.com/ma1245/62890cb46c139d2b5c2d
Does anyone have insight on where I should go from here?
So far no ideas, except that did you try the "hacks" for not-aborting
on encountering such errors? I think it may be "zdb -AAA" and/or some
kernel-side flags like "aok" (search the internet archives for details).
I am not sure what that would give except for ZDB traversing somewhat
deeper into the (unreliable, probably) tree of metadata.
Also, is it possible to base your search on an earlier TXG number (-t)
to see if there is an intact metadata tree that you can revert to by
losing some transactions committed later?
Thanks for the suggestions Jim.
I’ve spent some time working with zdb and although I still don’t
have a great understanding, some patterns have emerged.
Running zdb with switches -u (uberblock), -d (datasets), -i
(intent logs) -h (pool history) -m (metaslabs) all return
magic = 0000000000bab10c
version = 5000
txg = 5996179
guid_sum = 4654748689984317566
timestamp = 1382409648 UTC = Tue Oct 22 02:40:48 2013
But running zdb with either -b (block statistics) or -c (checksum
metadata blocks) both fail with similar errors (note that there
is more than 5TB in this pool).
Traversing all blocks to verify nothing leaked ...
295M completed ( 296MB/s) estimated time remaining: 7hr 04min 22sec
1.57G completed ( 813MB/s) estimated time remaining: 2hr 34min 53sec
<..snip..>
70.0G completed (4859MB/s) estimated time remaining: 0hr 25min 41sec
71.1G completed (4630MB/s) estimated time remaining: 0hr 26min
57sec assertion failed for thread 0xfffffd7fff162a40,
thread-id 1: bp->blk_pad[0] == 0, file
../../../uts/common/fs/zfs/zio.c, line 2845
Abort (core dumped)
Traversing all blocks to verify metadata checksums and verify
nothing leaked ...
8.91M completed ( 9MB/s) estimated time remaining: 230hr 30min 36sec
11.2M completed ( 5MB/s) estimated time remaining: 367hr 11min 40sec
<..snip..>
70.4G completed ( 748MB/s) estimated time remaining: 2hr 46min 44sec
71.8G completed ( 756MB/s) estimated time remaining: 2hr 45min
02sec assertion failed for thread 0xfffffd7fff162a40,
thread-id 1: bp->blk_pad[0] == 0, file
../../../uts/common/fs/zfs/zio.c, line 2845
Abort (core dumped)
Running the same commands with -AAA (ignore assertions and enable
panic recovery) also fails
Traversing all blocks to verify nothing leaked ...
294M completed ( 300MB/s) estimated time remaining: 6hr 58min 55sec
1.54G completed ( 806MB/s) estimated time remaining: 2hr 36min 14sec
<..snip..>
70.0G completed (4566MB/s) estimated time remaining: 0hr 27min 20sec
71.3G completed (4379MB/s) estimated time remaining: 0hr 28min
29sec assertion failed for thread 0xfffffd7fff162a40,
thread-id 1: bp->blk_pad[0] == 0, file
../../../uts/common/fs/zfs/zio.c, line 2845
Abort (core dumped)
Next thing I tried was with -F (attempt automatic rewind within
assertion failed for thread 0xfffffd7ffbb0f240, thread-id 64: c <
SPA_MAXBLOCKSIZE >> SPA_MINBLOCKSHIFT, file
../../../uts/common/fs/zfs/zio.c, line 244
Abort (core dumped)
Then I tried -t (highest txg to use when searching for
uberblocks) and took the one less than the txg listed by zdb -u.
Basically the same error.
Traversing all blocks to verify nothing leaked ...
295M completed ( 298MB/s) estimated time remaining: 7hr 02min 42sec
1.80G completed ( 938MB/s) estimated time remaining: 2hr 14min 11sec
<..snip..>
70.0G completed (4552MB/s) estimated time remaining: 0hr 27min 25sec
71.8G completed (4392MB/s) estimated time remaining: 0hr 28min
24sec assertion failed for thread 0xfffffd7fff162a40,
thread-id 1: bp->blk_pad[0] == 0, file
../../../uts/common/fs/zfs/zio.c, line 2845
Abort (core dumped)
Lastly I tried using the most recent txg that was shown when
Segmentation Fault (core dumped)
So at this point I’m sort of stuck again. I feel like I’m
(slowly) learning but clearly still have a long ways to go. I’d
still really like to recover the pool, and I feel that it’s
possible but I’m not sure what the next steps should be.
If anyone has any insight or can make some more suggestions on
what to try, I’d be very gracious. Also, if anyone has cautions
on things which may further damage the pool, that would be
appreciated.
Cheers,
Mika
Mika Anderson
2014-01-01 23:36:19 UTC
Surya, thank you for this information. Here is what I tried:

OmniOS 5.11 omnios-6de5e81 2013.11.27
***@OmniOS:~# psradm -f 1
***@OmniOS:~# psrinfo
0 on-line since 01/01/2014 16:02:53
1 off-line since 01/01/2014 16:17:45
***@OmniOS:~# zpool import -f -o readonly=on pool_4k

This resulted in a panic. After rebooting I did the following:

***@OmniOS:/var/crash/unknown# pfexec savecore -vf vmdump.4
savecore: System dump time: Wed Jan 1 16:19:02 2014
savecore: saving system crash dump in /var/crash/unknown/{unix,vmcore}.4
Constructing namelist /var/crash/unknown/unix.4
Constructing corefile /var/crash/unknown/vmcore.4
0:01 100% done: 143006 of 143006 pages saved
2030 (1%) zero pages were not written
0:01 dump decompress is done
***@OmniOS:/var/crash/unknown# echo '::panicinfo\n::cpuinfo -v\n::threadlist -v 10\n::msgbuf\n*panic_thread::findstack -v\n::stacks' | mdb 4 > crash.4
***@OmniOS:/var/crash/unknown# fmdump -Vp -u ff6bdbfb-9b00-ce8a-f3c5-d843828b567a
TIME                           UUID                                 SUNW-MSG-ID
Jan 01 2014 16:19:47.050511000 ff6bdbfb-9b00-ce8a-f3c5-d843828b567a SUNOS-8000-KL

TIME                 CLASS                                         ENA
Jan 01 16:19:47.0489 ireport.os.sunos.panic.dump_available         0x0000000000000000
Jan 01 16:19:44.2099 ireport.os.sunos.panic.dump_pending_on_device 0x0000000000000000

nvlist version: 0
version = 0x0
class = list.suspect
uuid = ff6bdbfb-9b00-ce8a-f3c5-d843828b567a
code = SUNOS-8000-KL
diag-time = 1388618387 49576
de = fmd:///module/software-diagnosis
fault-list-sz = 0x1
fault-list = (array of embedded nvlists)
(start fault-list[0])
nvlist version: 0
version = 0x0
class = defect.sunos.kernel.panic
certainty = 0x64
asru =
sw:///:path=/var/crash/unknown/.ff6bdbfb-9b00-ce8a-f3c5-d843828b567a
resource =
sw:///:path=/var/crash/unknown/.ff6bdbfb-9b00-ce8a-f3c5-d843828b567a
savecore-succcess = 1
dump-dir = /var/crash/unknown
dump-files = vmdump.4
os-instance-uuid = ff6bdbfb-9b00-ce8a-f3c5-d843828b567a
panicstr = BAD TRAP: type=d (#gp General protection)
rp=ffffff000f7e3dd0 addr=0
panicstack = unix:real_mode_stop_cpu_stage2_end+9de3 () |
unix:trap+a30 () | unix:cmntrap+e6 () | unix:mutex_enter+b () |
zfs:zio_buf_alloc+25 () | zfs:arc_get_data_buf+1d0 () |
zfs:arc_buf_alloc+b5 () | zfs:arc_read+42b () | zfs:traverse_prefetcher+105
() | zfs:traverse_visitbp+271 () | zfs:traverse_dnode+8b () |
zfs:traverse_visitbp+536 () | zfs:traverse_visitbp+3fa () |
zfs:traverse_visitbp+3fa () | zfs:traverse_visitbp+3fa () |
zfs:traverse_visitbp+3fa () | zfs:traverse_visitbp+3fa () |
zfs:traverse_visitbp+3fa () | zfs:traverse_dnode+8b () |
zfs:traverse_visitbp+5fd () | zfs:traverse_prefetch_thread+79 () |
genunix:taskq_d_thread+b7 () | unix:thread_start+8 () |
crashtime = 1388618342
panic-time = Wed Jan 1 16:19:02 2014 MST
(end fault-list[0])

fault-status = 0x1
severity = Major
__ttl = 0x1
__tod = 0x52c4a293 0x302bc98


Interpreting these messages is far beyond my knowledge.

Cheers.

Mika
Post by surya
What does the panic stack look like when you attempt ro import?
arc_get_data_buf()->zio_buf_alloc()->mutex_enter() panic doesn't seem
to indicate any onDisk issue - but more an in-kernel race. I would also try
to offline all but 1-cpu and retry the import.
-surya
Sorry to revive this thread. I now have some time to look at this issue
and this pool is still a problem.
As a recap, this issue started when deleting about 2500 small files from
CIFS share over the network. About 2/3rds of the way through the delete,
OpenIndiana crashed and went into a reboot loop -- each time it rebooted,
it crashed. To troubleshoot the pool, I installed OmniOS and worked with
zpool and zdb gathering information. Any attempt to mount the pool (even
with -o readonly=on) fails with a panic. Several zdb commands also cause a
panic.
Is anyone interested in looking at the crash dumps from an zpool import
command?
I created a panic omnios-6de5e81 2013.11.27 by running zpool import -f
pool_4k
Below is the result of the fmdump command I was prompted to run upon
http://wiki.illumos.org/display/illumos/How+To+Report+Problems
9fe80747-70f8-6cc0-d076-f86de1d808
30
TIME UUID
SUNW-MSG-ID
Dec 30 2013 15:21:43.487897000 9fe80747-70f8-6cc0-d076-f86de1d80830
SUNOS-8000-KL
TIME CLASS ENA
Dec 30 15:21:43.4854 ireport.os.sunos.panic.dump_available
0x0000000000000000
Dec 30 15:21:40.9688 ireport.os.sunos.panic.dump_pending_on_device
0x0000000000000000
nvlist version: 0
version = 0x0
class = list.suspect
uuid = 9fe80747-70f8-6cc0-d076-f86de1d80830
code = SUNOS-8000-KL
diag-time = 1388442103 486125
de = fmd:///module/software-diagnosis
fault-list-sz = 0x1
fault-list = (array of embedded nvlists)
(start fault-list[0])
nvlist version: 0
version = 0x0
class = defect.sunos.kernel.panic
certainty = 0x64
asru =
sw:///:path=/var/crash/unknown/.9fe80747-70f8-6cc0-d076-f86de1d80830
resource =
sw:///:path=/var/crash/unknown/.9fe80747-70f8-6cc0-d076-f86de1d80830
savecore-succcess = 1
dump-dir = /var/crash/unknown
dump-files = vmdump.2
os-instance-uuid = 9fe80747-70f8-6cc0-d076-f86de1d80830
panicstr = BAD TRAP: type=d (#gp General protection)
rp=ffffff001097bdd0 addr=0
panicstack = unix:real_mode_stop_cpu_stage2_end+9de3 () |
unix:trap+a30 () | unix:cmntrap+e6 () | unix:mutex_enter+b () |
zfs:zio_buf_alloc+25 () | zfs:arc_get_data_buf+1d0 () |
zfs:arc_buf_alloc+b5 () | zfs:arc_read+42b () | zfs:traverse_prefetcher+105
() | zfs:traverse_visitbp+271 () | zfs:traverse_dnode+8b () |
zfs:traverse_visitbp+536 () | zfs:traverse_visitbp+3fa () |
zfs:traverse_visitbp+3fa () | zfs:traverse_visitbp+3fa () |
zfs:traverse_visitbp+3fa () | zfs:traverse_visitbp+3fa () |
zfs:traverse_visitbp+3fa () | zfs:traverse_dnode+8b () |
zfs:traverse_visitbp+5fd () | zfs:traverse_prefetch_thread+79 () |
genunix:taskq_d_thread+b7 () | unix:thread_start+8 () |
crashtime = 1388442073
panic-time = Mon Dec 30 15:21:13 2013 MST
(end fault-list[0])
fault-status = 0x1
severity = Major
__ttl = 0x1
__tod = 0x52c1f1f7 0x1d14b7a8
I'm still interested in recovering the pool if possible but am prepared to
rebuild it if necessary.
Thanks and happy holidays!
Mika
Post by Mika Anderson
Post by Jim Klimov
Post by Mika Anderson
As per Jim's suggestion, I ran "zdb -e -L pool_4k" which returned
dn->dn_nlevels <= 30 (0x21 <= 0x1e), file
../../../uts/common/fs/zfs/dnode.c, line 219
Abort (core dumped)
https://gist.github.com/ma1245/62890cb46c139d2b5c2d
Does anyone have insight on where I should go from here?
So far no ideas, except that did you try the "hacks" for not-aborting
on encountering such errors? I think it may be "zdb -AAA" and/or some
kernel-side flags like "aok" (search the internet archives for details).
I am not sure what that would give except for ZDB traversing somewhat
deeper into the (unreliable, probably) tree of metadata.
Also, is it possible to base your search on an earlier TXG number (-t)
to see if there is an intact metadata tree that you can revert to by
losing some transactions committed later?
Thanks for the suggestions Jim.
I’ve spent some time working with zdb and although I still don’t have a
great understanding, some patterns have emerged.
Running zdb with switches -u (uberblock), -d (datasets), -i (intent logs)
magic = 0000000000bab10c
version = 5000
txg = 5996179
guid_sum = 4654748689984317566
timestamp = 1382409648 UTC = Tue Oct 22 02:40:48 2013
But running zdb with either -b (block statistics) or -c (checksum
metadata blocks) both fail with similar errors (note that there is more
than 5TB in this pool).
Traversing all blocks to verify nothing leaked ...
295M completed ( 296MB/s) estimated time remaining: 7hr 04min 22sec
1.57G completed ( 813MB/s) estimated time remaining: 2hr 34min 53sec
<..snip..>
70.0G completed (4859MB/s) estimated time remaining: 0hr 25min 41sec
71.1G completed (4630MB/s) estimated time remaining: 0hr 26min 57sec
bp->blk_pad[0] == 0, file ../../../uts/common/fs/zfs/zio.c, line 2845
Abort (core dumped)
Traversing all blocks to verify metadata checksums and verify nothing
leaked ...
8.91M completed ( 9MB/s) estimated time remaining: 230hr 30min 36sec
11.2M completed ( 5MB/s) estimated time remaining: 367hr 11min 40sec
<..snip..>
70.4G completed ( 748MB/s) estimated time remaining: 2hr 46min 44sec
71.8G completed ( 756MB/s) estimated time remaining: 2hr 45min 02sec
bp->blk_pad[0] == 0, file ../../../uts/common/fs/zfs/zio.c, line 2845
Abort (core dumped)
Running the same commands with -AAA (ignore assertions and enable panic
recovery) also fails
Traversing all blocks to verify nothing leaked ...
294M completed ( 300MB/s) estimated time remaining: 6hr 58min 55sec
1.54G completed ( 806MB/s) estimated time remaining: 2hr 36min 14sec
<..snip..>
70.0G completed (4566MB/s) estimated time remaining: 0hr 27min 20sec
71.3G completed (4379MB/s) estimated time remaining: 0hr 28min 29sec
bp->blk_pad[0] == 0, file ../../../uts/common/fs/zfs/zio.c, line 2845
Abort (core dumped)
Next thing I tried was with -F (attempt automatic rewind within safe
assertion failed for thread 0xfffffd7ffbb0f240, thread-id 64: c <
SPA_MAXBLOCKSIZE >> SPA_MINBLOCKSHIFT, file
../../../uts/common/fs/zfs/zio.c, line 244
Abort (core dumped)
Then I tried -t (highest txg to use when searching for uberblocks) and
took the one less than the txg listed by zdb -u. Basically the same error.
Traversing all blocks to verify nothing leaked ...
295M completed ( 298MB/s) estimated time remaining: 7hr 02min 42sec
1.80G completed ( 938MB/s) estimated time remaining: 2hr 14min 11sec
<..snip..>
70.0G completed (4552MB/s) estimated time remaining: 0hr 27min 25sec
71.8G completed (4392MB/s) estimated time remaining: 0hr 28min 24sec
bp->blk_pad[0] == 0, file ../../../uts/common/fs/zfs/zio.c, line 2845
Abort (core dumped)
Lastly I tried using the most recent txg that was shown when doing zdb
Segmentation Fault (core dumped)
So at this point I’m sort of stuck again. I feel like I’m (slowly)
learning but clearly still have a long ways to go. I’d still really like
to recover the pool, and I feel that it’s possible but I’m not sure what
the next steps should be.
If anyone has any insight or can make some more suggestions on what to
try, I’d be very gracious. Also, if anyone has cautions on things which
may further damage the pool, that would be appreciated.
Cheers,
Mika
surya
2014-01-02 16:36:06 UTC
Looking at the top 3 frames of the panic stack :
mutex_enter+0xb()
zio_buf_alloc+0x25(371600)
arc_get_data_buf+0x1d0(ffffff02e4522028)

The argument to zio_buf_alloc() is the buffer size, and here it is larger
than 0x20000, i.e. 128KB - the kmem zio caches only go up to 128KB - so,
as George suspected, the lookup ended up accessing some invalid location
and panicked.
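(Working that out: 0x371600 is about 3.4MB, so the cache index comes out to
(0x371600 - 1) >> SPA_MINBLOCKSHIFT = 0x1b8a, while the largest valid index
is (0x20000 >> 9) - 1 = 0xff.)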

From the thread which initiated the import :
traverse_dataset+0x54(ffffff02e4212c00, 5b7e90, d, fffffffff79e1fb0,
ffffff02e0d0ca60)
traverse_pool+0x18a(ffffff02e1c80080, 5b7e90, d, fffffffff79e1fb0,
ffffff02e0d0ca60)
spa_load_verify+0x94(ffffff02e1c80080)
It's attempting to import the pool at txg 0x5b7e90 [0t5996176] - and as part
of the import it tries to visit the last 3 txgs [for all the datasets of the
pool] to see that they are intact. I would think that trying an import with
an earlier txg could hopefully provide some help [you could use the -T flag
of the zpool command to specify a particular txg].
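Something like this, perhaps (only a sketch - -T is essentially undocumented,
so I would combine it with a read-only import and step back a few txgs at a
time; the txg value here is just an illustration of "a few before 5996176"):

  # zpool import -f -o readonly=on -T 5996173 pool_4k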
I would wait to see what others would say.
-Surya
Post by Mika Anderson
OmniOS 5.11 omnios-6de5e81 2013.11.27
0 on-line since 01/01/2014 16:02:53
1 off-line since 01/01/2014 16:17:45
savecore: System dump time: Wed Jan 1 16:19:02 2014
savecore: saving system crash dump in /var/crash/unknown/{unix,vmcore}.4
Constructing namelist /var/crash/unknown/unix.4
Constructing corefile /var/crash/unknown/vmcore.4
0:01 100% done: 143006 of 143006 pages saved
2030 (1%) zero pages were not written
0:01 dump decompress is done
-v\n::threadlist -v 10\n::msgbuf\n*panic_thread::findstack
-v\n::stacks' | mdb 4 > crash.4
ff6bdbfb-9b00-ce8a-f3c5-d843828b567a
TIME UUID SUNW-MSG-ID
Jan 01 2014 16:19:47.050511000 ff6bdbfb-9b00-ce8a-f3c5-d843828b567a
SUNOS-8000-KL
TIME CLASS ENA
Jan 01 16:19:47.0489 ireport.os.sunos.panic.dump_available
0x0000000000000000
Jan 01 16:19:44.2099 ireport.os.sunos.panic.dump_pending_on_device
0x0000000000000000
nvlist version: 0
version = 0x0
class = list.suspect
uuid = ff6bdbfb-9b00-ce8a-f3c5-d843828b567a
code = SUNOS-8000-KL
diag-time = 1388618387 49576
de = fmd:///module/software-diagnosis
fault-list-sz = 0x1
fault-list = (array of embedded nvlists)
(start fault-list[0])
nvlist version: 0
version = 0x0
class = defect.sunos.kernel.panic
certainty = 0x64
asru =
sw:///:path=/var/crash/unknown/.ff6bdbfb-9b00-ce8a-f3c5-d843828b567a
resource =
sw:///:path=/var/crash/unknown/.ff6bdbfb-9b00-ce8a-f3c5-d843828b567a
savecore-succcess = 1
dump-dir = /var/crash/unknown
dump-files = vmdump.4
os-instance-uuid = ff6bdbfb-9b00-ce8a-f3c5-d843828b567a
panicstr = BAD TRAP: type=d (#gp General protection)
rp=ffffff000f7e3dd0 addr=0
panicstack = unix:real_mode_stop_cpu_stage2_end+9de3
() | unix:trap+a30 () | unix:cmntrap+e6 () | unix:mutex_enter+b () |
zfs:zio_buf_alloc+25 () | zfs:arc_get_data_buf+1d0 () |
zfs:arc_buf_alloc+b5 () | zfs:arc_read+42b () |
zfs:traverse_prefetcher+105 () | zfs:traverse_visitbp+271 () |
zfs:traverse_dnode+8b () | zfs:traverse_visitbp+536 () |
zfs:traverse_visitbp+3fa () | zfs:traverse_visitbp+3fa () |
zfs:traverse_visitbp+3fa () | zfs:traverse_visitbp+3fa () |
zfs:traverse_visitbp+3fa () | zfs:traverse_visitbp+3fa () |
zfs:traverse_dnode+8b () | zfs:traverse_visitbp+5fd () |
zfs:traverse_prefetch_thread+79 () | genunix:taskq_d_thread+b7 () |
unix:thread_start+8 () |
crashtime = 1388618342
panic-time = Wed Jan 1 16:19:02 2014 MST
(end fault-list[0])
fault-status = 0x1
severity = Major
__ttl = 0x1
__tod = 0x52c4a293 0x302bc98
Interpreting these messages is far beyond my knowledge.
Cheers.
Mika
What does the panic stack look like when you attempt to import?
The arc_get_data_buf()->zio_buf_alloc()->mutex_enter() panic doesn't seem
to indicate any on-disk issue, but rather an in-kernel race. I would also try
offlining all but one CPU and retrying the import.
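Something along these lines, assuming psrinfo shows CPUs 0-3 (psradm -f takes
the listed processors offline, psradm -n brings them back afterwards):

psrinfo
psradm -f 1 2 3
zpool import -f pool_4k
psradm -n 1 2 3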
-surya
Post by Mika Anderson
Sorry to revive this thread. I now have some time to look at
this issue and this pool is still a problem.
As a recap, this issue started when deleting about 2500 small
files from a CIFS share over the network. About two-thirds of the way
through the delete, OpenIndiana crashed and went into a reboot
loop -- each time it rebooted, it crashed. To troubleshoot the
pool, I installed OmniOS and worked with zpool and zdb gathering
information. Any attempt to mount the pool (even with -o
readonly=on) fails with a panic. Several zdb commands also cause
a panic.
Is anyone interested in looking at the crash dumps from a zpool
import command?
I created a panic on omnios-6de5e81 2013.11.27 by running zpool
import -f pool_4k.
Below is the result of the fmdump command I was prompted to run
upon reboot. The attached crash.2 was created by following the
instructions at http://wiki.illumos.org/display/illumos/How+To+Report+Problems
9fe80747-70f8-6cc0-d076-f86de1d80830
TIME UUID
SUNW-MSG-ID
Dec 30 2013 15:21:43.487897000
9fe80747-70f8-6cc0-d076-f86de1d80830 SUNOS-8000-KL
TIME CLASS ENA
Dec 30 15:21:43.4854 ireport.os.sunos.panic.dump_available
0x0000000000000000
Dec 30 15:21:40.9688
ireport.os.sunos.panic.dump_pending_on_device 0x0000000000000000
nvlist version: 0
version = 0x0
class = list.suspect
uuid = 9fe80747-70f8-6cc0-d076-f86de1d80830
code = SUNOS-8000-KL
diag-time = 1388442103 486125
de = fmd:///module/software-diagnosis
fault-list-sz = 0x1
fault-list = (array of embedded nvlists)
(start fault-list[0])
nvlist version: 0
version = 0x0
class = defect.sunos.kernel.panic
certainty = 0x64
asru =
sw:///:path=/var/crash/unknown/.9fe80747-70f8-6cc0-d076-f86de1d80830
resource =
sw:///:path=/var/crash/unknown/.9fe80747-70f8-6cc0-d076-f86de1d80830
savecore-succcess = 1
dump-dir = /var/crash/unknown
dump-files = vmdump.2
os-instance-uuid =
9fe80747-70f8-6cc0-d076-f86de1d80830
panicstr = BAD TRAP: type=d (#gp General
protection) rp=ffffff001097bdd0 addr=0
panicstack =
unix:real_mode_stop_cpu_stage2_end+9de3 () | unix:trap+a30 () |
unix:cmntrap+e6 () | unix:mutex_enter+b () | zfs:zio_buf_alloc+25
() | zfs:arc_get_data_buf+1d0 () | zfs:arc_buf_alloc+b5 () |
zfs:arc_read+42b () | zfs:traverse_prefetcher+105 () |
zfs:traverse_visitbp+271 () | zfs:traverse_dnode+8b () |
zfs:traverse_visitbp+536 () | zfs:traverse_visitbp+3fa () |
zfs:traverse_visitbp+3fa () | zfs:traverse_visitbp+3fa () |
zfs:traverse_visitbp+3fa () | zfs:traverse_visitbp+3fa () |
zfs:traverse_visitbp+3fa () | zfs:traverse_dnode+8b () |
zfs:traverse_visitbp+5fd () | zfs:traverse_prefetch_thread+79 ()
| genunix:taskq_d_thread+b7 () | unix:thread_start+8 () |
crashtime = 1388442073
panic-time = Mon Dec 30 15:21:13 2013 MST
(end fault-list[0])
fault-status = 0x1
severity = Major
__ttl = 0x1
__tod = 0x52c1f1f7 0x1d14b7a8
I'm still interested in recovering the pool if possible but am
prepared to rebuild it if necessary.
Thanks and happy holidays!
Mika
On Mon, Nov 18, 2013 at 8:55 PM, Mika Anderson
On Mon, Nov 18, 2013 at 10:59 AM, Jim Klimov
As per Jim's suggestion, I ran "zdb -e -L pool_4k"
which returned
assertion failed for thread 0xfffffd7fff162a40,
dn->dn_nlevels <= 30 (0x21 <= 0x1e), file
../../../uts/common/fs/zfs/dnode.c, line 219
Abort (core dumped)
https://gist.github.com/ma1245/62890cb46c139d2b5c2d
Does anyone have insight on where I should go from here?
So far no ideas, except: did you try the "hacks" for not aborting on
encountering such errors? I think it may be "zdb -AAA" and/or some
kernel-side flags like "aok" (search the internet archives for details).
I am not sure what that would give except for ZDB traversing somewhat
deeper into the (probably unreliable) tree of metadata.
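From memory (so verify the exact names before relying on them), those
kernel-side knobs are usually flipped either live via mdb:

echo "aok/W 1" | mdb -kw
echo "zfs_recover/W 1" | mdb -kw

or persistently in /etc/system, followed by a reboot:

set aok=1
set zfs:zfs_recover=1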
Also, is it possible to base your search on an earlier TXG number (-t)
to see if there is an intact metadata tree that you can revert to by
losing some transactions committed later?
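For instance (the txg number here is only a placeholder; pick one a little
below what zdb -u reports):

zdb -e -L -u -t 5996170 pool_4k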
Thanks for the suggestions Jim.
I’ve spent some time working with zdb and although I still
don’t have a great understanding, some patterns have emerged.
Running zdb with the switches -u (uberblock), -d (datasets), -i
(intent logs), -h (pool history), and -m (metaslabs) all return
successfully; the -u output, for example, is:
magic = 0000000000bab10c
version = 5000
txg = 5996179
guid_sum = 4654748689984317566
timestamp = 1382409648 UTC = Tue Oct 22 02:40:48 2013
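(For reference, those were run against the exported pool in the same general
form as before -- e.g. something like:

zdb -e -u pool_4k

and likewise with -d, -i, -h and -m.)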
But running zdb with -b (block statistics) or -c (checksum
metadata blocks) fails with a similar error in both cases
(note that there is more than 5 TB in this pool).
Traversing all blocks to verify nothing leaked ...
295M completed ( 296MB/s) estimated time remaining: 7hr 04min 22sec
1.57G completed ( 813MB/s) estimated time remaining: 2hr 34min 53sec
<..snip..>
70.0G completed (4859MB/s) estimated time remaining: 0hr 25min 41sec
71.1G completed (4630MB/s) estimated time remaining: 0hr
26min 57sec assertion failed for thread 0xfffffd7fff162a40,
thread-id 1: bp->blk_pad[0] == 0, file
../../../uts/common/fs/zfs/zio.c, line 2845
Abort (core dumped)
Traversing all blocks to verify metadata checksums and verify
nothing leaked ...
8.91M completed ( 9MB/s) estimated time remaining: 230hr 30min 36sec
11.2M completed ( 5MB/s) estimated time remaining: 367hr 11min 40sec
<..snip..>
70.4G completed ( 748MB/s) estimated time remaining: 2hr 46min 44sec
71.8G completed ( 756MB/s) estimated time remaining: 2hr
45min 02sec assertion failed for thread 0xfffffd7fff162a40,
thread-id 1: bp->blk_pad[0] == 0, file
../../../uts/common/fs/zfs/zio.c, line 2845
Abort (core dumped)
Running the same commands with -AAA (ignore assertions and
enable panic recovery) also fails
Traversing all blocks to verify nothing leaked ...
294M completed ( 300MB/s) estimated time remaining: 6hr 58min 55sec
1.54G completed ( 806MB/s) estimated time remaining: 2hr 36min 14sec
<..snip..>
70.0G completed (4566MB/s) estimated time remaining: 0hr 27min 20sec
71.3G completed (4379MB/s) estimated time remaining: 0hr
28min 29sec assertion failed for thread 0xfffffd7fff162a40,
thread-id 1: bp->blk_pad[0] == 0, file
../../../uts/common/fs/zfs/zio.c, line 2845
Abort (core dumped)
The next thing I tried was -F (attempt automatic rewind within a
safe range of transaction groups); this errored out with:
c < SPA_MAXBLOCKSIZE >> SPA_MINBLOCKSHIFT, file
../../../uts/common/fs/zfs/zio.c, line 244
Abort (core dumped)
Then I tried -t (highest txg to use when searching for
uberblocks), using one less than the txg listed by zdb
-u. Basically the same error:
Traversing all blocks to verify nothing leaked ...
295M completed ( 298MB/s) estimated time remaining: 7hr 02min 42sec
1.80G completed ( 938MB/s) estimated time remaining: 2hr 14min 11sec
<..snip..>
70.0G completed (4552MB/s) estimated time remaining: 0hr 27min 25sec
71.8G completed (4392MB/s) estimated time remaining: 0hr
28min 24sec assertion failed for thread 0xfffffd7fff162a40,
thread-id 1: bp->blk_pad[0] == 0, file
../../../uts/common/fs/zfs/zio.c, line 2845
Abort (core dumped)
Lastly I tried using the most recent txg that was shown; that
attempt ended with:
Segmentation Fault (core dumped)
So at this point I’m sort of stuck again. I feel like I’m
(slowly) learning but clearly still have a long way to go.
I’d still really like to recover the pool, and I feel that
it’s possible, but I’m not sure what the next steps should be.
If anyone has any insight or can make some more suggestions
on what to try, I’d be very grateful. Also, if anyone has
cautions about things which may further damage the pool, that
would be appreciated.
Cheers,
Mika
