Discussion:
Panic on read/write pool import, but not on readonly
Dan McDonald via illumos-zfs
2014-07-22 19:43:07 UTC
Permalink
I apologize for any lack of information. This is on behalf of an OmniOS customer.

This customer has a pool of 8 disks. Two raidz1 vdevs of 4 disks each. Three disks in each raidz are attached to an mpt_sas controller, and one in each raidz is directly attached to AHCI sata. I know it's unusual, but run with me, please.

The customer thought one of the drives was reporting errors, so they yanked the drive. I'm not sure when the initial panic occurred, as their root pool was not able to provide a working zvol for a dump device (their rpool was attached to arcmsr... a different problem).

Eventually I managed to get a real coredump... three of them, actually: one from OmniOS r151008, one from r151010, and one from "bloody" (July 11) with a fresh-today ZFS module. All three misbehaved the same way when asked to import this pool read/write. If I imported the pool readonly, it imported fine. Also, cursory runs of zdb don't dump core.
::spa -v
ADDR STATE NAME
ffffff114dd07000 ACTIVE rpool

ADDR STATE AUX DESCRIPTION
ffffff113bbe9540 HEALTHY - root
ffffff113bbe9000 HEALTHY - /dev/dsk/c2t0d0s0
ffffff11aefbb000 ACTIVE zfs10

ffffff11a0910a80 HEALTHY - root
ffffff11a091f040 HEALTHY - raidz
ffffff11a091f580 HEALTHY - /dev/dsk/c4t4d0s0
ffffff1180cb4000 HEALTHY - /dev/dsk/c1t50014EE20A343375d0s0
ffffff1156861040 HEALTHY - /dev/dsk/c1t50014EE20A34683Cd0s0
ffffff11a091fac0 HEALTHY - /dev/dsk/c1t50014EE20A3477F4d0s0
ffffff11bcca5540 HEALTHY - /dev/dsk/c1t50014EE25F892E89d0s0
ffffff11bcca5000 HEALTHY - raidz
ffffff11bcca2ac0 HEALTHY - /dev/dsk/c4t5d0s0
ffffff11bcc9cac0 HEALTHY - /dev/dsk/c1t50014EE20A3479C5d0s0
ffffff11aee2d040 HEALTHY - /dev/dsk/c1t50014EE25F869F1Ad0s0
ffffff1181672040 HEALTHY - /dev/dsk/c1t50014EE2B4DE4716d0s0
ffffff11a0bbcac0 HEALTHY - /dev/dsk/c1t50014EE2B4DF963Dd0s0
::status
debugging crash dump vmcore.0 (64-bit) from zfs10
operating system: 5.11 omnios-8c08411 (i86pc)
image uuid: 973582c7-02ea-4f24-8fb4-ef1e6744f41c
panic message:
BAD TRAP: type=e (#pf Page fault) rp=ffffff007a9790f0 addr=90 occurred in module
"zfs" due to a NULL pointer dereference
dump content: kernel pages only
$c
zio_vdev_child_io+0x4a(ffffff11823dca70, ffffff11823dcae0, 0, 723ec828000,
ffffff11bda85400, 200)
vdev_mirror_io_start+0x192(ffffff11823dca70)
zio_vdev_io_start+0x247(ffffff11823dca70)
zio_execute+0x88(ffffff11823dca70)
zio_nowait+0x21(ffffff11823dca70)
dsl_scan_scrub_cb+0x2b6(ffffff11ae43f080, ffffff1181d7fb80, ffffff007a9795d0)
dsl_scan_visitbp+0x175(ffffff1181d7fb80, ffffff007a9795d0, ffffff1181d7fa00,
ffffff1182301468, ffffff1156c34700, ffffff116c255000)
dsl_scan_visitdnode+0x121(ffffff116c255000, ffffff1156c34700, 2,
ffffff1181d7fa00, ffffff1182301468, 655cd)
dsl_scan_recurse+0x400(ffffff116c255000, ffffff1156c34700, 2, ffffff11bd6c9800,
ffffff007a979770, ffffff007a9798c0)
dsl_scan_visitbp+0xef(ffffff1188d0c700, ffffff007a9798c0, ffffff11bd6c9800,
ffffff1182c3c590, ffffff1156c34700, ffffff116c255000)
dsl_scan_recurse+0x1fb(ffffff116c255000, ffffff1156c34700, 2, ffffff11bd6c9800,
ffffff007a979980, ffffff007a979ad0)
dsl_scan_visitbp+0xef(ffffff11bca6b280, ffffff007a979ad0, ffffff11bd6c9800,
ffffff118891a240, ffffff1156c34700, ffffff116c255000)
dsl_scan_recurse+0x1fb(ffffff116c255000, ffffff1156c34700, 2, ffffff11bd6c9800,
ffffff007a979b90, ffffff007a979ce0)
dsl_scan_visitbp+0xef(ffffff1181f82000, ffffff007a979ce0, ffffff11bd6c9800,
ffffff115686c0c8, ffffff1156c34700, ffffff116c255000)
dsl_scan_recurse+0x1fb(ffffff116c255000, ffffff1156c34700, 2, ffffff11bd6c9800,
ffffff007a979da0, ffffff007a979ef0)
dsl_scan_visitbp+0xef(ffffff118250a000, ffffff007a979ef0, ffffff11bd6c9800,
ffffff1168071450, ffffff1156c34700, ffffff116c255000)
dsl_scan_recurse+0x1fb(ffffff116c255000, ffffff1156c34700, 2, ffffff11bd6c9800,
ffffff007a979fb0, ffffff007a97a100)
dsl_scan_visitbp+0xef(ffffff116547c000, ffffff007a97a100, ffffff11bd6c9800,
ffffff11ae975350, ffffff1156c34700, ffffff116c255000)
dsl_scan_recurse+0x1fb(ffffff116c255000, ffffff1156c34700, 2, ffffff11bd6c9800,
ffffff007a97a1c0, ffffff007a97a310)
dsl_scan_visitbp+0xef(ffffff1182626000, ffffff007a97a310, ffffff11bd6c9800,
ffffff1181cd9280, ffffff1156c34700, ffffff116c255000)
dsl_scan_recurse+0x1fb(ffffff116c255000, ffffff1156c34700, 2, ffffff11bd6c9800,
ffffff007a97a3d0, ffffff007a97a4f0)
dsl_scan_visitbp+0xef(ffffff11bd6c9840, ffffff007a97a4f0, ffffff11bd6c9800,
ffffff118891a150, ffffff1156c34700, ffffff116c255000)
dsl_scan_visitdnode+0xbd(ffffff116c255000, ffffff1156c34700, 2, ffffff11bd6c9800
, ffffff118891a150, 0)
dsl_scan_recurse+0x439(ffffff116c255000, ffffff1156c34700, 0, 0,
ffffff007a97a690, ffffff007a97a7a0)
dsl_scan_visitbp+0xef(ffffff11bd547c80, ffffff007a97a7a0, 0, 0, ffffff1156c34700
, ffffff116c255000)
dsl_scan_visit_rootbp+0x61(ffffff116c255000, ffffff1156c34700, ffffff11bd547c80
, ffffff11bb5ab7c0)
dsl_scan_visitds+0xa0(ffffff116c255000, 1b4, ffffff11bb5ab7c0)
dsl_scan_visit+0x65(ffffff116c255000, ffffff11bb5ab7c0)
dsl_scan_sync+0x12f(ffffff11ae43f080, ffffff11bb5ab7c0)
spa_sync+0x334(ffffff11aefbb000, 189db3)
txg_sync_thread+0x227(ffffff11ae43f080)
thread_start+8()
I can probably get back on this customer's machine, but I was wondering if any of the above tickled anyone's memories? Searching for these functions in the illumos bug database didn't yield a lot.

If you have OmniOS r151010 (aka. our current "stable" build), you can inspect the system dump yourself! I can also provide output from any other mdb dcmds.

Thanks,
Dan
George Wilson via illumos-zfs
2014-07-22 19:50:29 UTC
Permalink
Dan,

It looks like the vdev being passed into zio_vdev_child_io() is NULL. I
would start by looking at where this came from. Also look at the
configuration for the pool and make sure you're not dealing with some
bogus or missing vdev.
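To make the failure shape concrete, here is a minimal stand-alone model (my own stub types and names, not the actual illumos source) of the top two frames: for a logical zio the mirror code walks one child per DVA and hands each child's vdev pointer straight into the child-I/O constructor, so a child whose vdev could not be resolved shows up as exactly this kind of NULL dereference.

#include <stdio.h>
#include <stddef.h>

/* stand-in types; the real ones live in the zfs module */
typedef struct vdev { const char *vdev_path; } vdev_t;

typedef struct mirror_child {
	vdev_t			*mc_vd;		/* resolved from the DVA's vdev id */
	unsigned long long	mc_offset;
} mirror_child_t;

typedef struct mirror_map {
	int		mm_children;
	mirror_child_t	mm_child[2];
} mirror_map_t;

static void
child_io(vdev_t *vd, unsigned long long offset)
{
	if (vd == NULL) {
		/* the real zio_vdev_child_io() has no such guard; it just faults */
		printf("NULL vdev for offset %llx -> this is the panic\n", offset);
		return;
	}
	printf("child io on %s at %llx\n", vd->vdev_path, offset);
}

int
main(void)
{
	vdev_t raidz = { "raidz-0" };
	mirror_map_t mm = {
		.mm_children = 2,
		.mm_child = {
			{ NULL, 0x723ec828000ULL },	/* the vdev for this DVA could not be looked up */
			{ &raidz, 0x200ULL },		/* a healthy child for contrast */
		},
	};

	for (int c = 0; c < mm.mm_children; c++)
		child_io(mm.mm_child[c].mc_vd, mm.mm_child[c].mc_offset);
	return (0);
}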

Thanks,
George
Dan McDonald via illumos-zfs
2014-07-22 19:59:15 UTC
Permalink
Post by George Wilson via illumos-zfs
Dan,
It looks like the vdev being passed into zio_vdev_child_io() is NULL.
So...
Post by George Wilson via illumos-zfs
I would start by looking at where this came from. Also look at the configuration for the pool and make sure you're not dealing with some bogus or missing vdev.
Does "::spa -c" provide that configuration information?

I need to find the mirror_map_t from vdev_mirror_io_start(). It's on the stack somewhere...

Dan
Dan McDonald via illumos-zfs
2014-07-22 20:56:58 UTC
Permalink
Post by Dan McDonald via illumos-zfs
I need to find the mirror_map_t from vdev_mirror_io_start(). It's on the stack somewhere...
Found it!
> 0xffffff11aee1ac80::whatis
ffffff11aee1ac80 is allocated from kmem_alloc_64
> 0xffffff11aee1ac80::print -a mirror_map_t
ffffff11aee1ac80 {
ffffff11aee1ac80 mm_children = 0x2
ffffff11aee1ac84 mm_replacing = 0
ffffff11aee1ac88 mm_preferred = 0
ffffff11aee1ac8c mm_root = 0x1
ffffff11aee1ac90 mm_child = [
ffffff11aee1ac90 {
ffffff11aee1ac90 mc_vd = 0
ffffff11aee1ac98 mc_offset = 0x723ec828000
ffffff11aee1aca0 mc_error = 0
ffffff11aee1aca4 mc_tried = 0
ffffff11aee1aca5 mc_skipped = 0
ffffff11aee1aca6 mc_speculative = 0
},
]
}
> ffffff11aee1aca8::print mirror_child_t
{
mc_vd = 0xffffff11a091f040
mc_offset = 0x6cd5e68000
mc_error = 0
mc_tried = 0
mc_skipped = 0
mc_speculative = 0
}

So why does mm_child[0] have a NULL vdev pointer? I'll keep looking...

Dan
George Wilson via illumos-zfs
2014-07-22 21:17:41 UTC
Permalink
Take a look at the blkptr that is associated with the zio:

<bp>::blkptr

It's possible that the dva has some bogus thing which results in
vdev_lookup_top() returning NULL.
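Roughly, the top-level lookup is just a bounds check against the number of children under the root vdev, so any out-of-range vdev id in a DVA comes back NULL. A tiny self-contained model (my own types, not the real vdev_lookup_top()):

#include <stdio.h>
#include <stddef.h>

typedef struct vdev {
	unsigned long	vdev_id;
	const char	*vdev_path;
} vdev_t;

typedef struct root_vdev {
	unsigned long	vdev_children;
	vdev_t		**vdev_child;
} root_vdev_t;

static vdev_t *
lookup_top_model(root_vdev_t *rvd, unsigned long vdev_id)
{
	if (vdev_id < rvd->vdev_children)
		return (rvd->vdev_child[vdev_id]);
	return (NULL);			/* any out-of-range DVA vdev id lands here */
}

int
main(void)
{
	vdev_t raidz0 = { 0, "raidz-0" }, raidz1 = { 1, "raidz-1" };
	vdev_t *kids[] = { &raidz0, &raidz1 };
	root_vdev_t rvd = { 2, kids };	/* zfs10 has two top-level vdevs */

	printf("vdev 1 -> %p\n", (void *)lookup_top_model(&rvd, 1));
	printf("vdev 5 -> %p\n", (void *)lookup_top_model(&rvd, 5));	/* NULL */
	return (0);
}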

- George
Dan McDonald via illumos-zfs
2014-07-22 22:22:41 UTC
Permalink
Post by George Wilson via illumos-zfs
<bp>::blkptr
It's possible that the dva has some bogus thing which results in vdev_lookup_top() returning NULL.
ffffff11823dcbe0 io_bp_orig = {
ffffff11823dcbe0 blk_dva = [
ffffff11823dcbe0 {
ffffff11823dcbe0 dva_word = [ 0x400000012, 0x391f64140 ]
},
ffffff11823dcbf0 {
ffffff11823dcbf0 dva_word = [ 0x10, 0x366af340 ]
},
ffffff11823dcc00 {
ffffff11823dcc00 dva_word = [ 0, 0 ]
},
]
ffffff11823dcc10 blk_prop = 0x802c070200000000
ffffff11823dcc18 blk_pad = [ 0, 0 ]
ffffff11823dcc28 blk_phys_birth = 0
ffffff11823dcc30 blk_birth = 0x1329e1
ffffff11823dcc38 blk_fill = 0x1
ffffff11823dcc40 blk_cksum = {
ffffff11823dcc40 zc_word = [ 0x4921f17e, 0x183190af17, 0x4189a564232
, 0x798a6b324146 ]
}
}

AND...
> ffffff11823dcae0::blkptr
DVA[0]=<4:723ec828000:2400>
DVA[1]=<0:6cd5e68000:2000>
[L0 SA] FLETCHER_4 OFF LE contiguous unique double
size=200L/200P birth=1255905L/1255905P fill=1
cksum=4921f17e:183190af17:4189a564232:798a6b324146

I'm not sure how to interpret the ::blkptr output, but I won't have access to the customer pool until tomorrow.

Dan
George Wilson via illumos-zfs
2014-07-22 22:40:56 UTC
Permalink
This is your problem:

DVA[0]=<4:723ec828000:2400>


This is saying that you're trying to address vdev 4 and based on your
original post there are only two raidz1 vdevs. So why would it think
that there is a vdev 4?
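For anyone decoding the raw words by hand: using the on-disk DVA layout as I understand it (vdev id in the top 32 bits of word 0, ASIZE in the low 24 bits in 512-byte sectors, offset in word 1 in sectors), the dva_word pair Dan posted ([ 0x400000012, 0x391f64140 ]) reproduces exactly what ::blkptr printed. A quick stand-alone check:

#include <stdio.h>
#include <stdint.h>

int
main(void)
{
	uint64_t w0 = 0x400000012ULL, w1 = 0x391f64140ULL;

	uint64_t vdev   = w0 >> 32;			/* 0x4 */
	uint64_t asize  = (w0 & 0xffffffULL) << 9;	/* 0x12 sectors -> 0x2400 */
	uint64_t offset = (w1 & ~(1ULL << 63)) << 9;	/* 0x723ec828000 */

	/* prints DVA[0]=<4:723ec828000:2400>, matching the ::blkptr output */
	printf("DVA[0]=<%llu:%llx:%llx>\n",
	    (unsigned long long)vdev,
	    (unsigned long long)offset,
	    (unsigned long long)asize);
	return (0);
}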

- George
Dan McDonald via illumos-zfs
2014-07-22 22:45:28 UTC
Permalink
Post by George Wilson via illumos-zfs
DVA[0]=<4:723ec828000:2400>
This is saying that you're trying to address vdev 4 and based on your original post there are only two raidz1 vdevs. So why would it think that there is a vdev 4?
::spa -v
ADDR STATE NAME
ffffff114dd07000 ACTIVE rpool

ADDR STATE AUX DESCRIPTION
ffffff113bbe9540 HEALTHY - root
ffffff113bbe9000 HEALTHY - /dev/dsk/c2t0d0s0
ffffff11aefbb000 ACTIVE zfs10

ffffff11a0910a80 HEALTHY - root
ffffff11a091f040 HEALTHY - raidz
ffffff11a091f580 HEALTHY - /dev/dsk/c4t4d0s0
ffffff1180cb4000 HEALTHY - /dev/dsk/c1t50014EE20A343375d0s0
ffffff1156861040 HEALTHY - /dev/dsk/c1t50014EE20A34683Cd0s0
ffffff11a091fac0 HEALTHY - /dev/dsk/c1t50014EE20A3477F4d0s0
ffffff11bcca5540 HEALTHY - /dev/dsk/c1t50014EE25F892E89d0s0
ffffff11bcca5000 HEALTHY - raidz
ffffff11bcca2ac0 HEALTHY - /dev/dsk/c4t5d0s0
ffffff11bcc9cac0 HEALTHY - /dev/dsk/c1t50014EE20A3479C5d0s0
ffffff11aee2d040 HEALTHY - /dev/dsk/c1t50014EE25F869F1Ad0s0
ffffff1181672040 HEALTHY - /dev/dsk/c1t50014EE2B4DE4716d0s0
ffffff11a0bbcac0 HEALTHY - /dev/dsk/c1t50014EE2B4DF963Dd0s0
But the zio itself suggests, as you say, that the zio is selecting one of the two raidzs:

io_child_type = 3 (ZIO_CHILD_LOGICAL)

Would swapping the drives around cause this?

Thanks,
Dan
George Wilson via illumos-zfs
2014-07-22 23:15:14 UTC
Permalink
Keep in mind that vdev 4 here means that it's expecting 5 raidz vdevs
(not the disks that make up the raidz vdev). Swapping a disk shouldn't
do this unless somehow a bug in the raidz logic is causing us to use the
leaf device's id number as a top-level id when we allocated a block.
That would be really bad.

- George
Dan McDonald via illumos-zfs
2014-07-22 23:23:19 UTC
Permalink
Post by George Wilson via illumos-zfs
Keep in mind that vdev 4 here means that it's expecting 5 raidz vdevs (not the disks that make up the raidz vdev).
Got it. I wonder how that happened?
Post by George Wilson via illumos-zfs
Swapping a disk shouldn't do this unless somehow a bug in the raidz logic is causing us to use the leaf device's id number as a top-level id when we allocated a block. That would be really bad.
Yes, that would be really bad.

I'm still stumped, however, as to how my poor customer's pool got into this state, and more importantly, how to break out of it. If you have any suggestions, I'm game. I'm tempted, if/when I have access again, to step through with kmdb and see how I got there. I can't think of much else to try tonight with the coredump.

One last thing - the ::spa -c output shows something disturbing (to me):

ffffff11aefbb000 ACTIVE zfs10

version=0000000000001388
txg=0000000000000000
pool_guid=81865a22f4e66b8e
vdev_children=0000000000000002

txg == 0?!? Is that because it hasn't been read yet? Or is something darker and more sinister going on?

Thanks for your time & patience,
Dan
Dan McDonald via illumos-zfs
2014-07-25 17:25:24 UTC
Permalink
Post by George Wilson via illumos-zfs
DVA[0]=<4:723ec828000:2400>
This is saying that you're trying to address vdev 4 and based on your original post there are only two raidz1 vdevs. So why would it think that there is a vdev 4?
This DVA was obtained from a dnode_phys_t's dn_spill.

Here's the relevant bit of stack, with pointers highlighted **with stars**...

dsl_scan_scrub_cb+0x2b6(ffffff11ae43f080, ffffff1181d7fb80, ffffff007a9795d0)
dsl_scan_visitbp+0x175(***ffffff1181d7fb80 [blkptr with bad DVA]***,
ffffff007a9795d0, ffffff1181d7fa00, ffffff1182301468, ffffff1156c34700, ffffff116c255000)

dsl_scan_visitdnode+0x121(ffffff116c255000, ffffff1156c34700, 2,
***ffffff1181d7fa00 [dnode_phys_t with dn_spill blkptr]***,
ffffff1182301468, 655cd)

dsl_scan_recurse+0x400(ffffff116c255000, ffffff1156c34700, 2, ffffff11bd6c9800,
ffffff007a979770, ffffff007a9798c0)
> ffffff1181d7fa00::print -at dnode_phys_t
ffffff1181d7fa00 dnode_phys_t {
ffffff1181d7fa00 uint8_t dn_type = 0x14
ffffff1181d7fa01 uint8_t dn_indblkshift = 0xe
ffffff1181d7fa02 uint8_t dn_nlevels = 0x1
ffffff1181d7fa03 uint8_t dn_nblkptr = 0x1
ffffff1181d7fa04 uint8_t dn_bonustype = 0x2c
ffffff1181d7fa05 uint8_t dn_checksum = 0
ffffff1181d7fa06 uint8_t dn_compress = 0
ffffff1181d7fa07 uint8_t dn_flags = 0x7
ffffff1181d7fa08 uint16_t dn_datablkszsec = 0x1
ffffff1181d7fa0a uint16_t dn_bonuslen = 0xc8
ffffff1181d7fa0c uint8_t [4] dn_pad2 = [ 0, 0, 0, 0 ]
ffffff1181d7fa10 uint64_t dn_maxblkid = 0
ffffff1181d7fa18 uint64_t dn_used = 0x6640
ffffff1181d7fa20 uint64_t [4] dn_pad3 = [ 0, 0, 0, 0 ]
ffffff1181d7fa40 blkptr_t [1] dn_blkptr = [
ffffff1181d7fa40 blkptr_t {
ffffff1181d7fa40 dva_t [3] blk_dva = [
ffffff1181d7fa40 dva_t {
ffffff1181d7fa40 uint64_t [2] dva_word = [ 0x100000010,
0x391f64130 ]
},
ffffff1181d7fa50 dva_t {
ffffff1181d7fa50 uint64_t [2] dva_word = [ 0x10, 0x366af330
]
},
ffffff1181d7fa60 dva_t {
ffffff1181d7fa60 uint64_t [2] dva_word = [ 0, 0 ]
},
]
ffffff1181d7fa70 uint64_t blk_prop = 0x8014070200000000
ffffff1181d7fa78 uint64_t [2] blk_pad = [ 0, 0 ]
ffffff1181d7fa88 uint64_t blk_phys_birth = 0
ffffff1181d7fa90 uint64_t blk_birth = 0x1329e1
ffffff1181d7fa98 uint64_t blk_fill = 0x1
ffffff1181d7faa0 zio_cksum_t blk_cksum = {
ffffff1181d7faa0 uint64_t [4] zc_word = [ 0x3c215c6bc,
0x1a0ce290f27, 0x5ba4cef40f42, 0xda324ffe604d6 ]
}
},
]
ffffff1181d7fac0 uint8_t [192] dn_bonus = [ 0x5a, 0x50, 0x2f, 0, 0x2, 0x4,
0x38, 0, 0xff, 0x41, 0, 0, 0, 0, 0, 0, 0x3, 0, 0, 0, 0, 0, 0, 0, 0xe1, 0x29,
0x13, 0, 0, 0, 0, 0, ... ]
ffffff1181d7fb80 blkptr_t dn_spill = {
ffffff1181d7fb80 dva_t [3] blk_dva = [
ffffff1181d7fb80 dva_t {
ffffff1181d7fb80 uint64_t [2] dva_word = [ 0x400000012,
0x391f64140 ]
},
ffffff1181d7fb90 dva_t {
ffffff1181d7fb90 uint64_t [2] dva_word = [ 0x10, 0x366af340 ]
},
ffffff1181d7fba0 dva_t {
ffffff1181d7fba0 uint64_t [2] dva_word = [ 0, 0 ]
},
]
ffffff1181d7fbb0 uint64_t blk_prop = 0x802c070200000000
ffffff1181d7fbb8 uint64_t [2] blk_pad = [ 0, 0 ]
ffffff1181d7fbc8 uint64_t blk_phys_birth = 0
ffffff1181d7fbd0 uint64_t blk_birth = 0x1329e1
ffffff1181d7fbd8 uint64_t blk_fill = 0x1
ffffff1181d7fbe0 zio_cksum_t blk_cksum = {
ffffff1181d7fbe0 uint64_t [4] zc_word = [ 0x4921f17e, 0x183190af17,
0x4189a564232, 0x798a6b324146 ]
}
}
}

I'm still not sure why this would have a vdev #4 at this level, unless it was scribbled via a race, via disk corruption, or via some other vector.

. . .


I've managed, by setting the zfs_no_scrub_io tunable to TRUE, to import the pool regularly, though I'm afraid to do much writing or even restart the scrub on it.
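My reading of why that tunable helps (a sketch with stand-in types, not the real dsl_scan.c): the scrub callback still visits the bad blkptr, but with zfs_no_scrub_io set it never issues the read zio, so the vdev_mirror_io_start()/zio_vdev_child_io() frames that trip over the NULL vdev are never reached.

#include <stdio.h>

typedef struct blkptr { int bp_dummy; } blkptr_t;	/* stand-in, not the real type */

static int zfs_no_scrub_io = 1;		/* the tunable, flipped to TRUE */

static void
issue_scrub_read(const blkptr_t *bp)
{
	(void) bp;
	/* real code: zio_nowait(zio_read(...)) -> vdev_mirror_io_start() */
	printf("scrub read issued\n");
}

static void
scrub_cb_model(const blkptr_t *bp)
{
	int needs_io = 1;		/* this bp hasn't been scrubbed yet */

	if (needs_io && !zfs_no_scrub_io)
		issue_scrub_read(bp);	/* suppressed while the tunable is set */
}

int
main(void)
{
	blkptr_t bp = { 0 };
	scrub_cb_model(&bp);		/* prints nothing: the I/O is skipped */
	return (0);
}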

Curious --> is there some set of zdb options I can run to look for corrupt blkptrs on disk?

Thanks,
Dan
Dan McDonald via illumos-zfs
2014-07-25 21:17:48 UTC
Permalink
Post by Dan McDonald via illumos-zfs
DVA[0]=<4:723ec828000:2400>
DVA[1]=<0:6cd5e68000:2000>
[L0 SA] FLETCHER_4 OFF LE contiguous unique double
size=200L/200P birth=1255905L/1255905P fill=1
cksum=4921f17e:183190af17:4189a564232:798a6b324146

That blkptr is on my customer's disks SOMEWHERE, and when a scrub is invoked, eventually the system will panic because of it.

Two last questions (which I will investigate myself as well, but they bear asking publicly):

1.) Is there any way for me to modify the ZFS code to better cope with corruption like this?

2.) Is there a way to use ZDB to find and/or fix the offensive blkptr?

Thanks,
Dan
Don Brady via illumos-zfs
2014-07-26 00:06:09 UTC
Permalink
Dan,

Did you try importing without using a cache file? (You can mv /etc/zfs/zpool.cache out of the way before the import).

I've seen this sort of panic when the cache file is out of sync with the MOS copy of the config. The import code trusts the passed-in copy, and if it happens to have fewer vdevs than the MOS copy you can encounter a NULL dereference when a block from the missing vdev is referenced. Not sure this is your situation, but thought I'd pass it along in case it helps.

If your rpool got rolled back to an earlier state, the pool config cache might have fewer vdevs. Or under HA, if the config cache file is stale.

Cheers,

Don
Dan McDonald via illumos-zfs
2014-07-26 01:20:07 UTC
Permalink
Post by Don Brady via illumos-zfs
Dan,
Did you try importing without using a cache file? (You can mv /etc/zfs/zpool.cache out of the way before the import).
I don't think that's the issue.

I managed to import with -o readonly=on. Once I stopped the scrub, I could import/export just fine.

The act of scrubbing is what causes my panic.

I just finished running zdb -bc <pool> and got an assertion failure:

zio.c, line 2863 (NOTE: This is with OmniOS r151008 from late 2013)
bp->blk_pad[0] == 0 failed

I'm going to look at the coredump now and see if anything from my kernel panic dump matches up with it.

Thanks!
Dan
George Wilson via illumos-zfs
2014-07-26 13:03:30 UTC
Permalink
This is a clear sign of on-disk corruption and I would not take this
pool into read/write mode without understanding its extent. The original
panic shows a top-level vdev that doesn't exist but the rest of the
blkptr looks valid. Your 'zdb' run is pointing to a different problem
and I suspect a different blkptr. I would try to get a complete run of
'zdb' so that you can see how badly corrupt the pool really is.
Unfortunately most of these cases end up with you needing to copy the
data off to a new pool.

- George
Dan McDonald via illumos-zfs
2014-07-27 02:35:14 UTC
Permalink
Post by George Wilson via illumos-zfs
This is a clear sign of on-disk corruption and I would not take this pool into read/write mode without understanding its extent. The original panic shows a top-level vdev that doesn't exist but the rest of the blkptr looks valid. Your 'zdb' run is pointing to a different problem and I suspect a different blkptr. I would try to get a complete run of 'zdb' so that you can see how badly corrupt the pool really is. Unfortunately most of these cases end up with you needing to copy the data off to a new pool.
175c8cd80::blkptr
DVA[0]=<1:26037e0186000000:2fdc00>
DVA[1]=<1:26037e0186000000:2fe200>
DVA[2]=<1:26037e0186000000:8ae400>
[L0 NONE] INHERIT ON BE contiguous unique triple
size=8ea400L/200P birth=4294985006L/5349944420663296P fill=5349944420663296
cksum=1000017ea:1301bf40830000:bd04c908d50:1b812e0ab12c6
175c8cd80::print -at blkptr_t
175c8cd80 blkptr_t {
175c8cd80 dva_t [3] blk_dva = [
175c8cd80 dva_t {
175c8cd80 uint64_t [2] dva_word = [ 0x1000017ee, 0x1301bf00c30000 ]
},
175c8cd90 dva_t {
175c8cd90 uint64_t [2] dva_word = [ 0x1000017f1, 0x1301bf00c30000 ]
},
175c8cda0 dva_t {
175c8cda0 uint64_t [2] dva_word = [ 0x100004572, 0x1301bf00c30000 ]
},
]
175c8cdb0 uint64_t blk_prop = 0x100004751
175c8cdb8 uint64_t [2] blk_pad = [ 0x1301bf00c30000, 0x1000017f3 ]
175c8cdc8 uint64_t blk_phys_birth = 0x1301bf00c30000
175c8cdd0 uint64_t blk_birth = 0x10000452e
175c8cdd8 uint64_t blk_fill = 0x1301bf00c30000
175c8cde0 zio_cksum_t blk_cksum = {
175c8cde0 uint64_t [4] zc_word = [ 0x1000017ea, 0x1301bf40830000,
0xbd04c908d50, 0x1b812e0ab12c6 ]
}
}
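For what it's worth, even a crude pad check (a stand-alone sketch using the values above, not the actual zio.c assertion) flags this blkptr the same way the zdb run did -- the padding words are never written with anything but zero:

#include <stdio.h>
#include <stdint.h>

int
main(void)
{
	/* blk_pad words copied from the ::print output above */
	uint64_t blk_pad[2] = { 0x1301bf00c30000ULL, 0x1000017f3ULL };

	if (blk_pad[0] != 0 || blk_pad[1] != 0)
		printf("blk_pad is nonzero -- this is not a sane blkptr\n");
	return (0);
}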

So I'm guessing the path forward is, "recover all the data you can, and rebuild the pool," right?

Thanks,
Dan
