Rich via illumos-zfs
2014-09-29 14:12:49 UTC
Hi all,
So, it turns out this is (AFAICS) unrelated to running zfs send; it
will eventually trigger under normal use of the pool, whether or not
zfs send/recv is involved.
Most of what I laughably call "notes" is in illumos #5065; I'm
presently trying to disentangle what's going on in the trace...
It looks like it's doing space accounting for the I/O, and either
some of the metadata is mangled or there's an edge case in the space
calculation; I'm not sure which.
I'm unsure how to take the debugging further: I've found the specific
dataset that is always implicated (by printing the dsl_dir_t at
dsl_dir_diduse_space), but I don't know what the logical flaw is, or
how to see what the filesystem is trying to do when it hits this
condition.
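In case it helps, this is roughly what I've been doing in mdb against the dump; the dsl_dir_t address is the first argument to dsl_dir_transfer_space in the stack below, and the member names are from my reading of the source, so double-check them against your gate:

```
> ::status
> ::stack
> ffffff23d1fb6b40::print dsl_dir_t dd_myname dd_object dd_parent
> ffffff23d1fb6b40::print dsl_dir_t dd_phys | ::print dsl_dir_phys_t dd_used_bytes dd_used_breakdown
```

Printing dd_used_breakdown for the implicated dir is how I've been trying to see whether one of the buckets is smaller than the delta being transferred.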
Would anyone be able to tell me how to dig into this further? To be
honest, the amount of state and information I need to digest just to
figure out where to look is a bit overwhelming, and I'm not finding
any obvious leads...
Thanks,
- Rich
Hi all,
So, I've got a nice storage system that's been running for several
years on OI, relatively stably apart from a few hiccups in mpt_sas.
Recently, I had occasion to update to illumos-gate 20140723 after a
number of problems I'd encountered with mpt_sas panics on drive
failures. Those panics appear to have stopped...
...but I can now panic the system by doing a zfs send.
https://www.illumos.org/issues/5065 is the report I filed; it includes
a dump link in the first comment. I'll reproduce ::stack and
::panicinfo here for anyone who is searching on this rather than
clicking the link...
::stack
vpanic()
0xfffffffffbe0bf48()
dsl_dir_transfer_space+0xdd(ffffff23d1fb6b40, 4440, 3, 2, fffffff4cd8acc00)
dsl_dir_diduse_space+0x1b4(ffffff25008dd240, 0, 4440, 1000, 1000,
fffffff4cd8acc00)
dsl_dataset_block_born+0x196(ffffff25004d7c40, ffffffa000cf6380,
fffffff4cd8acc00)
dbuf_write_done+0x429(ffffff2d96943c28, ffffff464fdfa970, ffffffae0074bd48)
arc_write_done+0x20f(ffffff2d96943c28)
zio_done+0x422(ffffff2d96943c28)
zio_execute+0xd5(ffffff2d96943c28)
zio_notify_parent+0x113(ffffff2d96943c28, ffffff9ed0f20168, 1)
zio_done+0x480(ffffff9ed0f20168)
zio_execute+0xd5(ffffff9ed0f20168)
zio_notify_parent+0x113(ffffff9ed0f20168, ffffff5bfef380f0, 1)
zio_done+0x480(ffffff5bfef380f0)
zio_execute+0xd5(ffffff5bfef380f0)
zio_notify_parent+0x113(ffffff5bfef380f0, ffffff5194f68c08, 1)
zio_done+0x480(ffffff5194f68c08)
zio_execute+0xd5(ffffff5194f68c08)
taskq_thread+0x318(ffffff23c11f7e98)
thread_start+8()
::panicinfo
cpu 9
thread ffffff00f7519c40
message assertion failed: delta > 0 ?
../../common/fs/zfs/dsl_dir.c, line: 1395
rdi fffffffffbf53ca8
rsi ffffff00f7519300
rdx fffffffff78f702c
rcx 573
r8 0
r9 fffffff4cd8acc00
rax ffffff00f7519320
rbx 4440
rbp ffffff00f7519360
r10 ffffffffffffbbc0
r11 0
r12 ffffff23d1fb6b40
r13 3
r14 2
r15 ffffff23d1fb6b90
fsbase 0
gsbase ffffff2330a43580
ds 4b
es 4b
fs 0
gs 1c3
trapno 0
err 0
rip fffffffffb869f10
cs 30
rflags 286
rsp ffffff00f75192f8
ss 38
gdt_hi 0
gdt_lo 700001ef
idt_hi 0
idt_lo 90000fff
ldt 0
task 70
cr0 8005003b
cr2 81052b4
cr3 4000000
cr4 6f8
Has anyone else encountered this?
Thanks!
- Rich