Discussion:
Panic upon zpool import?
Dan McDonald
2014-04-09 20:57:22 UTC
Permalink
Pardon the screenshots, but it's the best I can do with this damned console.

We had a system panic, and now the system panics on every import. We have another machine configured just like it, so I can answer questions about pools, filesystems, etc.

I've attached four kmdb screen shots. The arc_buf_hdr_t seems suspect to me, and seems to cause the panic.

I can provide more information on request. I'm on irc under 'danmcd' on #illumos as well.

Thanks,
Dan



-------------------------------------------
illumos-zfs
Archives: https://www.listbox.com/member/archive/182191/=now
RSS Feed: https://www.listbox.com/member/archive/rss/182191/23047029-187a0c8d
Modify Your Subscription: https://www.listbox.com/member/?member_id=23047029&id_secret=23047029-2e85923f
Powered by Listbox: http://www.listbox.com
George Wilson
2014-04-09 21:14:09 UTC
Permalink
Dan,

The size is the problem. This is trying to allocate a buf that is 846K
but the largest block we support is 128K. This causes us to blow up in
kmem_cache_alloc. Definitely need to find out where this arc_buf_hdr_t
came from.

Thanks,
George
Ahmed Kamal
2014-04-09 21:20:25 UTC
Permalink
wrt the panic message, the Linux crew were tinkering with the cool idea of
encoding panics into a QR code (possibly with submitting to an online
backend). Sounds like a cool and useful idea, just throwing this out there
if any Illumos hacker wants to take a shot :)
Saso Kiselkov
2014-04-09 21:31:45 UTC
Permalink
Post by Ahmed Kamal
wrt the panic message, the Linux crew were tinkering with the cool idea
of encoding panics into a QR code (possible with submitting to an online
backend). Sounds like a cool and useful idea, just throwing this out
there if any Illumos hacker wants to take a shot :)
I've been thinking about this and it's a nice concept, but kind of
useless for Illumos, seeing as we have real crash dump support.
Naturally, if the dump device is on ZFS and the crash occurred in ZFS,
then it's kind of a no-go, but in general, Illumos doesn't need this
facility. Perhaps a simple work-around for the limitation of
dump-to-ZFS-without-ZFS would be better. Just thinking...

Cheers,
--
Saso
Matthew Ahrens
2014-04-09 22:06:59 UTC
Permalink
Post by Saso Kiselkov
Post by Ahmed Kamal
wrt the panic message, the Linux crew were tinkering with the cool idea
of encoding panics into a QR code (possible with submitting to an online
backend). Sounds like a cool and useful idea, just throwing this out
there if any Illumos hacker wants to take a shot :)
I've been thinking about this and it's a nice concept, but kind of
useless for Illumos, seeing as we have real crash dump support.
Naturally, if the dump device is on ZFS and the crash occurred in ZFS,
then it's kind of a no-go, but in general, Illumos doesn't need this
facility. Perhaps a simple work-around for the limitation of
dump-to-ZFS-without-ZFS would be better. Just thinking...
In general, you can get dumps on ZFS even when ZFS crashes. When the
system boots, it "dumpifys" the dump zvol, recording all its DVAs in
memory. Then when we dump, we just look at the already-recorded DVAs,
ignoring the rest of the ZFS code. If Dan's busted pool was not the root
pool, his dump would be readable just fine.

--matt
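The idea Matt describes (record the dump zvol's locations at boot, then write the dump to those locations without going through the filesystem code) can be illustrated with a toy model. This is only a sketch of the concept: `Extent`, `ToyDumpDevice`, `dumpify` and `write_dump` are hypothetical names, and an in-memory `bytearray` stands in for the disk. It is not how the illumos dump code is written.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Extent:
    """A pre-recorded location on the raw device (hypothetical stand-in for a DVA)."""
    offset: int
    length: int


class ToyDumpDevice:
    def __init__(self, disk: bytearray):
        self.disk = disk
        self.extents = []  # filled once, while the storage stack is still healthy

    def dumpify(self, extents):
        """Remember where the dump area lives so no lookup is needed later."""
        self.extents = list(extents)

    def write_dump(self, payload: bytes) -> int:
        """Copy payload into the recorded extents only; return bytes written."""
        written = 0
        for ext in self.extents:
            if written >= len(payload):
                break
            chunk = payload[written:written + ext.length]
            self.disk[ext.offset:ext.offset + len(chunk)] = chunk
            written += len(chunk)
        return written


disk = bytearray(64)
dev = ToyDumpDevice(disk)
dev.dumpify([Extent(8, 4), Extent(32, 8)])   # done at "boot"
print(dev.write_dump(b"PANICDATA!"))         # done at "panic" time
```

The point of the design is that the panic-time path depends only on a list of offsets captured earlier, so a bug in the filesystem code does not prevent the dump from being written.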
Dan McDonald
2014-04-09 22:14:41 UTC
Permalink
Post by Matthew Ahrens
In general, you can get dumps on ZFS even when ZFS crashes. When the system boots, it "dumpifys" the dump zvol, recording all its DVAs in memory. Then when we dump, we just look at the already-recorded DVAs, ignoring the rest of the ZFS code. If Dan's busted pool was not the root pool, his dump would be readable just fine.
It's the data pool that's corrupt, not rpool.

Dan
Matthew Ahrens
2014-04-09 23:05:33 UTC
Permalink
Post by Dan McDonald
Post by Matthew Ahrens
In general, you can get dumps on ZFS even when ZFS crashes. When the
system boots, it "dumpifys" the dump zvol, recording all its DVAs in
memory. Then when we dump, we just look at the already-recorded DVAs,
ignoring the rest of the ZFS code. If Dan's busted pool was not the root
pool, his dump would be readable just fine.
It's the data pool that's corrupt, not rpool.
Then you should be able to boot "-m milestone=none" (append that to the
kernel line in grub), and run savecore to get the crash dump. Or reproduce
the bug by running zpool import and then getting the crash dump. You may
need to remove /etc/zfs/zpool.cache when you are in single-user mode to
ensure you get far enough that the dump device is configured before you
crash again.

--matt
Dan McDonald
2014-04-09 21:26:06 UTC
Permalink
Post by George Wilson
Dan,
The size is the problem. This is trying to allocate a buf that is 846K but the largest block we support is 128K. This causes us to blow up in kmem_cache_alloc. Definitely need to find out where this arc_buf_hdr_t came from.
I didn't want to say the size, but I'm glad to hear you say it.

Two more screenshots --> one with the whole stack, sans arguments. It looks like it read something off the disk. The second is me going up the stack to arc_read, where the blkptr_t (arg2 = 0xffffff3476c81000) seems to have *text* installed where there should be perhaps more binary-coded data like a proper size?

Dan
Dan McDonald
2014-04-09 21:54:37 UTC
Permalink
Post by Dan McDonald
Two more screenshots --> one with the whole stack, sans arguments. It looks like it read something off the disk. The second is me going up the stack to arc_read, where the blkptr_t (arg2 = 0xffffff3476c81000) seems to have *text* installed where there should be perhaps more binary-coded data like a proper size?
Yeah, and the text appears to be byteswapped in 16-bit chunks. "INTEL SSD" is "NIET LSS...", for example.

Dan
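The transformation Dan describes is easy to reproduce. The following is a minimal sketch, with `swap16` as a hypothetical helper name, that swaps the two bytes of every 16-bit word; running it twice restores the original string.

```python
def swap16(data: bytes) -> bytes:
    """Swap the two bytes of each 16-bit word; a trailing odd byte is left alone."""
    out = bytearray(data)
    for i in range(0, len(out) - 1, 2):
        out[i], out[i + 1] = out[i + 1], out[i]
    return bytes(out)


swapped = swap16(b"INTEL SSD")
print(swapped[:8])          # the first eight bytes match the "NIET LSS" seen in kmdb
print(swap16(swapped))      # swapping again gives back the readable string
```

Applying the same function to bytes copied out of the suspect blkptr_t would show whether the rest of the buffer is also readable text once the pairs are swapped back.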
Dan McDonald
2014-04-09 21:55:19 UTC
Permalink
Ahhh, one more thing:

arc.c: line 2970 --> size is computed from the blkptr_t. Should there be an EIO returned if size is insane, as a preventative?!

Dan
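The kind of guard Dan is asking about can be modeled outside the kernel. This is a sketch in Python rather than the C in arc.c, and it is not the actual code at that line: `checked_block_size` is a hypothetical function, and the constant names are borrowed from the illumos headers as I recall them, with the 128K limit taken from George's message.

```python
import errno

SPA_MINBLOCKSHIFT = 9              # assumed: sizes are multiples of 512 bytes
SPA_MAXBLOCKSIZE = 128 * 1024      # largest supported block, per the thread


def checked_block_size(size: int) -> int:
    """Return size if it is plausible, otherwise raise OSError with EIO."""
    sector = 1 << SPA_MINBLOCKSHIFT
    if size <= 0 or size > SPA_MAXBLOCKSIZE or size % sector != 0:
        raise OSError(errno.EIO, f"implausible block size {size}")
    return size


for candidate in (128 * 1024, 846 * 1024):
    try:
        print(candidate, "ok", checked_block_size(candidate))
    except OSError as err:
        print(candidate, "rejected with errno", err.errno)
```

With a check like this in the read path, the 846K size from the corrupt blkptr_t would surface as an I/O error on that block instead of reaching the allocator.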
Dale Ghent
2014-04-09 22:07:58 UTC
Permalink
Post by Dan McDonald
Post by Dan McDonald
Two more screenshots --> one with the whole stack, sans arguments. It looks like it read something off the disk. The second is me going up the stack to arc_read, where the blkptr_t (arg2 = 0xffffff3476c81000) seems to have *text* installed where there should be perhaps more binary-coded data like a proper size?
Yeah, and the text appears to be byteswapped in 16-bit chunks. "INTEL SSD" is "NIET LSS...", for example.
For what it’s worth to the readers here, this is the VID and PID of one of the drives in the zpool. Pretty funky?

/dale
Richard Elling
2014-04-11 17:31:52 UTC
Permalink
Post by Dan McDonald
Post by Dan McDonald
Two more screenshots --> one with the whole stack, sans arguments. It looks like it read something off the disk. The second is me going up the stack to arc_read, where the blkptr_t (arg2 = 0xffffff3476c81000) seems to have *text* installed where there should be perhaps more binary-coded data like a proper size?
Yeah, and the text appears to be byteswapped in 16-bit chunks. "INTEL SSD" is "NIET LSS...", for example.
For what it’s worth to the readers here, this is the VID and PID of one of the drives in the zpool. Pretty funky?
sd tends to keep this sort of inquiry data in memory, so it might be happenstance that you end up there.
If it was busted firmware (wouldn't be the first time :-() then it should fail the checksum.
-- richard
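Richard's point is that a block overwritten on disk would normally be caught when its checksum is compared with the value stored in the parent block pointer. Fletcher-4 is one of the checksums ZFS uses; the thread does not say which one this pool used, so the following is only an illustrative sketch of that algorithm over little-endian 32-bit words, with `fletcher4` as a local helper name.

```python
import struct

MASK64 = (1 << 64) - 1  # the four accumulators wrap at 64 bits


def fletcher4(data: bytes):
    """Return the four running sums (a, b, c, d) over 32-bit little-endian words."""
    if len(data) % 4:
        raise ValueError("length must be a multiple of 4 bytes")
    a = b = c = d = 0
    for (word,) in struct.iter_unpack("<I", data):
        a = (a + word) & MASK64
        b = (b + a) & MASK64
        c = (c + b) & MASK64
        d = (d + c) & MASK64
    return a, b, c, d


good = struct.pack("<II", 1, 2)
bad = b"NIET LSS"                  # same length, but drive-ID text instead of data
print(fletcher4(good))
print(fletcher4(good) == fletcher4(bad))
```

If the stored checksum had been computed over the original contents, text like this landing in the block on disk would produce a mismatch and an error on read, which is why Richard suspects the bad data arrived in memory some other way.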

--

***@RichardElling.com
+1-760-896-4422
Marion Hakanson
2014-04-10 01:49:13 UTC
Permalink
Post by Ahmed Kamal
wrt the panic message, the Linux crew were tinkering with the cool idea of
encoding panics into a QR code (possible with submitting to an online
backend). Sounds like a cool and useful idea, just throwing this out there if
any Illumos hacker wants to take a shot :)
Doesn't sound so cool to me.

Does this mean one would be required to have a smart-phone in order
to read a panic message?

If this approach does become common, I hope the old-school plain-text
panic messages will be retained, for those of us who prefer serial
consoles (whether actual RS232 or Serial-Over-LAN), or who have
hardware without some form of graphics.

Regards,

Marion
Garrett D'Amore
2014-04-10 05:12:35 UTC
Permalink
To be honest, I thought of this QR code as mostly a joke. I’m a kernel developer, and I have never used QR codes for *anything*. I’d be vehemently opposed (as I think most kernel devs would be of a similar mind) to anything that makes getting access to kernel debug information even more indirect. Trying to debug an illumos panic on a mobile phone sounds silly.
-- 
Garrett D'Amore
Sent with Airmail
Stuart Remphrey
2014-04-10 08:47:17 UTC
Permalink
Re: QR codes

Would the idea be to push the core dump (linux net dump) somewhere
accessible, then include the URL and QR code for that URL in the stack dump
display? That would be more-or-less reasonable.

Would still need the ASCII terminal / (virtual) serial line support though!
:-)


Rgds, Stuart.