Discussion:
ZFS pool failing on Dell MD1200
Edwards, Nick, Vodafone Group
2013-11-07 16:24:35 UTC
Hi,

I am using OmniOS on a Dell PowerEdge 1950 along with a Dell H800 RAID controller (LSI MegaRAID SAS 2108) and a Dell MD1200 12-disk enclosure.

10 of the disks are each configured as a single-disk RAID 0 volume (as there is no option for JBOD) and added to a raidz pool. Various ZFS file systems have then been created. We are using the server for backups and some file storage.

Everything seems fine; however, every week or two the whole MD1200 enclosure seems to get disconnected, and I cannot get it back without rebooting the server. I get the following messages in the logs over and over again when this happens:

Nov 7 14:33:49 omnios mr_sas: [ID 270009 kern.warning] WARNING: io_timeout_checker: FW Fault, calling reset adapter
Nov 7 14:33:49 omnios mr_sas: [ID 643100 kern.notice] io_timeout_checker: fw_outstanding 0x1 max_fw_cmds 0x3EF
Nov 7 14:33:49 omnios mr_sas: [ID 218520 kern.warning] WARNING: mrsas_reset_ppc: no more resets as HBA has been marked dead
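
For anyone wanting to spot this pattern in their own logs, here is a minimal Python sketch that scans syslog text for the three messages above and decodes the hex counters. The function name, the regular expression, and the shape of the summary are illustrative assumptions written only against these sample lines; they are not part of mr_sas or OmniOS.

```python
import re

# Sample lines copied from the log excerpt above.
SAMPLE_LOG = """\
Nov 7 14:33:49 omnios mr_sas: [ID 270009 kern.warning] WARNING: io_timeout_checker: FW Fault, calling reset adapter
Nov 7 14:33:49 omnios mr_sas: [ID 643100 kern.notice] io_timeout_checker: fw_outstanding 0x1 max_fw_cmds 0x3EF
Nov 7 14:33:49 omnios mr_sas: [ID 218520 kern.warning] WARNING: mrsas_reset_ppc: no more resets as HBA has been marked dead
"""

# Hypothetical pattern, matched only against the message format shown above.
COUNTERS = re.compile(r"fw_outstanding (0x[0-9A-Fa-f]+) max_fw_cmds (0x[0-9A-Fa-f]+)")


def summarize_mr_sas(log_text):
    """Count firmware-fault events and decode the hex counters (illustrative helper)."""
    summary = {"fw_faults": 0, "marked_dead": False, "counters": []}
    for line in log_text.splitlines():
        if "mr_sas:" not in line:
            continue  # ignore messages from other drivers
        if "FW Fault" in line:
            summary["fw_faults"] += 1
        if "marked dead" in line:
            summary["marked_dead"] = True
        match = COUNTERS.search(line)
        if match:
            # The driver prints both counters in hexadecimal.
            outstanding, max_cmds = (int(value, 16) for value in match.groups())
            summary["counters"].append((outstanding, max_cmds))
    return summary


if __name__ == "__main__":
    print(summarize_mr_sas(SAMPLE_LOG))
```

On the sample lines this reports one firmware fault, the adapter marked dead, and counters of 1 and 1007 (0x3EF). If the field names mean what they suggest, only a single command was outstanding when the timeout fired, which would point at a hung controller rather than an overloaded one.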

I am really not sure what the problem is here, and it is very frustrating. Does anyone have any ideas?

Many thanks,

Nick



-------------------------------------------
illumos-zfs
Archives: https://www.listbox.com/member/archive/182191/=now
RSS Feed: https://www.listbox.com/member/archive/rss/182191/23047029-187a0c8d
Modify Your Subscription: https://www.listbox.com/member/?member_id=23047029&id_secret=23047029-2e85923f
Powered by Listbox: http://www.listbox.com
Saso Kiselkov
2013-11-07 16:31:59 UTC
Seems like the Dell H800 controller is giving up the ghost here (its
firmware appears to hang). Can you try swapping it out with another one?
This being HW RAID, you obviously can't just replace it with an entirely
different HBA.

Cheers,
--
Saso
Dan McDonald
2013-11-07 16:39:08 UTC
io_timeout_checker: FW Fault, calling reset adapter
fw_outstanding 0x1 max_fw_cmds 0x3EF
mrsas_reset_ppc: no more resets as HBA has been marked dead

You're timing out I/Os and this adapter was marked dead previously (likely by
other timeouts). Because the 2108 is HW-RAID only, you're unfortunately
dependent on the card itself to tell you what's going on. I wonder what the
BIOS would say?

And you mentioned the enclosure is getting disconnected --> because it's
HW-RAID (even single-disk RAID0), the controller takes it upon itself to
offline the drives or enclosure.

I'd HIGHLY recommend moving off the 2108 when you can. I know that's not a
great answer, but there's a reason Nexenta (a storage company) doesn't
recommend 2108s for data pools, and that's because HW-RAID tries to do too
much.

Dan
Rocky Shek
2013-11-07 18:01:48 UTC
I would recommend using the LSI 9207-8E:

http://www.lsi.com/products/host-bus-adapters/pages/lsi-sas-9207-8e.aspx

Rocky
Edwards, Nick, Vodafone Group
2013-11-12 15:33:46 UTC
Thank you all for your responses, very helpful.

After looking, it appears that there is no IT firmware for this card. As this is the case, I will persist with it but will use the hardware RAID and configure it as RAID 5. I also have a brand new spare card, which I may swap in just in case there is an issue with the current card.

I appreciate that by using hardware RAID 5 on the card I will lose the integrity and error-checking functionality of ZFS, but this is a sacrifice I am willing to make as long as everything else works fine.

One last thing: I have just noticed that OmniOS has reverted to an old version of mr_sas for stability reasons. I don't know if this is related at all; probably not.

Thanks,

Nick

Eric Sproul
2013-11-12 16:21:54 UTC
Post by Edwards, Nick, Vodafone Group
Last thing I have just noticed that OmniOS has reverted to an old version of mr_sas for stability reasons, don't know if this is related at all, probably not.
If you're referring to
https://github.com/omniti-labs/illumos-omnios/commit/f1a23f88c2f9737b3260bf9a251f974d2e3e3db9
that is an upstream change that backs out an old wad of locking
changes to mpt_sas, not mr_sas.

The illumos issue with explanation is https://www.illumos.org/issues/4013

Nothing has changed in mr_sas, and both that and mpt_sas are
unmodified by OmniTI from upstream illumos.

Eric
Saso Kiselkov
2013-11-12 16:32:47 UTC
Since you mention RAID 5, I assume you're going to be rebuilding your
pool. If so, I strongly recommend looking into substituting a pure HBA
in place of the HW RAID card. You'll get all the benefits of ZFS with
none of the drawbacks of HW RAID, plus the HBAs are much cheaper (on the
order of $80-100 if you buy through a good channel):
http://accessories.euro.dell.com/sna/productdetail.aspx?c=uk&l=en&s=bsd&cs=ukbsdt1&sku=405-11482
(make sure you get the right PCI bracket, they're separate SKUs)
--
Saso
Schlacta, Christ
2013-11-07 16:38:29 UTC
Try updating the firmware. Failing that, try flashing it to IT mode. If
updating the firmware fails and IT mode either fails or isn't an option,
try calling Dell technical support.
Keith Wesolowski
2013-11-07 17:03:50 UTC
Post by Schlacta, Christ
Try updating the firmware. Failing that, try flashing it to IT mode. If
updating the firmware fails and IT mode either fails or isn't an option,
If you flash it to IT mode (which I believe is unsupported and will void
your warranty), you should be aware that you'll almost certainly lose
all data on the pool. RAID HBAs tend to lay out data on disk
differently -- at minimum there will be extra headers, etc. Only do
this if data loss is acceptable or if you have experience reflashing
this particular card in this particular configuration and recovering the
data afterward.
Post by Schlacta, Christ
try calling Dell technical support.
Or you could sacrifice a goat. It won't help either, but unlike calling
Dell, this option allows for the possibility of making a delicious
birria later on.
Schlacta, Christ
2013-11-07 17:34:16 UTC
I don't know how it works in the business world, but in the private sector,
when you call support for warranty issues on hardware that's misbehaving,
the established practice is to drop-ship you a replacement piece that's
either tested and working or brand new, in the next outbound shipment. I
know Dell does this, as I've replaced several parts this way.



Ray Van Dolson
2013-11-07 18:04:05 UTC
The key for us has been to make sure we're using Dell's ProSupport on
all our gear and that our Storage Admins are PowerVault and PowerEdge
certified (easy to do online). This way we can self-dispatch (via
DOSD) on all part failures without needing to jump through Dell's
troubleshooting hoops. Dell support is often confused by the way the JBOD is
being used and insists we run their diagnostic tools on things first.
It just gets confusing. :) Using DOSD bypasses all of that!

Ray
Steven Hartland
2013-11-07 17:20:18 UTC
There are known timeout issues on older FW versions, so make sure
you're on the latest revision.

Regards
Steve