Discussion:
mask pci-e errors
jason matthews
2014-01-13 17:59:20 UTC
I have 40 identically configured systems that catch the PCI-e error below. About every six months, give or take, they go through a cycle where all forty generate this error within about three weeks, and then they are good for months. Bad juju.

The systems are Intel SR2625URLXR chassis with 9207-8i, Intel 910, and 9205-8e cards on L5630 CPUs with 96 GB of RAM. The result of the failure is that zfs and zpool commands hang on the Intel 910 card. Regular file system disk I/O is okay, but zpool and zfs commands hang.

I am looking for a workaround, as the storage continues to work for applications despite the error. Perhaps the error could be masked before FMD takes action? Maybe ZFS gets internally hosed before FMD takes action, I don't know. The hang-up seems to be in ZFS, where the system thinks the storage is hosed and zfs/zpool commands hang. As I say, regular file system I/Os work just peachy. Does anyone have any ideas on how to overcome this problem without rebooting?

I use clones of file systems to stand up short-lived databases to run long batch queries against, and when this happens I tend to have fairly crappy workday satisfaction.
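
For reference, the workflow is roughly this (a sketch; the pool and dataset names are made up for illustration):

# snapshot the live database dataset and clone it for a short-lived
# postgres instance to run the batch queries against
zfs snapshot data/pgdb@batch
zfs clone data/pgdb@batch data/pgdb-batch
# ... run the long batch queries against the clone ...
# then tear it down
zfs destroy data/pgdb-batch
zfs destroy data/pgdb@batch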

Perhaps this is related to:
https://www.illumos.org/issues/315

http://h20565.www2.hp.com/portal/site/hpsc/template.PAGE/public/psi/mostViewedDisplay?javax.portlet.begCacheTok=com.vignette.cachetoken&javax.portlet.endCacheTok=com.vignette.cachetoken&javax.portlet.prp_efb5c0793523e51970c8fa22b053ce01=wsrp-navigationalState%3DdocId%253Demr_na-c03652921-1%257CdocLocale%253Den_US&javax.portlet.tpst=efb5c0793523e51970c8fa22b053ce01&sp4ts.oid=4091412&ac.admitted=1389635734908.876444892.492883150

It seems Oracle may have patched similar issues.
thanks,
j.


***@db020:~# fmadm faulty -ai
--------------- ------------------------------------ -------------- ---------
TIME CACHE-ID MSG-ID SEVERITY
--------------- ------------------------------------ -------------- ---------
Jan 08 13:47:15 2a74a865-ba4e-c3b0-e437-e0e34ba53623 PCIEX-8000-0A Critical

Host : db020
Platform : S5520UR Chassis_id : ............
Product_sn :

Fault class : fault.io.pciex.device-interr
Affects : dev:////***@0,0/pci8086,***@5/pci111d,***@0/pci111d,***@4/pci1000,***@0
faulted and taken out of service
FRU : "FH PCIE-SLOT2 x8" (hc://:product-id=S5520UR:server-id=db020:chassis-id=............/motherboard=0/hostbridge=2/pciexrc=2/pciexbus=4/pciexdev=0)
faulty

Description : A problem was detected for a PCIEX device.
Refer to http://sun.com/msg/PCIEX-8000-0A for more information.

Response : One or more device instances may be disabled

Impact : Loss of services provided by the device instances associated with
this fault

Action : Schedule a repair procedure to replace the affected device. Use
fmadm faulty to identify the device or contact Sun for support.
Keith Wesolowski
2014-01-13 18:48:06 UTC
Post by jason matthews
I am looking for a workaround, as the storage continues to work for applications despite the error. Perhaps the error could be masked before FMD takes action? [...] Does anyone have any ideas on how to overcome this problem without rebooting?
An investigation here needs to start by getting to root cause. Have you
examined the ereports that led to this diagnosis? Once you do, you may
be able to figure out why they're being generated and correct the
problem, perhaps by flipping a firmware switch. This is almost
certainly due to either a hardware erratum or a firmware bug; it's not
something that software is going to just fabricate on its own.
Understanding the cause will likely lead to the solution.
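
For example (a sketch from memory; adjust the window to the diagnosis time fmadm faulty reported):

# dump every ereport fmd has logged, in verbose form
fmdump -eV | less
# or narrow to the hour around the Jan 08 13:47:15 diagnosis
fmdump -e -t "Jan 08 13:00:00" -T "Jan 08 14:00:00"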

If you really want to just "make it go away", you can delete or disable
the io-retire module. That will preclude retiring the device, which is
what's causing things to hang. Of course, that should not cause hangs,
either; it should cause failures, so it's really worthwhile to
investigate that too and file bugs.
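
Something like this (a sketch; confirm the module name against your own fmadm config output, and note that an unload only lasts until fmd restarts -- I believe the plugin itself lives under /usr/lib/fm/fmd/plugins/ if you want it gone for good):

# confirm the retire agent is loaded
fmadm config | grep -i retire
# unload it from the running fmd
fmadm unload io-retire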

I believe there are also some parameters you can use to make the kernel
ignore these errors, but I no longer have my notes on that. Sorry for
being vague.
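
If they're the knobs I'm half-remembering, they are the AER mask variables in the pcie module, settable from /etc/system; treat the names and the value below as assumptions to verify against the source tree, not a recipe:

# sketch: verify the variable exists in your build first, e.g.
#   echo 'pcie_aer_ce_mask/X' | mdb -k
# then mask correctable PCIe AER errors at attach time:
echo 'set pcie:pcie_aer_ce_mask = 0xffffffff' >> /etc/system
# (there is a pcie_aer_uce_mask for uncorrectable errors as well)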
Richard Elling
2014-01-14 01:50:14 UTC
Post by jason matthews
I have 40 identically configured systems that catch the PCI-e error below. [...]
The systems are Intel SR2625URLXR chassis with 9207-8i, Intel 910, and 9205-8e cards on L5630 CPUs with 96 GB of RAM. [...]
That server has a number of options for active and passive PCIe risers, and not all slots are created equal. Depending on the riser, you might have to change slot configurations to clear the fault. It is possible to look at the fmtopo output and determine how many PCI bridges are in the way.
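
For example (fmtopo's stock location; the pciexrc=2 comes from the FRU string in your fmadm output):

# walk the libtopo snapshot; each pciexbus=N/pciexdev=N pair past the
# root complex on the path to the slot is another bridge hop
/usr/lib/fm/fmd/fmtopo | grep 'pciexrc=2'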
— richard
jason matthews
2014-01-14 04:52:06 UTC
Post by Richard Elling
That server has a number of options for active and passive PCIe risers, and not all slots are created equal. [...]
Thanks Richard.

I am using the active riser to get all the HBAs I need. The passive riser only supports two or three PCIe slots. In the interest of complete transparency, I replaced the active midplane with the passive midplane for the drives.

This is actually a tricky wick. I forgot to mention those systems also have DDRdrive X1s. The current configuration is the only one that allowed all of the PCIe cards to work together. The tipping point was adding the Intel 910. I literally had to create a truth table of all the possible permutations and test them to find one that worked. Other configurations resulted in missing HBAs or other, more erratic behavior.



j.


Rich
2014-01-14 12:12:42 UTC
What's the power draw on the X1 + 910 + HBAs? I wonder if you're trying to
suck more power from the PCIe slots than is being provided, at peak,
leading to..."exciting" results.

- Rich
jason matthews
2014-01-14 21:41:08 UTC
Post by Rich
What's the power draw on the X1 + 910 + HBAs? I wonder if you're trying to suck more power from the PCIe slots than is being provided, at peak, leading to..."exciting" results.
That is an interesting idea, but I think I am okay there.

X1 draw is about 10w max
9205-8e is about 15w max
9211-8i is about 15w max
910 is about 25w max

That puts me at 65w max for my add-ons.

Relative to some video cards that doesn't seem like much, and that is probably the max for low-end consumer boards. However, I can't find any specifications for how much power can be delivered through the PCI-e bus on the S5520UR motherboard. It is not exactly a cheap consumer piece of crap; the thing is really well made. There are, however, published specs for total power consumption per voltage rail, which seem to indicate there would be plenty of power available. The power draw for the whole server (less the external disk shelves) is sub-255w (or 14-16% per p/s for the inventory), and the thermal margins on all the sensors are excellent.

What do you think? I don't have a good grasp of what typical draw is on a server board in these cases. Maybe I can engage Intel.

j.
Garrett D'Amore
2014-01-15 00:27:11 UTC
You may be pushing it.

25W is the maximum for legacy PCI cards. PCIe can provide up to 75W without relying on external power connectors, but that's for an x16 card. x4 and x8 cards are limited to 25W, and x1 cards are limited to 10W for half height, possibly up to 25W for full height.
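
As a back-of-envelope check against the draws you listed (a sketch; both the draws and the limits are approximate, and I'm assuming x8 slots for the HBAs and the 910):

#!/bin/sh
# per-card peak draw from this thread vs. assumed per-slot CEM limit
for entry in "DDRdrive-X1 10 25" "9205-8e 15 25" "9211-8i 15 25" "Intel-910 25 25"; do
  set -- $entry
  echo "$1: ~${2}W of a ${3}W slot limit, $(($3 - $2))W headroom"
done

Per those numbers, the 910 sits right at the x8 ceiling.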

Video cards use separate 6- or 8-pin connectors to get 75W or 150W to the card. That's not applicable here, because that power comes directly from the PSU rather than over the mainboard.

My guess is that you're probably drawing near the limit of what this mainboard can handle. While normally this shouldn't be a problem, if you're at or above max for all the devices in the system, I would not be surprised if you are over power at peak load. If the mainboard is at all marginal, this can make things even worse.

I'd look to see if you can move to a mainboard with more slots (giving a higher total power budget that it should be able to support), or reduce or eliminate one or more of the options.

I'm surprised to see *both* the X1 and the 910, in particular. I imagine the X1 is your SLOG, in which case you are probably using the 910 for L2ARC? If so, why not just put that on normal SAS-connected SSDs? It would probably have been cheaper, and since they have separate power supplies (SAS/SATA power is not routed over the data bus, but uses separate power lines that go right to the PSU), you'd be free of this problem.

Anyway, my thought is that you're over power, or so close to the limit of
the specification that peak loads are killing you.
jason matthews
2014-01-15 04:06:08 UTC
On Jan 14, 2014, at 4:27 PM, "Garrett D'Amore" <***@damore.org> wrote:

Thanks for the response.
I'm surprised to see *both* the X1 and the 910, in particular. I imagine the X1 is your SLOG, in which case you are probably using the 910 for L2ARC? If so, why not just put that on normal SAS-connected SSDs? [...]
The systems run multiple master and slave Postgres instances, one per zone. The masters run on the 910s, and the slaves run against two NDS-2441 disk shelves that they share with a sister system. Each system gets half of each shelf, configured as mirrors: twelve mirror pairs of 15k RPM drives per system.

The systems were initially deployed with the X1 as SLOG for the spinning rust. Over time, the spinning rust was overrun by I/O; the latency just got too high. Postgres now does in excess of 1M inserts/hour/zone and is limited by the software used to feed it the data. The 910s were added, the masters were migrated to them, and the slaves stayed on the shelves.

The 910s are now running out of space, and I will be moving to a mirrored DC S3700 setup as soon as I can complete a data center expansion. Basically, I am an efficient (cheap?) bastard and cram the most into a system that I can. The power whips all run at exactly 80%; nothing goes to waste ;-)
Anyway, my thought is that you're over power, or so close to the limit of the specification that peak loads are killing you.
I guess it is possible. The power budget section of the motherboard technical spec would lead one to a different conclusion.

I have support contracts on the gear in Ashburn. I will open a case on one of those systems and see if I can get a determination from Intel.

thanks,
j.

