Discussion:
Retiring a slow drive
Ian Collins
2013-10-07 07:04:25 UTC
Permalink
I can't remember if the topic has come up before (I'm sure it probably
has), but I'd like to know whether any work has been done to fault a drive
with an overly long service time.

One of my systems pretty much ground to a halt today with next to no write
activity; iostat showed one drive (the pool is a stripe of mirrors)
with a service time close to half a second... This wasn't good news
for the VMs using the storage.

I was eventually able to detach the drive and attach a spare, but it
would have been nice for this to have been recognised as an error
condition. It's probably the worst kind of error in a pool!

The good thing is the system eventually recovered without a reboot.
--
Ian.
Berend de Boer
2013-10-07 07:15:35 UTC
Permalink
Ian> I was eventually able to detach the drive and attach a spare,
Ian> but it would have been nice for this to have been recognised
Ian> as an error condition. It's probably the worst kind of error
Ian> in a pool!

Slow writes are not an error. For virtual disks that drop out for a
few minutes and then recover, the best option is simply to wait.

I suggest you simply monitor service time, and when it gets beyond
your threshold, raise an alarm.
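
Something like the following would do as a starting point. This is only a
sketch (not an existing tool); it assumes the illumos "iostat -xn" layout
where asvc_t is column 8 and the device name is column 11, and the 500 ms
threshold is a placeholder:

iostat -xn 10 | awk '
    NF == 11 && $1 ~ /^[0-9.]+$/ && ($8 + 0) > 500 {
        printf("ALERT: %s asvc_t=%.1f ms\n", $11, $8)
        # hook in mailx, an SNMP trap, etc. here
    }'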

--
All the best,

Berend de Boer
Richard Elling
2013-10-07 17:44:32 UTC
Permalink
Post by Berend de Boer
Ian> I was eventually able to detach the drive and attach a spare,
Ian> but it would have been nice for this to have been recognised
Ian> as an error condition. It's probably the worst kind of error
Ian> in a pool!
Slow writes are not an error. For virtual disks which are out for a
few minutes and then recover it is the best option to simply wait.
I suggest you simply monitor service time, and when it gets beyond
your threshold, set an alarm.
The reason this needs FMA for correlation, as Keith suggests, is that
there is a (common?) failure mode in disks where some, but not all, LBAs
need additional retries. There are usually counters, like the SCSI read-write
error recovery mode page, that can be queried to see if the disk is having
difficulty. But do not be surprised if the disk behaves nicely for a lot of I/Os
and poorly for a few.

There is a lot more that can be done here, too, but it really belongs in FMA
for illumos.
-- richard


--

***@RichardElling.com
+1-760-896-4422
Garrett D'Amore
2013-10-07 18:04:52 UTC
Permalink
Agreed. This needs holistic treatment and FMA is the place to do this for illumos.

Responding to the diagnosis can be done by an FMA module that is aware of ZFS and can take the appropriate action.

Identifying the failure might be done by way of some statistical analysis of I/O patterns combined with drive diagnostic data (SMART).
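
For the drive-diagnostic side, one readily available source (not spelled out
in this message) is smartmontools, if it happens to be installed; the device
path and any -d transport option depend on the controller, so treat this
purely as an illustration:

# Illustrative only; requires smartmontools. The device path (and a
# possible -d option for the transport) varies by HBA.
smartctl -a /dev/rdsk/c2t0d0s0        # overall health, attributes, logs
smartctl -l error /dev/rdsk/c2t0d0s0  # device error log, where supported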

Sent from my iPhone
Post by Richard Elling
Post by Berend de Boer
Ian> I was eventually able to detach the drive and attach a spare,
Ian> but it would have been nice for this to have been recognised
Ian> as an error condition. It's probably the worst kind of error
Ian> in a pool!
Slow writes are not an error. For virtual disks which are out for a
few minutes and then recover it is the best option to simply wait.
I suggest you simply monitor service time, and when it gets beyond
your threshold, set an alarm.
The reason this needs FMA for correlation, as Keith suggests, is because
there is a (common?) failure mode in disks where some, but not all, LBAs
need additional retries. There are usually counters, like the SCSI read-write
error recovery mode page, that can be queried to see if the disk is having
difficulty. But do not be surprised if the disk behaves nicely for a lot of I/Os
and poorly for a few.
There is a lot more that can be done here, too, but it really belongs in FMA
for illumos.
-- richard
Richard Elling
2013-10-07 18:15:25 UTC
Permalink
Post by Garrett D'Amore
Agreed. This needs holistic treatment and FMA is the place to do this for illumos.
Responding to the diagnosis can be done by an FMA module that is aware of ZFS and can take the appropriate action.
yes, a new diagnostics engine that feeds to zfs-retire.
Unfortunately, I think the right place to put some of this is in sd.c (illumos) and we all
know that sd.c is a difficult place to work.

I'm very interested in exploring how we can provide a more portable implementation of
fault diagnosis that integrates tightly with ZFS. Thoughts?
Post by Garrett D'Amore
Identifying the failure might be done by way of some statistical analysis of io patterns combined with drive diagnostic data (SMART).
If by SMART you mean looking at the disk logs, then yes. If by SMART you think it
actually does something useful for these cases, you'll be disappointed :-(
-- richard
Post by Garrett D'Amore
Sent from my iPhone
Post by Richard Elling
Post by Berend de Boer
Ian> I was eventually able to detach the drive and attach a spare,
Ian> but it would have been nice for this to have been recognised
Ian> as an error condition. It's probably the worst kind of error
Ian> in a pool!
Slow writes are not an error. For virtual disks which are out for a
few minutes and then recover it is the best option to simply wait.
I suggest you simply monitor service time, and when it gets beyond
your threshold, set an alarm.
The reason this needs FMA for correlation, as Keith suggests, is because
there is a (common?) failure mode in disks where some, but not all, LBAs
need additional retries. There are usually counters, like the SCSI read-write
error recovery mode page, that can be queried to see if the disk is having
difficulty. But do not be surprised if the disk behaves nicely for a lot of I/Os
and poorly for a few.
There is a lot more that can be done here, too, but it really belongs in FMA
for illumos.
-- richard
Ian Collins
2013-10-07 20:11:04 UTC
Permalink
Post by Richard Elling
Post by Berend de Boer
Ian> I was eventually able to detach the drive and attach a spare,
Ian> but it would have been nice for this to have been recognised
Ian> as an error condition. It's probably the worst kind of error
Ian> in a pool!
Slow writes are not an error. For virtual disks which are out for a
few minutes and then recover it is the best option to simply wait.
I suggest you simply monitor service time, and when it gets beyond
your threshold, set an alarm.
The reason this needs FMA for correlation, as Keith suggests, is because
there is a (common?) failure mode in disks where some, but not all, LBAs
need additional retries.
I believe that was the case here; the system fell over the cliff when I
copied a 13G file into the pool.
Post by Richard Elling
There are usually counters, like the SCSI read-write
error recovery mode page, that can be queried to see if the disk is having
difficulty. But do not be surprised if the disk behaves nicely for a lot of I/Os
and poorly for a few.
In this case, with write IOPS reduced to single digits and busy at
85%, the data was reasonably unambiguous!
Post by Richard Elling
There is a lot more that can be done here, too, but it really belongs in FMA
for illumos.
I agree, that was the first place I looked when the system started
playing up.
--
Ian.
Richard Elling
2013-10-08 00:19:00 UTC
Permalink
Post by Richard Elling
Post by Berend de Boer
Ian> I was eventually able to detach the drive and attach a spare,
Ian> but it would have been nice for this to have been recognised
Ian> as an error condition. It's probably the worst kind of error
Ian> in a pool!
Slow writes are not an error. For virtual disks which are out for a
few minutes and then recover it is the best option to simply wait.
I suggest you simply monitor service time, and when it gets beyond
your threshold, set an alarm.
The reason this needs FMA for correlation, as Keith suggests, is because
there is a (common?) failure mode in disks where some, but not all, LBAs
need additional retries.
I believe that was the case here, the system fell over the cliff when I copied a 13G file into pool.
Post by Richard Elling
There are usually counters, like the SCSI read-write
error recovery mode page, that can be queried to see if the disk is having
difficulty. But do not be surprised if the disk behaves nicely for a lot of I/Os
and poorly for a few.
In this case, with the write IOPs reduced to single digits and busy at 85%, the data was reasonably unambiguous!
What is your sample interval? The reason I ask is that the % busy is simply a
measure of the amount of time during which at least one I/O was outstanding on the device.
For slow devices and small sample intervals, we see 100% busy. Indeed, if the I/O
does not complete, then it could be 100% busy with zero I/O operations during the
interval.
Post by Richard Elling
There is a lot more that can be done here, too, but it really belongs in FMA
for illumos.
I agree, that was the first place I looked when the system started playing up.
Depending on the hardware, there are better ways to get health information. One
problem with relying on latency measurements is that the easy way to do those
measurements is to take the difference of the start and end times. What happens
when the end time never comes? I'm not convinced we can use latency alone to
achieve a diagnosis; we'll need some other measurement or device log data to
make a concrete diagnosis.
-- richard
Keith Wesolowski
2013-10-08 00:34:11 UTC
Permalink
Post by Richard Elling
Depending on the hardware, there are better ways to get health information. One
problem with relying on latency measurements is that the easy way to do those
measurements is to take the difference of the start and end times. What happens
when the end time never comes? I'm not convinced we can use latency alone to
achieve a diagnosis, we'll need some other measurement or device log data to
make a concrete diagnosis.
If a device's controller is busted or there is a fabric issue between
the initiator and target, issuing another I/O to get log data is
unlikely to result in usable data. Improved timeout handling and much
shorter timeouts have to play a role here, too. I agree that we can and
should use other sources of information where available but I also think
a fairly simple DE would help immensely.

The case in which the device simply never responds at all does already
generate errors that will lead to a correct diagnosis in reasonable time
if the timeout is sufficiently small. There's certainly room for
improvement there, but the topic of interest here is the case where the
device successfully completes (some or all) requests only after hundreds
or thousands of milliseconds. I've seen this happen with only a very
small number of accompanying rqs.derr, rqs.merr, and/or disk.tran
ereports -- few enough that no reasonable diagnosis could be made on
their basis alone. Asking the disk for diagnostic data is fine, but we
need to move on to a diagnosis fairly quickly regardless of whether we
get that data or what it tells us. It's really just a shortcut.
Ian Collins
2013-10-08 01:07:33 UTC
Permalink
Post by Richard Elling
Post by Ian Collins
Post by Richard Elling
Post by Berend de Boer
Ian> I was eventually able to detach the drive and attach a spare,
Ian> but it would have been nice for this to have been recognised
Ian> as an error condition. It's probably the worst kind of error
Ian> in a pool!
Slow writes are not an error. For virtual disks which are out for a
few minutes and then recover it is the best option to simply wait.
I suggest you simply monitor service time, and when it gets beyond
your threshold, set an alarm.
The reason this needs FMA for correlation, as Keith suggests, is because
there is a (common?) failure mode in disks where some, but not all, LBAs
need additional retries.
I believe that was the case here, the system fell over the cliff when
I copied a 13G file into pool.
Post by Richard Elling
There are usually counters, like the SCSI read-write
error recovery mode page, that can be queried to see if the disk is having
difficulty. But do not be surprised if the disk behaves nicely for a lot of I/Os
and poorly for a few.
In this case, with the write IOPs reduced to single digits and busy
at 85%, the data was reasonably unambiguous!
What is your sample interval? The reason I ask is because the % busy is simply a
measure of the amount of total time where at least one I/O was issued to the device.
For slow devices and small sample intervals, we see 100% busy. Indeed, if the I/O
does not complete, then it could be 100% busy with zero I/O operations during the
interval.
I was using a 10 second interval.
Post by Richard Elling
Post by Ian Collins
Post by Richard Elling
There is a lot more that can be done here, too, but it really belongs in FMA
for illumos.
I agree, that was the first place I looked when the system started playing up.
Depending on the hardware, there are better ways to get health
information. One
problem with relying on latency measurements is that the easy way to do those
measurements is to take the difference of the start and end times. What happens
when the end time never comes? I'm not convinced we can use latency alone to
achieve a diagnosis, we'll need some other measurement or device log data to
make a concrete diagnosis.
There are obvious cases, though, such as a service time an order of magnitude
greater than the expected worst case (for a SATA drive, that is).

This kind of extreme sample (especially when there are other devices in
a pool, which should exhibit similar behaviour) should be easy to pick
up. The device would have rapidly stepped from a normal (>50ms) to an
extreme (>500ms) service time.
--
Ian.
Richard Elling
2013-10-08 01:32:55 UTC
Permalink
more far below...
Post by Ian Collins
Post by Richard Elling
Post by Richard Elling
Post by Berend de Boer
Ian> I was eventually able to detach the drive and attach a spare,
Ian> but it would have been nice for this to have been recognised
Ian> as an error condition. It's probably the worst kind of error
Ian> in a pool!
Slow writes are not an error. For virtual disks which are out for a
few minutes and then recover it is the best option to simply wait.
I suggest you simply monitor service time, and when it gets beyond
your threshold, set an alarm.
The reason this needs FMA for correlation, as Keith suggests, is because
there is a (common?) failure mode in disks where some, but not all, LBAs
need additional retries.
I believe that was the case here, the system fell over the cliff when I copied a 13G file into pool.
Post by Richard Elling
There are usually counters, like the SCSI read-write
error recovery mode page, that can be queried to see if the disk is having
difficulty. But do not be surprised if the disk behaves nicely for a lot of I/Os
and poorly for a few.
In this case, with the write IOPs reduced to single digits and busy at 85%, the data was reasonably unambiguous!
What is your sample interval? The reason I ask is because the % busy is simply a
measure of the amount of total time where at least one I/O was issued to the device.
For slow devices and small sample intervals, we see 100% busy. Indeed, if the I/O
does not complete, then it could be 100% busy with zero I/O operations during the
interval.
I was using a 10 second interval.
Yep, with that long an interval it is impossible to do much latency work :-)
Your sample interval would catch two txg commits and a bunch of idle time, for
the common write workloads we tend to see.
Post by Ian Collins
Post by Richard Elling
Post by Richard Elling
There is a lot more that can be done here, too, but it really belongs in FMA
for illumos.
I agree, that was the first place I looked when the system started playing up.
Depending on the hardware, there are better ways to get health information. One
problem with relying on latency measurements is that the easy way to do those
measurements is to take the difference of the start and end times. What happens
when the end time never comes? I'm not convinced we can use latency alone to
achieve a diagnosis, we'll need some other measurement or device log data to
make a concrete diagnosis.
Although there are obvious cases, such as an order of magnitude greater than the expected worst case (for a SATA drive that is).
This kind of extreme sample (especially when there are other devices in a pool, which should exhibit similar behaviour) should be easy to pick up. The device would have rapidly stepped from a normal (>50ms) to an extreme (>500ms) service time.
There are some easy dtrace scripts that can be written to alert when latency becomes
large. But 500ms is not large for HDDs. Under heavy load and with the default
zfs_vdev_max_pending=10, a 5400 rpm SATA drive with NCQ can easily see 1 second
latencies when there are no problems. To get closer to the theoretical average response
time, you have to reduce max pending to 1 or 2. At 4, you'll start to see the effects of the
elevator algorithms in the disk. Couple that with a pair of cache flushes at the txg commit
and you can easily see large latency.
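
For reference, a sketch of the kind of script being alluded to (an
illustration, not an existing one): it uses the DTrace io provider to print
any physical I/O that takes longer than an arbitrary 500 ms. Run as root.

dtrace -qn '
io:::start { ts[arg0] = timestamp; }
io:::done /ts[arg0] && (timestamp - ts[arg0]) > 500000000/ {
    printf("%Y %s %s %d ms\n", walltimestamp, args[1]->dev_statname,
        args[0]->b_flags & B_READ ? "read" : "write",
        (timestamp - ts[arg0]) / 1000000);
}
io:::done { ts[arg0] = 0; }'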

The good news is that some disks log I/O operations that result in error correction. Some
also show the number of I/O operations where error correction resulted in additional
latency. For SCSI disks, these are usually reported in the error counter write (ECW) or
error counter read (ECR) log pages as "errors corrected with possible delays".
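
If the sg3_utils package happens to be installed, those counter pages can be
dumped directly; the device path here is only an example:

# Requires sg3_utils; device path is illustrative. Page 2 is the write
# error counter page and page 3 the read error counter page; look for
# the "errors corrected with possible delays" parameter.
sg_logs --page=2 /dev/rdsk/c2t0d0s0
sg_logs --page=3 /dev/rdsk/c2t0d0s0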

For example, if a disk is having difficulty with tracking, then you might see extra rotations
for reads as the head is adjusted and the media rotates back around. This should be logged,
and you'll notice the extra rotation delay, or multiples of the rotation delay, in the latency.

Also, it is possible for flaky wiring to be experienced as latency to one or more disks.
Replacing the disk might not fix this failure mode, so you need to differentiate those
latency problems affecting the internal disk from those in the interconnect fabric. This
becomes more difficult for FC or SAS fabrics that can have many parts in the data path.
-- richard
Garrett D'Amore
2013-10-08 02:11:37 UTC
Permalink
Post by Richard Elling
Post by Ian Collins
This kind of extreme sample (especially when there are other devices in a pool, which should exhibit similar behaviour) should be easy to pick up. The device would have rapidly stepped from a normal (>50ms) to an extreme (>500ms) service time.
There are some easy dtrace scripts that can be written to alert when latency becomes
large. But 500ms is not large for HDDs. Under heavy load and with the default
zfs_vdev_max_pending=10, a 5400 rpm SATA drive with NCQ can easily see 1 second
latencies when there are no problems. To get closer to the theoretical average response
time you have to reduce max pending to 1 or 2. At 4, you'll start to see the affects of the
elevator algorithms in the disk. Couple that with a pair of cache flushes at the txg commit
and you can easily see large latency.
500 msec is *huge* for I/O. Intolerable for some classes of application, really. However, if you're seeing errors, it might be nominal. The zfs_vdev_max_pending=10 default is obviously too large for typical spinning rust, and the sort algorithms used in drive firmware could theoretically "starve" an I/O to an outlier track if other I/Os are continuously occurring elsewhere in the drive. Modern drives typically service I/Os in less than 10 msec on average. But with several retries and deep queues, it can add up quickly.

The optimization of vdev_max_pending was an area where I had done some work -- for SSDs (or other truly random-access devices with deep I/O queues and parallelization capabilities) you typically want to set this value pretty high - high enough that all parallel pipelines are always full so that no pipeline stage ever stalls due to lack of I/O requests ready to be serviced. This is also true for vdevs where the vdev is an array. (In such cases the arrays usually have parallelization in them.)

Conversely, individual spindles usually can only schedule a few requests, and usually have short pipelines. While some depth may help firmware optimize its sort algorithms, this work can penalize other I/Os pretty badly. I encourage a small vdev_max_pending for such configurations -- 2 is usually large enough, and 4 is always enough (from observation.)

There are probably more formal methods that can be applied… consideration of changes in latency, dividing by actual queue depth, and noticing when queue depth does not reduce overall average latency. Automating this would be a nice feature.

In the meantime, per-vdev settings, plus automatic determination based on the rotational rate reported via SCSI inquiry (which is a good indicator of SSD or RAID vs. single spindle), would be a good first-order estimate. Anything with a non-zero rotation speed can be assumed to be a single spindle and should get a vdev max pending of between 2 and 4. (I would choose 2, personally.)
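
For what it's worth, that rotation rate is exposed in the SCSI Block Device
Characteristics VPD page (0xB1); a value of 1 means non-rotating media and 0
means not reported. With sg3_utils (an assumption, and the device path is
illustrative) it can be read like this:

# Illustrative only; requires sg3_utils. Prints the nominal rotation
# rate from VPD page 0xB1, a reasonable basis for an HDD-vs-SSD guess.
sg_vpd --page=bdc /dev/rdsk/c2t0d0s0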

Everything else can take a larger value -- 10 isn't a bad first guess, although it is just a guess. (If you know the number of spindles in an array, I'd use 2 x the spindle count, limited to 12 or thereabouts. It's not ideal, and writes to an array can have different properties than reads, but splitting that measurement up might be tricky. There's probably at least a master's thesis in there for someone to work on a project to autotune queues and *split* read and write queue depths.)

The other thing is that there may be substantial differences between enterprise-class and consumer-grade drives. Consumer-grade drives will keep retrying a bad sector forever; enterprise drives are more likely to give up and report the error, which is a good thing for ZFS. Also, I wonder if there are starvation/QoS components to the scheduling algorithms (disksort) applied in enterprise drive firmware vs. consumer-grade firmware. (I would welcome input from someone with direct knowledge here.)

- Garrett
Richard Elling
2013-10-08 02:54:39 UTC
Permalink
Post by Garrett D'Amore
Post by Richard Elling
Post by Ian Collins
This kind of extreme sample (especially when there are other devices in a pool, which should exhibit similar behaviour) should be easy to pick up. The device would have rapidly stepped from a normal (>50ms) to an extreme (>500ms) service time.
There are some easy dtrace scripts that can be written to alert when latency becomes
large. But 500ms is not large for HDDs. Under heavy load and with the default
zfs_vdev_max_pending=10, a 5400 rpm SATA drive with NCQ can easily see 1 second
latencies when there are no problems. To get closer to the theoretical average response
time you have to reduce max pending to 1 or 2. At 4, you'll start to see the affects of the
elevator algorithms in the disk. Couple that with a pair of cache flushes at the txg commit
and you can easily see large latency.
500 msec is *huge* for I/O. Intolerable for some classes of application, really. However, if you're seeing errors, it might be nominal.
Yep. Check the disk specifications for details. For example, the Seagate Constellation SAS
drives have a default max read recovery time of 2311.47 milliseconds and write recovery time
of 147.72 milliseconds. If your application can't handle that, then you'll need to change the default.
In many cases, SAS drives also allow you to change the default recovery time limit. For SATA...
the industry is less interested in enterprise-class features.
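
Where sdparm is available (an assumption), the read-write error recovery mode
page can be inspected, and on drives that permit it, adjusted; the device path
is illustrative and the exact field acronyms depend on the sdparm version:

# Illustrative only; requires sdparm. With --long, each field of the
# read-write error recovery mode page is shown with its current,
# changeable, default and saved values.
sdparm --page=rw --long /dev/rdsk/c2t0d0s0
# List the field acronyms this sdparm knows for that page (useful before
# attempting any --set):
sdparm --page=rw --enumerate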
-- richard
Boris Protopopov
2013-10-08 12:30:34 UTC
Permalink
I assume this was zio latency, not application I/O latency? The former might be high simply due to vdev queuing.

Typos courtesy of my iPhone
Post by Garrett D'Amore
Post by Richard Elling
Post by Ian Collins
This kind of extreme sample (especially when there are other devices in a pool, which should exhibit similar behaviour) should be easy to pick up. The device would have rapidly stepped from a normal (>50ms) to an extreme (>500ms) service time.
There are some easy dtrace scripts that can be written to alert when latency becomes
large. But 500ms is not large for HDDs. Under heavy load and with the default
zfs_vdev_max_pending=10, a 5400 rpm SATA drive with NCQ can easily see 1 second
latencies when there are no problems. To get closer to the theoretical average response
time you have to reduce max pending to 1 or 2. At 4, you'll start to see the affects of the
elevator algorithms in the disk. Couple that with a pair of cache flushes at the txg commit
and you can easily see large latency.
500 msec is *huge* for I/O. Intolerable for some classes of application, really. However, if you're seeing errors, it might be nominal. The zfs_vdev_max_pending=10 is obviously too large for typical spinning rust, and the sort algorithms used in drive firmwares could theoretically "starve" an I/O to an outlier track if other I/Os are continuously occurring elsewhere in the drive. Modern drives typically service I/Os in less than 10msec on average. But if you have several retries, and deep queues, it can add up quickly.
The optimization of vdev_max_pending was an area I had done some work -- for SSDs (or other truly random access with deep I/O queues and parallelization capabilities) you typically want to set this value pretty high - high enough that all parallel pipelines are always full so that no pipeline stage ever stalls due to lack of I/O requests ready to be serviced. This is also true for vdevs where the vdev is an array. (In such cases the arrays usually have parallelization in them.)
Conversely, individual spindles usually can only schedule a few requests, and usually have short pipelines. While some depth may help firmware optimize its sort algorithms, this work can penalize other I/Os pretty badly. I encourage a small vdev_max_pending for such configurations -- 2 is usually large enough, and 4 is always enough (from observation.)
There are probably more formal methods that can be applied… consideration of changes in latency, dividing by actual queue depth, and noticing when queue depth does not reduce overall average latency. Automating this would be a nice feature.
In the meantime, per-vdev settings, and and automatically determination based on reported (via SCSI inquiry) rotational delay (which is a good indicator of SSD or RAID vs. single spindle) is a good first order estimate. Anything with a non-zero rotation speed can be assumed to be a single spindle and should get vdev max pending of between 2 and 4. (I would choose 2, personally.)
Everything else can take a larger value -- 10 isn't a bad first guess, although it is just a guess. (If you know the number of spindles in an array, I'd use 2 x spindle count, limited to 12 or thereabouts. Its not ideal, and writes to an array can have different properties than reads, but splitting that measurement up might be tricky. There's probably at least a master's thesis in there for someone to work on a project to autotune queues and *split* read and write queue depths.)
The other thing is that there may be substantial differences in enterprise class drives from consumer grade drives. Consumer grade drives will keep trying a bad sector forever. Enterprise drives are more likely to give up and report the error. Which is a good thing for ZFS. Also, I wonder if there are starvation/QoS components to the scheduling algorithms (disksort) that is applied to Enterprise drive firmware vs. consumer grade firmware. (Would welcome input from someone with direct knowledge here.)
- Garrett
Keith Wesolowski
2013-10-08 16:15:55 UTC
Permalink
Post by Boris Protopopov
I assume this was a zio latency, not application io latency ? The former might be high simply due to vdev queuing.
This would be latency from sd's perspective, although zio latency should
be similar.

Don't forget that the latency of an I/O visible to a driver, sd, or ZFS
is the sum of the latency of all I/Os that the disk executed prior to
the I/O in question plus its own latency. For random reads, it's not
difficult to end up with occasional latency of 100+ ms even when the
device is functioning properly. ZFS's relatively low queue depth does
help to limit this -- if you used the full 255 queue depth it would be
very possible to see latency over 1s even when nothing is wrong. This
effect must be accounted for in any latency-sensitive DE.

Anyway, this seems off topic for ZFS and we should probably take it to
developer if there's anything useful left to say.
Jim Klimov
2013-10-09 17:38:32 UTC
Permalink
Post by Garrett D'Amore
The optimization of vdev_max_pending was an area I had done some work -- for SSDs (or other truly random access with deep I/O queues and parallelization capabilities) you typically want to set this value pretty high - high enough that all parallel pipelines are always full so that no pipeline stage ever stalls due to lack of I/O requests ready to be serviced. This is also true for vdevs where the vdev is an array. (In such cases the arrays usually have parallelization in them.)
Before this thread is closed into oblivion, there's one more question:
are "vdev_max_pending" settings currently set only system-wide, or
can be assigned to individual VDEVs indeed? Is this RTI'd into the
illumos-gate?

For example, can I use a short queue for spinning rust with a long
queue for L2ARC and ZIL SSDs grouped in the same pool, or use short
queues for HDD-based data pool and a long queue for SSD-based rpool?

If yes, by what mechanism can I set this up? :)
So far I've seen the all-or-nothing ways:

# grep vdev /etc/system
set zfs:zfs_vdev_max_pending = 3

or similar with "mdb -kw"...

Thanks,
//Jim
Boris Protopopov
2013-10-09 18:42:58 UTC
Permalink
I thought vdev_max_pending went away recently in the "write throttling
rework" commit.
Post by Jim Klimov
Post by Garrett D'Amore
The optimization of vdev_max_pending was an area I had done some work --
for SSDs (or other truly random access with deep I/O queues and
parallelization capabilities) you typically want to set this value pretty
high - high enough that all parallel pipelines are always full so that no
pipeline stage ever stalls due to lack of I/O requests ready to be
serviced. This is also true for vdevs where the vdev is an array. (In
such cases the arrays usually have parallelization in them.)
are "vdev_max_pending" settings currently set only system-wide, or
can be assigned to individual VDEVs indeed? Is this RTI'd into the
illumos-gate?
For example, can I use a short queue for spinning rust with a long
queue for L2ARC and ZIL SSDs grouped in the same pool, or use short
queues for HDD-based data pool and a long queue for SSD-based rpool?
If yes, by what mechanism can I set this up? :)
# grep vdev /etc/system
set zfs:zfs_vdev_max_pending = 3
or similar with "mdb -kw"...
Thanks,
//Jim
--
Best regards,

Boris Protopopov
Nexenta Systems
Richard Yao
2013-10-09 23:35:29 UTC
Permalink
That is correct. It is gone now.
I thought vdev_max_pending want away recently in the "write throttling rework commit".
Post by Jim Klimov
Post by Garrett D'Amore
The optimization of vdev_max_pending was an area I had done some work -- for SSDs (or other truly random access with deep I/O queues and parallelization capabilities) you typically want to set this value pretty high - high enough that all parallel pipelines are always full so that no pipeline stage ever stalls due to lack of I/O requests ready to be serviced. This is also true for vdevs where the vdev is an array. (In such cases the arrays usually have parallelization in them.)
are "vdev_max_pending" settings currently set only system-wide, or
can be assigned to individual VDEVs indeed? Is this RTI'd into the
illumos-gate?
For example, can I use a short queue for spinning rust with a long
queue for L2ARC and ZIL SSDs grouped in the same pool, or use short
queues for HDD-based data pool and a long queue for SSD-based rpool?
If yes, by what mechanism can I set this up? :)
# grep vdev /etc/system
set zfs:zfs_vdev_max_pending = 3
or similar with "mdb -kw"...
Thanks,
//Jim
Garrett D'Amore
2013-10-10 03:32:12 UTC
Permalink
I guess this is a good thing. I need to read the new code, clearly.

Sent from my iPhone
Post by Richard Yao
That is correct. It is gone now.
I thought vdev_max_pending want away recently in the "write throttling rework commit".
Post by Jim Klimov
Post by Garrett D'Amore
The optimization of vdev_max_pending was an area I had done some work -- for SSDs (or other truly random access with deep I/O queues and parallelization capabilities) you typically want to set this value pretty high - high enough that all parallel pipelines are always full so that no pipeline stage ever stalls due to lack of I/O requests ready to be serviced. This is also true for vdevs where the vdev is an array. (In such cases the arrays usually have parallelization in them.)
are "vdev_max_pending" settings currently set only system-wide, or
can be assigned to individual VDEVs indeed? Is this RTI'd into the
illumos-gate?
For example, can I use a short queue for spinning rust with a long
queue for L2ARC and ZIL SSDs grouped in the same pool, or use short
queues for HDD-based data pool and a long queue for SSD-based rpool?
If yes, by what mechanism can I set this up? :)
# grep vdev /etc/system
set zfs:zfs_vdev_max_pending = 3
or similar with "mdb -kw"...
Thanks,
//Jim
Jim Klimov
2013-10-10 10:41:29 UTC
Permalink
Post by Garrett D'Amore
I guess this is a good thing. I need to read the new code, clearly.
Post by Richard Yao
That is correct. It is gone now.
Post by Boris Protopopov
I thought vdev_max_pending want away recently in the "write
throttling rework commit".
Interesting indeed... did this hit the illumos-gate?

And are there any usage/setup docs besides the discussion of
the change a couple of months back on this list? :)
//Jim
Boris Protopopov
2013-10-10 13:03:44 UTC
Permalink
There are very informative "big theory" comments, as I recall.

Typos courtesy of my iPhone
Post by Jim Klimov
Post by Garrett D'Amore
I guess this is a good thing. I need to read the new code, clearly.
Post by Richard Yao
That is correct. It is gone now.
Post by Boris Protopopov
I thought vdev_max_pending want away recently in the "write
throttling rework commit".
Interesting indeed... did this hit the illumos-gate?
And are there any usage/setup docs beside the discussion of
the change a couple of months back on this list? :)
//Jim
Richard Yao
2013-10-11 22:49:14 UTC
Permalink
Here is a link:

https://github.com/illumos/illumos-gate/commit/69962b5647e4a8b9b14998733b765925381b727e
Post by Jim Klimov
Post by Garrett D'Amore
I guess this is a good thing. I need to read the new code, clearly.
Post by Richard Yao
That is correct. It is gone now.
Post by Boris Protopopov
I thought vdev_max_pending want away recently in the "write
throttling rework commit".
Interesting indeed... did this hit the illumos-gate?
And are there any usage/setup docs beside the discussion of
the change a couple of months back on this list? :)
//Jim

Garrett D'Amore
2013-10-09 20:02:47 UTC
Permalink
It's all or nothing. I have changes I submitted for perusal months ago sitting around somewhere. I can track them down.

(We got hung up on my changes to the default values for spinning vs. non-spinning media.)

Sent from my iPhone
Post by Jim Klimov
Post by Garrett D'Amore
The optimization of vdev_max_pending was an area I had done some work -- for SSDs (or other truly random access with deep I/O queues and parallelization capabilities) you typically want to set this value pretty high - high enough that all parallel pipelines are always full so that no pipeline stage ever stalls due to lack of I/O requests ready to be serviced. This is also true for vdevs where the vdev is an array. (In such cases the arrays usually have parallelization in them.)
are "vdev_max_pending" settings currently set only system-wide, or
can be assigned to individual VDEVs indeed? Is this RTI'd into the
illumos-gate?
For example, can I use a short queue for spinning rust with a long
queue for L2ARC and ZIL SSDs grouped in the same pool, or use short
queues for HDD-based data pool and a long queue for SSD-based rpool?
If yes, by what mechanism can I set this up? :)
# grep vdev /etc/system
set zfs:zfs_vdev_max_pending = 3
or similar with "mdb -kw"...
Thanks,
//Jim
Ian Collins
2013-10-08 03:28:29 UTC
Permalink
Post by Richard Elling
more far below...
Post by Ian Collins
Although there are obvious cases, such as an order of magnitude
greater than the expected worst case (for a SATA drive that is).
This kind of extreme sample (especially when there are other devices
in a pool, which should exhibit similar behaviour) should be easy to
pick up. The device would have rapidly stepped from a normal (>50ms)
to an extreme (>500ms) service time.
There are some easy dtrace scripts that can be written to alert when latency becomes
large. But 500ms is not large for HDDs. Under heavy load and with the default
zfs_vdev_max_pending=10, a 5400 rpm SATA drive with NCQ can easily see 1 second
latencies when there are no problems. To get closer to the theoretical average response
time you have to reduce max pending to 1 or 2. At 4, you'll start to see the affects of the
elevator algorithms in the disk. Couple that with a pair of cache flushes at the txg commit
and you can easily see large latency.
Are we talking about the same thing here? I don't think I've ever seen more
than double-digit asvc_t for healthy (7200 RPM) SATA drives, certainly
not consistent triple-digit numbers. The real giveaway would have been
6 months in the 20-30ms range and then a steady value in the hundreds, while
all the other drives were still in the 20-30ms range.
Post by Richard Elling
The good news is that some disks log I/O operations that result in error correction. Some
also show the number of I/O operations where error correction resulted in additional
latency. For SCSI disks, these are usually reported in the error code write (ECW) or
error code read (ECR) log pages as "errors corrected with possible delays"
For example, if a disk is having difficulty with tracking, then you
might see extra rotations
for reads as the head is adjusted and media rotates back around. This should be logged
and you'll notice the extra rotation delay, or multiples of the
rotation delay in the latency.
Also, it is possible for flaky wiring to be experienced as latency to one or more disks.
Replacing the disk might not fix this failure mode, so you need to differentiate those
latency problems affecting the internal disk from those in the
interconnect fabric. This
becomes more difficult for FC or SAS fabrics that can have many parts in the data path.
Yes, I can see that being an issue. Any algorithm used would have to be
aware of the system's configuration.
--
Ian.
Richard Elling
2013-10-08 04:05:49 UTC
Permalink
Post by Richard Elling
more far below...
Post by Ian Collins
Although there are obvious cases, such as an order of magnitude greater than the expected worst case (for a SATA drive that is).
This kind of extreme sample (especially when there are other devices in a pool, which should exhibit similar behaviour) should be easy to pick up. The device would have rapidly stepped from a normal (>50ms) to an extreme (>500ms) service time.
There are some easy dtrace scripts that can be written to alert when latency becomes
large. But 500ms is not large for HDDs. Under heavy load and with the default
zfs_vdev_max_pending=10, a 5400 rpm SATA drive with NCQ can easily see 1 second
latencies when there are no problems. To get closer to the theoretical average response
time you have to reduce max pending to 1 or 2. At 4, you'll start to see the affects of the
elevator algorithms in the disk. Couple that with a pair of cache flushes at the txg commit
and you can easily see large latency.
Are we talking the same thing here? I don't think I've ever seen more than double digit asvc_t for healthy (7200 PRM) SATA drives. Certainly not consistent triple digit numbers. The real giveaway would have been 6 months in the 20-30ms range then a steady value in the hundreds, while all the other drives were still in the 20-30ms range.
Yes, but iostat reports an average, not the distribution. The outliers are easily lost in the average.
-- richard
Ian Collins
2013-10-08 04:11:51 UTC
Permalink
Post by Richard Elling
Post by Ian Collins
Post by Richard Elling
There are some easy dtrace scripts that can be written to alert when latency becomes
large. But 500ms is not large for HDDs. Under heavy load and with the default
zfs_vdev_max_pending=10, a 5400 rpm SATA drive with NCQ can easily see 1 second
latencies when there are no problems. To get closer to the
theoretical average response
time you have to reduce max pending to 1 or 2. At 4, you'll start to
see the affects of the
elevator algorithms in the disk. Couple that with a pair of cache
flushes at the txg commit
and you can easily see large latency.
Are we talking the same thing here? I don't think I've ever seen
more than double digit asvc_t for healthy (7200 PRM) SATA drives.
Certainly not consistent triple digit numbers. The real giveaway
would have been 6 months in the 20-30ms range then a steady value in
the hundreds, while all the other drives were still in the 20-30ms range.
Yes, but iostat reports an average, not the distribution. The outliers
are easily lost in the average.
Yes, but when the outliers become the norm, the average goes with
them... In other words in my case, the average for the dud drive was
the outlier in the group of drive averages, or in its own pool of
historical samples.
--
Ian.
Richard Elling
2013-10-08 05:03:07 UTC
Permalink
Post by Richard Elling
Post by Richard Elling
There are some easy dtrace scripts that can be written to alert when latency becomes
large. But 500ms is not large for HDDs. Under heavy load and with the default
zfs_vdev_max_pending=10, a 5400 rpm SATA drive with NCQ can easily see 1 second
latencies when there are no problems. To get closer to the theoretical average response
time you have to reduce max pending to 1 or 2. At 4, you'll start to see the affects of the
elevator algorithms in the disk. Couple that with a pair of cache flushes at the txg commit
and you can easily see large latency.
Are we talking the same thing here? I don't think I've ever seen more than double digit asvc_t for healthy (7200 PRM) SATA drives. Certainly not consistent triple digit numbers. The real giveaway would have been 6 months in the 20-30ms range then a steady value in the hundreds, while all the other drives were still in the 20-30ms range.
Yes, but iostat reports an average, not the distribution. The outliers are easily lost in the average.
Yes, but when the outliers become the norm, the average goes with them... In other words in my case, the average for the dud drive was the outlier in the group of drive averages, or in its own pool of historical samples.
Right, but the case that seems worse is when some are fast, some are slow.
You can run a test, and it passes with flying colors. But try to get one specific
LBA and latency is horrible :-(
-- richard
Schlacta, Christ
2013-10-08 19:35:43 UTC
Permalink
Post by Richard Elling
Post by Ian Collins
Post by Richard Elling
There are some easy dtrace scripts that can be written to alert when latency becomes
large. But 500ms is not large for HDDs. Under heavy load and with the default
zfs_vdev_max_pending=10, a 5400 rpm SATA drive with NCQ can easily see 1 second
latencies when there are no problems. To get closer to the theoretical average response
time you have to reduce max pending to 1 or 2. At 4, you'll start to see the effects of the
elevator algorithms in the disk. Couple that with a pair of cache flushes at the txg commit
and you can easily see large latency.
Are we talking the same thing here? I don't think I've ever seen more
than double digit asvc_t for healthy (7200 RPM) SATA drives. Certainly not
consistent triple digit numbers. The real giveaway would have been 6
months in the 20-30ms range then a steady value in the hundreds, while all
the other drives were still in the 20-30ms range.
Post by Richard Elling
Yes, but iostat reports an average, not the distribution. The outliers
are easily lost in the average.
Yes, but when the outliers become the norm, the average goes with them...
In other words in my case, the average for the dud drive was the outlier
in the group of drive averages, or in its own pool of historical samples.
Wrong. When the outliers become the norm, the average over time increases.
If the drive failure occurs at t1 and results in the behavior you
describe, the average from t0 until t1 will be some value x0, but from
t1 to some later time t2 the average will be higher. Furthermore, as n
increases, the average from t0 to tn will increase asymptotically toward
the average from t1 to tn. Therefore, for each device we can detect the
failure as a deviation from the single-device amortized response time.
I would propose that a variation of more than n standard deviations over
some low m, as determined by the administrator, should trigger a fault
condition.
This will allow administrators to designate environment-specific acceptable
response curves and failure conditions as simple polynomials.
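
A rough illustration of that policy (a sketch, not an existing tool): keep a
running mean and standard deviation of asvc_t per device from "iostat -xn"
and flag samples more than N deviations above the mean. N=3 and the
30-sample warm-up are placeholders.

iostat -xn 10 | awk -v N=3 '
    NF == 11 && $1 ~ /^[0-9.]+$/ {
        dev = $11; x = $8 + 0
        n[dev]++; sum[dev] += x; sumsq[dev] += x * x
        mean = sum[dev] / n[dev]
        var = sumsq[dev] / n[dev] - mean * mean
        sd = (var > 0) ? sqrt(var) : 0
        if (n[dev] > 30 && sd > 0 && x > mean + N * sd)
            printf("FAULT? %s asvc_t=%.1f ms (mean %.1f, sd %.1f)\n",
                   dev, x, mean, sd)
    }'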
Ian Collins
2013-10-08 20:01:28 UTC
Permalink
Post by Richard Elling
Post by Ian Collins
Post by Richard Elling
Yes, but iostat reports an average, not the distribution. The
outliers are easily lost in the average.
Post by Ian Collins
Yes, but when the outliers become the norm, the average goes with
them... In other words in my case, the average for the dud drive was
the outlier in the group of drive averages, or in its own pool of
historical samples.
Wrong. When the outliers become the norm, the average over time increases.
No, you misunderstood what I wrote.

As Richard said, "iostat reports an average, not the distribution". So
if the sampled times used to calculate the average jump up, the average
(reported value) also jumps up.
Post by Schlacta, Christ
I would propose that a variation of more than n standard deviations
over some low m as determined by the administrator will trigger a
fault condition.
This will allow administrators to designate environment specific
acceptable response curves and failure conditions as simple polynomials.
I agree with that part!
--
Ian.
Paul Kraus
2013-10-09 13:01:23 UTC
Permalink
On Oct 8, 2013, at 3:35 PM, "Schlacta, Christ" <***@aarcane.org> wrote:

Therefore, for each device we can detect the failure as deviation from single device amortized response time.
I would propose that a variation of more than n standard deviations over some low m as determined by the administrator will trigger a fault condition.
I have seen drive populations where there are two (or more) sets of "normal" responses. The best example I can recall (I have been away from managing large numbers of spindles for over a year now) was a server with 120 750GB Sun badged SATA drives. The Seagate drives had a very noticeably different service time than the Hitachi drives. Within each population it was easy to spot a drive starting to go bad from the iostat service time values over time (60 second sample over the work day). Since the drives were all in one zpool (or a hot spare), and had been since the inception of the zpool, the workload on each was very similar.

Sorting iostat output by service time led to three groupings, the Seagate, the Hitachi, and the hot spares (I forget whether the Seagate or the Hitachi were faster). You could not spot a hot spare going bad (not enough activity), but spotting a drive failing in either of the others was easy.
This will allow administrators to designate environment specific acceptable response curves and failure conditions as simple polynomials.
Agreed.

--
Paul Kraus
Deputy Technical Director, LoneStarCon 3
Sound Coordinator, Schenectady Light Opera Company
Henk Langeveld
2013-10-09 17:03:52 UTC
Permalink
On 9 Oct 15:01, Paul Kraus wrote in reply to "Schlacta, Christ":
Post by Paul Kraus
I have seen drive populations where there are two (or more) sets of
"normal" responses. The best example I can recall (I have been away
from managing large numbers of spindles for over a year now) was a
server with 120 750GB Sun badged SATA drives. The Seagate drives had
a very noticeably different service time than the Hitachi drives.
Within each population it was easy to spot a drive starting to go bad
from the iostat service time values over time (60 second sample over
the work day). Since the drives were all in one zpool (or a hot
spare), and had been since the inception of the zpool, the workload
on each was very similar.
Sorting iostat output by service time led to three groupings, the
Seagate, the Hitachi, and the hot spares (I forget whether the
Seagate or the Hitachi were faster). You could not spot a hot spare
going bad (not enough activity), but spotting a drive failing in
either of the others was easy.
Incidentally, I was testing two methods to collect stats on drive delay:

1. Scanning kstat quite aggressively (1-second intervals) for sd:::[wr]cnt

It's quite amusing to see all disks in the pool hit
sd:*:*:rcnt values of 10, while sd:*:*:wcnt counts stay quite low.
This is what I would expect for a well-behaved set of disks.

Annoyingly, kstat only knows about 'sd' targets, no c*t*d* here.

2. Scanning the wait/actv/wsvc_t/asvc_t and %w and %b columns of iostat.

An interesting observation here is that you can include the controller
(-C) and see combinations of high %b for the controller(s) with high %w
for the pool. Again, this appears logical; the disks get fed the
traffic they're expected to handle, and the pool sits there waiting.

Again, individual drives mostly show extremely low values for wait and
wsvc_t. A simple sort on the wsvc_t and asvc_t columns will show the
disks with the higher delay at the bottom.
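As a rough sketch of method 2: run iostat -xn once, parse the columns, and sort by asvc_t so the slow devices land at the bottom. The column positions assume the usual illumos iostat -xn layout; different flags or platforms would need different parsing. Alarming when a device crosses a site-specific threshold would be a small extension.

#!/usr/bin/env python3
# Sketch only: run iostat -xn for one interval and sort devices by asvc_t
# so the slowest end up at the bottom.  Assumes the usual illumos column
# order: r/s w/s kr/s kw/s wait actv wsvc_t asvc_t %w %b device.
import subprocess

def sample_iostat(interval=10):
    out = subprocess.run(["iostat", "-xn", str(interval), "2"],
                         capture_output=True, text=True, check=True).stdout
    rows = []
    for line in out.splitlines():
        parts = line.split()
        if len(parts) != 11:
            continue
        try:
            rows.append((parts[10], float(parts[6]), float(parts[7])))
        except ValueError:
            continue                    # header line (r/s w/s ...)
    return rows[len(rows) // 2:]        # drop the first (since-boot) sample

if __name__ == "__main__":
    for device, wsvc_t, asvc_t in sorted(sample_iostat(), key=lambda r: r[2]):
        print("%-12s wsvc_t=%10.1f asvc_t=%10.1f" % (device, wsvc_t, asvc_t))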

HOWEVER,

Some samples show *huge* numbers for these columns.

***@omnios-vagrant:/export/home/vagrant# LANG=C TZ=UTC iostat -Td -Cxznd 10 2
Wed Oct 9 16:56:49 UTC 2013
                    extended device statistics
    r/s    w/s   kr/s    kw/s     wait     actv    wsvc_t    asvc_t  %w  %b device
    2.1   56.6  137.0  6465.3 198669.0 198666.8 3384810.6 3384774.1   5   3 c2
    2.1   56.6  137.0  6465.3 198669.0 198666.8 3384810.6 3384774.1   5   3 c2t0d0
Wed Oct 9 16:56:59 UTC 2013
                    extended device statistics
    r/s    w/s   kr/s    kw/s  wait         actv  wsvc_t       asvc_t  %w  %b device
    3.6  401.9  284.8 46681.8   0.2 1858604528.4     0.5 4583041961.9   7  18 c2
    3.6  401.9  284.8 46681.8   0.2 1858604528.4     0.5 4583041961.9   7  18 c2t0d0


You might think this would be a VM artifact, and so did I at first.
I found that our production ZFS pool shows the same weird stats,
although they are rare.

Is this a case of sampling error, where the counters/timers don't get
reset properly?
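Whatever the root cause turns out to be, a monitoring script probably wants to discard such samples rather than alarm on them. A crude plausibility filter might look like the following sketch; the cutoffs are arbitrary and purely illustrative.

def plausible(asvc_t_ms, actv, max_svc_ms=10_000.0, max_actv=1_000.0):
    """Sketch: reject samples that cannot be physically meaningful --
    multi-hour service times or queue depths in the millions are more
    likely counter/sampling glitches than real I/O.  Cutoffs are arbitrary."""
    return 0.0 <= asvc_t_ms <= max_svc_ms and 0.0 <= actv <= max_actv

# e.g. plausible(3384774.1, 198666.8) -> False for the c2t0d0 sample above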

Followup to ***@...

Regards,

Henk Langeveld
Boris Protopopov
2013-10-08 12:39:54 UTC
Permalink
I think the key here is comparison across the drives, e.g. across child vdevs. Other factors equal, this allows one to detect vdev-specific anomalies.

Typos courtesy of my iPhone
Post by Richard Elling
more far below...
Post by Ian Collins
Although there are obvious cases, such as an order of magnitude greater than the expected worst case (for a SATA drive that is).
This kind of extreme sample (especially when there are other devices in a pool, which should exhibit similar behaviour) should be easy to pick up. The device would have rapidly stepped from a normal (<50ms) to an extreme (>500ms) service time.
There are some easy dtrace scripts that can be written to alert when latency becomes
large. But 500ms is not large for HDDs. Under heavy load and with the default
zfs_vdev_max_pending=10, a 5400 rpm SATA drive with NCQ can easily see 1 second
latencies when there are no problems. To get closer to the theoretical average response
time you have to reduce max pending to 1 or 2. At 4, you'll start to see the effects of the
elevator algorithms in the disk. Couple that with a pair of cache flushes at the txg commit
and you can easily see large latency.
Are we talking about the same thing here? I don't think I've ever seen more than double-digit asvc_t for healthy (7200 RPM) SATA drives. Certainly not consistent triple-digit numbers. The real giveaway would have been 6 months in the 20-30ms range then a steady value in the hundreds, while all the other drives were still in the 20-30ms range.
Post by Richard Elling
The good news is that some disks log I/O operations that result in error correction. Some
also show the number of I/O operations where error correction resulted in additional
latency. For SCSI disks, these are usually reported in the error code write (ECW) or
error code read (ECR) log pages as "errors corrected with possible delays".
For example, if a disk is having difficulty with tracking, then you might see extra rotations
for reads as the head is adjusted and media rotates back around. This should be logged
and you'll notice the extra rotation delay, or multiples of the rotation delay in the latency.
Also, it is possible for flaky wiring to be experienced as latency to one or more disks.
Replacing the disk might not fix this failure mode, so you need to differentiate those
latency problems affecting the internal disk from those in the interconnect fabric. This
becomes more difficult for FC or SAS fabrics that can have many parts in the data path.
Yes, I can see that being an issue. Any algorithm used would have to be aware of the system's configuration.
--
Ian.
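For anyone who wants to poke at those counters by hand, something along these lines may work. It shells out to sg_logs from sg3_utils, which is an assumption: availability, device paths, and the exact field labels vary by platform, drive, and tool version, so treat it as a sketch rather than a recipe.

#!/usr/bin/env python3
# Sketch only: read the "errors corrected with possible delays" counters
# from the SCSI write (0x02) and read (0x03) error counter log pages via
# sg_logs from sg3_utils.  Device path and field labels are assumptions.
import subprocess

PAGES = {"write": "0x02", "read": "0x03"}

def delayed_corrections(device):
    results = {}
    for name, page in PAGES.items():
        out = subprocess.run(["sg_logs", "--page=" + page, device],
                             capture_output=True, text=True, check=True).stdout
        for line in out.splitlines():
            if "corrected with possible delays" in line.lower():
                # e.g. "  Errors corrected with possible delays = 12"
                results[name] = int(line.split("=")[-1].strip())
    return results

if __name__ == "__main__":
    print(delayed_corrections("/dev/rdsk/c2t0d0s0"))   # hypothetical path

Watching those counters climb between samples, rather than their absolute values, is the more useful signal.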
Keith Wesolowski
2013-10-07 16:16:20 UTC
Permalink
Post by Ian Collins
I can't remember if the topic has come up before (I'm sure it probably
has) but I'd like to know if any work has been done to fault a drive
with an over long service time?
We've discussed it here, and I plan to implement it in SmartOS at some
point in the next year. But it will be implemented in FMA, not ZFS,
because this is a much lower-level failure that is properly handled at
that layer of the system, using telemetry from layers below ZFS. Put
another way, a disk being used by UFS or as a raw block device should be
diagnosed and retired in the same way.

FWIW, this is our (sadly, still secret) bug ID OS-2519. Once we're
happy with whatever we implement, I'll open an illumos bug and it'll
make its way there.
Post by Ian Collins
One of my systems pretty much ground to a halt with next to no write
active today, iostat showed one drive (the pool is a stripe of mirrors)
having a service time close to half a second... This wasn't good news
for the VMs using the storage.
Yep. That happens sometimes. Disks suck.
Henk Langeveld
2013-10-08 16:39:49 UTC
Permalink
Post by Keith Wesolowski
Post by Ian Collins
I can't remember if the topic has come up before (I'm sure it probably
has) but I'd like to know if any work has been done to fault a drive
with an over long service time?
We've discussed it here, and I plan to implement it in SmartOS at some
point in the next year. But it will be implemented in FMA, not ZFS,
because this is a much lower-level failure that is properly handled at
that layer of the system, using telemetry from layers below ZFS. Put
another way, a disk being used by UFS or as a raw block device should be
diagnosed and retired in the same way.
FWIW, this is our (sadly, still secret) bug ID OS-2519. Once we're
happy with whatever we implement, I'll open an illumos bug and it'll
make its way there.
This looks similar to illumos issue 1553:
https://www.illumos.org/issues/1553 Feature #1553

| ZFS should not trust the layers underneath regarding drive
timeouts/failure
| Added by Alasdair Lumsden about 2 years ago. Updated 12 months ago.
|
| Status: New Start date: 2011-09-22 Priority: High

I quickly recognise a couple of the same themes as in this thread:

* This belongs in FMA
* It belongs in the sd driver
* sd is a pig to maintain.


Regards,

Henk Langeveld
Keith Wesolowski
2013-10-08 16:53:35 UTC
Permalink
Post by Henk Langeveld
* This belongs in FMA
True.
Post by Henk Langeveld
* It belongs in the sd driver
Sort of. The telemetry generation does; the diagnosis emphatically does
not. And some of the telemetry is already there and we aren't even
making use of it.
Post by Henk Langeveld
* sd is a pig to maintain.
It sucks, but I don't believe there's any reason adding this needs to be
hard because of sd. The difficult part is in figuring out how to
represent and consume the telemetry, not in where to generate it. Most
of what sucks about sd is unrelated to this. At least half of everyone
who ever works on SunOS seriously reaches a point at which they decide
they should rewrite sd.c. Those who become veterans eventually
understand why they should not.
Garrett D'Amore
2013-10-08 17:03:32 UTC
Permalink
Post by Keith Wesolowski
Post by Ian Collins
I can't remember if the topic has come up before (I'm sure it probably
has) but I'd like to know if any work has been done to fault a drive
with an over long service time?
We've discussed it here, and I plan to implement it in SmartOS at some
point in the next year. But it will be implemented in FMA, not ZFS,
because this is a much lower-level failure that is properly handled at
that layer of the system, using telemetry from layers below ZFS. Put
another way, a disk being used by UFS or as a raw block device should be
diagnosed and retired in the same way.
FWIW, this is our (sadly, still secret) bug ID OS-2519. Once we're
happy with whatever we implement, I'll open an illumos bug and it'll
make its way there.
This looks similar to illumos issue 1553: https://www.illumos.org/issues/1553 Feature #1553
| ZFS should not trust the layers underneath regarding drive timeouts/failure
| Added by Alasdair Lumsden about 2 years ago. Updated 12 months ago.
|
| Status: New Start date: 2011-09-22 Priority: High
* This belongs in FMA
* It belongs in the sd driver
* sd is a pig to maintain.
sd is pretty bad, but not *that* bad.

The issue of timeouts is rather complicated, in part because sd services so many different media and storage busses. I'd really like to see it broken back up -- for example, having CD-ROM support in the same file as disk support causes a lot of extra complication, and some decisions that could otherwise be made are constrained by the different criteria that optical media must adhere to compared with typical disks.

The other bit here is that the HBAs themselves need to play an important role -- the timeouts appropriate for iSCSI are very different from those that are appropriate for a local SAS bus.

For the most part, the work most needed is in the HBA drivers (for timeouts). Recognizing statistical patterns for a dying but not dead device, and *responding* to the holistic behavior belongs in FMA.
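As a loose illustration of what recognizing such patterns could mean in practice (outside FMA, and with made-up thresholds): keep a per-device exponentially weighted moving average of service time and flag the device when fresh samples run persistently above that baseline.

# Sketch only: per-device EWMA baseline with a strike counter, as one crude
# way to notice a "dying but not dead" device.  alpha, ratio, and min_strikes
# are illustrative tuning knobs, not recommendations.
class LatencyBaseline:
    def __init__(self, alpha=0.05, ratio=5.0, min_strikes=6):
        self.alpha = alpha            # how quickly the baseline adapts
        self.ratio = ratio            # sample/baseline ratio counted as a strike
        self.min_strikes = min_strikes
        self.baseline = None
        self.strikes = 0

    def update(self, asvc_t_ms):
        """Feed one service-time sample; return True when the device looks sick."""
        if self.baseline is None:
            self.baseline = asvc_t_ms
            return False
        suspicious = asvc_t_ms > self.ratio * self.baseline
        self.strikes = self.strikes + 1 if suspicious else 0
        if not suspicious:
            # Only fold healthy samples into the baseline, so a sick drive
            # cannot slowly drag its own baseline upward.
            self.baseline += self.alpha * (asvc_t_ms - self.baseline)
        return self.strikes >= self.min_strikes

if __name__ == "__main__":
    det = LatencyBaseline()
    for sample in [24, 26, 22, 30, 25] * 4 + [310, 290, 450, 500, 380, 420]:
        if det.update(sample):
            print("device looks sick at sample", sample)
            break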

- Garrett
Regards,
Henk Langeveld
Albert Lee
2013-10-07 16:42:56 UTC
Permalink
Post by Ian Collins
I can't remember if the topic has come up before (I'm sure it probably
has) but I'd like to know if any work has been done to fault a drive with
an over long service time?
We have some facilities for zio latency reporting and retiring unresponsive
vdevs, but they need more work to be generally useful.

-Albert
Post by Ian Collins
One of my systems pretty much ground to a halt with next to no write
active today, iostat showed one drive (the pool is a stripe of mirrors)
having a service time close to half a second... This wasn't good news for
the VMs using the storage.
Post by Ian Collins
I was eventually able to detach the drive and attach a spare, but it
would have been nice for this to have been recognised as an error
condition. It's probably the worst kind of error in a pool!
Post by Ian Collins
The good thing is the system eventually recovered without a reboot.
--
Ian.