On 9 Oct 15:01, Paul Kraus wrote in reply to: "Schlacta, Christ"
Post by Paul Kraus
I have seen drive populations where there are two (or more) sets of
"normal" responses. The best example I can recall (I have been away
from managing large numbers of spindles for over a year now) was a
server with 120 750GB Sun badged SATA drives. The Seagate drives had
a very noticeably different service time than the Hitachi drives.
Within each population it was easy to spot a drive starting to go bad
from the iostat service time values over time (60 second sample over
the work day). Since the drives were all in one zpool (or a hot
spare), and had been since the inception of the zpool, the workload
on each was very similar.
Sorting iostat output by service time led to three groupings, the
Seagate, the Hitachi, and the hot spares (I forget whether the
Seagate or the Hitachi were faster). You could not spot a hot spare
going bad (not enough activity), but spotting a drive failing in
either of the others was easy.
Incidentally, I was testing two methods to collect stats on drive delay:
1. Scanning kstat quite aggressively (1-second intervals) for sd:::[wr]cnt
(see the rough command sketch after this list).
It's quite amusing to see all disks in the pool hit
sd:*:*:rcnt values of 10, while the sd:*:*:wcnt counts stay quite low.
This is what I would expect from a well-behaved set of disks.
Annoyingly, kstat only knows about 'sd' targets, no c*t*d* here.
2. Scanning the wait/actv/wsvc_t/asvc_t and %w and %b columns of iostat.
An interesting observation here is that you can include the controller
(-C) and see combinations of high %b for the controller(s) with high %w
for the pool. Again, this appears logical: the disks get fed the
traffic they're expected to handle, and the pool sits there waiting.
Again, individual drives mostly show extremely low values for wait and
wsvc_t. A simple sort on the wsvc_t and asvc_t columns (see the
one-liner after this list) will show the disks with the higher delay
at the bottom.
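For (1), the polling can be done with the plain kstat(1M) CLI, e.g.
something along these lines (written from memory, so treat the exact
invocation as untested):

  # print the rcnt and wcnt counters of every sd target once per second
  kstat -p sd:::rcnt sd:::wcnt 1

Any other kstat consumer (the Perl or C interfaces) would do just as
well; the point is simply the 1-second sampling of rcnt/wcnt.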
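For (2), the sort is nothing fancier than a crude one-liner like this
(field 8 is asvc_t in the -xn column layout shown further down, so
double-check the field number on your box):

  # one 60-second sample, data lines only, ordered by asvc_t (field 8)
  iostat -xzn 60 2 | awk '$1 ~ /^[0-9]/' | sort -n -k8

The awk pattern drops the header lines, but the first (since-boot)
sample is still mixed in, so this is only good for eyeballing.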
HOWEVER,
Some samples show *huge* numbers for these columns.
***@omnios-vagrant:/export/home/vagrant# LANG=C TZ=UTC iostat -Td -Cxznd 10 2
Wed Oct 9 16:56:49 UTC 2013
                    extended device statistics
    r/s    w/s   kr/s    kw/s     wait     actv    wsvc_t    asvc_t  %w  %b device
    2.1   56.6  137.0  6465.3 198669.0 198666.8 3384810.6 3384774.1   5   3 c2
    2.1   56.6  137.0  6465.3 198669.0 198666.8 3384810.6 3384774.1   5   3 c2t0d0
Wed Oct 9 16:56:59 UTC 2013
                    extended device statistics
    r/s    w/s   kr/s    kw/s wait         actv wsvc_t       asvc_t  %w  %b device
    3.6  401.9  284.8 46681.8  0.2 1858604528.4    0.5 4583041961.9   7  18 c2
    3.6  401.9  284.8 46681.8  0.2 1858604528.4    0.5 4583041961.9   7  18 c2t0d0
You might think this would be a VM artifact, and so did I at first.
I found that our production ZFS pool shows the same weird stats,
although they are rare.
Is this a case of a sampling error, where the counters/timers don't get
reset properly?
Followup to ***@...
Regards,
Henk Langeveld