Discussion: Troubleshooting Slow Write Performance
Ray Van Dolson
2013-08-14 06:17:37 UTC
Permalink
This is somewhat related to an earlier post (that post had more to do with
resilvering, but our issues may only coincidentally have overlapped with
the resilver).

Large pool that is about 75% used (230TB of 310TB used) and is starting
to perform fairly poorly -- mostly with writes. Copying an 11GB file
from our root pool to our big pool takes nearly 10 minutes, and this is
while writes are coming in via CIFS (not very quickly), so something is
definitely wrong. Our pool has 239 disks in 16 vdevs.

Theorizing this could be because we're getting fairly full, or that
there's some fragmentation going on, I found some DTrace scripts that
may help make that case. Here's one that shows metaslab allocation
times:

dtrace -n '::metaslab_alloc:entry { self->ts = timestamp }
  ::metaslab_alloc:return /self->ts/ {
    this->elapsed = timestamp - self->ts;
    @lat["Time(us)"] = quantize(this->elapsed / 1000);
    self->ts = 0 }
  tick-5sec { printa(@lat); trunc(@lat) }'
dtrace: description '::metaslab_alloc:entry ' matched 3 probes
CPU ID FUNCTION:NAME
20 90281 :tick-5sec
Time(us)
value ------------- Distribution ------------- count
1 | 0
2 | 4
4 |@@ 57
8 |@@@@@@@@@@ 297
16 |@@@@@@@@@@@@ 347
32 |@@@@@@ 188
64 |@ 26
128 |@ 32
256 |@ 19
512 | 12
1024 | 5
2048 | 6
4096 | 0
8192 | 0
16384 | 1
32768 | 0
65536 |@@ 54
131072 |@@@@ 110
262144 | 3
524288 | 0


20 90281 :tick-5sec
Time(us)
value ------------- Distribution ------------- count
1 | 0
2 | 5
4 |@ 46
8 |@@@@@@@ 400
16 |@@@@@@@@@@@@@@ 791
32 |@@@@@@@ 365
64 |@ 45
128 |@ 38
256 |@ 39
512 | 7
1024 | 23
2048 | 14
4096 | 0
8192 | 0
16384 | 0
32768 | 0
65536 |@@ 117
131072 |@@@@@ 297
262144 | 2
524288 | 0

On other ZFS systems, allocation times all fall in the lower buckets.
Here we have a not-insignificant number of allocations above 65536 us.

The other potentially useful DTrace script was one that counts
txg_delay() calls per second:

dtrace -qn '::txg_delay:entry /x=1/ { @cnt[args[0]] = count(); } tick-1sec /x/ {printf("%Y count: ",walltimestamp); printa("%@d\n",@cnt); trunc(@cnt); x=0; }'
2013 Aug 13 23:14:22 count: 82
2013 Aug 13 23:14:23 count: 103
2013 Aug 13 23:14:24 count: 28
2013 Aug 13 23:14:26 count: 160
2013 Aug 13 23:14:28 count: 121

Again, on other systems we see zero occurrences of txg_delay(), but on
the "slow" system counts like the above show up very consistently.

Am I on the write track with the fragmentation thing (due to the pool
being fairly full)? Other things I should be looking at?

We have, of course, checked for slow disks and the like as well...

Thanks,
Ray
Matthew Ahrens
2013-08-14 21:48:33 UTC
Permalink
Post by Ray Van Dolson
Large pool that is about 75% used (230TB of 310TB used) and is starting
to perform fairly poorly -- mostly with writes.
[...]
Am I on the write track with the fragmentation thing (due to the pool
being fairly full)? Other things I should be looking at?
Yes, you are on the "write" (har har) track. We have seen similar problems
with the allocation code path when a pool is relatively full and
fragmented. One of the root causes is that the allocator changes which
metaslab it wants to allocate from too frequently. You can see that
space_map_load() is being called a lot while doing allocations. We
(specifically my colleague George Wilson) are working on a number of fixes
in this area.
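
If you want to see that for yourself, a rough sketch (assuming the fbt
provider can probe space_map_load() on your build) would be something like:

# Count space_map_load() calls per second while the slow writes are happening.
dtrace -n 'fbt::space_map_load:entry { @loads["space_map_load/sec"] = count(); }
  tick-1sec { printa(@loads); trunc(@loads); }'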

--matt
Ray Van Dolson
2013-08-14 22:19:36 UTC
Permalink
Post by Matthew Ahrens
Yes, you are on the "write" (har har) track. We have seen similar problems
with the allocation code path when a pool is relatively full and fragmented.
One of the root causes is that the allocator changes which metaslab it wants
to allocate from too frequently. You can see that space_map_load() is being
called a lot while doing allocations. We (specifically my colleague George
Wilson) are working on a number of fixes in this area.
--matt
Thanks, Matt.

We threw in an SSD and set sync=always ... the log device is getting
hit now, but performance is still slow. Presumably a log device
doesn't help mitigate metaslab fragmentation type issues.

Aware of any short-term steps to minimize impact? Freeing space
presumably. :)

I'm also posting a synopsis on the Oracle ZFS list and am trying to
escalate this at Nexenta to their development team.
Ray
Bill Sommerfeld
2013-08-14 22:57:43 UTC
Permalink
Post by Ray Van Dolson
We threw in an SSD and set sync=always ... the log device is getting
hit now, but performance is still slow. Presumably a log device
doesn't help mitigate metaslab fragmentation type issues.
Aware of any short-term steps to minimize impact? Freeing space
presumably. :)
Yes.

I saw similar symptoms repeatedly hit a pool I managed a few years ago. In
each case, freeing space so that usage was below a threshold dramatically
improved write performance.

For that pool, the threshold seemed to hit when it was around 70% (+/- a few
percent) utilized. If you can free up space, do so. Once I had an idea of
where it started happening, I was generally able to manage free space to keep
performance out of the sad zone, but your environment may vary.
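
If it helps, watching utilization against that threshold only takes standard
zpool(1M) output; a minimal sketch (the pool name below is just a placeholder):

# Report allocated/free space and the capacity percentage once a minute.
while true; do zpool list -o name,size,allocated,free,capacity bigpool; sleep 60; done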
Ray Van Dolson
2013-08-14 23:03:38 UTC
Permalink
Post by Bill Sommerfeld
I saw similar symptoms repeatedly hit a pool I managed a few years ago. In
each case, freeing space so that usage was below a threshold dramatically
improved write performance.
For that pool, the threshold seemed to hit when it was around 70% (+/- a few
percent) utilized. If you can free up space, do so.
Up until now we thought our "sad threshold" (I like that) was 80%.
Perhaps it's lower now. :)

One confounder is that we have another stack of disks -- even larger
and also quite full. It still performs quite well, but it is also
"younger". I'm worried that over time, as data changes on it, it will
eventually get into this same state.

Sounds like there's work being done to improve things, so we may just
have to survive until those changes are completed and fed back into
NexentaStor.

Ray
Matthew Ahrens
2013-08-14 23:08:00 UTC
Permalink
Post by Ray Van Dolson
We threw in an SSD and set sync=always ... the log device is getting
hit now, but performance is still slow. Presumably a log device
doesn't help mitigate metaslab fragmentation type issues.
I'm assuming you added the SSD as a log device. If your writes were
already synchronous, that could improve performance a bit. Setting
sync=always will not improve performance.
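
If you decide to back that setting out later, reverting the property is a
one-liner (the dataset name below is just a placeholder):

# Return to the default, where only synchronous (O_SYNC/fsync) writes go
# through the log device.
zfs set sync=standard bigpool/share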

--matt
Ray Van Dolson
2013-08-14 23:11:09 UTC
Permalink
Post by Matthew Ahrens
I'm assuming you added the SSD as a log device. If your writes were already
synchronous, that could improve performance a bit. Setting sync=always will
not improve performance.
--matt
Yes, the device was added as a log device. The writes were not O_SYNC
before, but we thought that if we set sync=always we might be able to
shield even async writes from the metaslab allocation delays by sending
them to the SSD first. Guess that was wrong. :)
Ray
George Wilson
2013-08-14 23:10:14 UTC
Permalink
Post by Ray Van Dolson
We threw in an SSD and set sync=always ... the log device is getting
hit now, but performance is still slow. Presumably a log device
doesn't help mitigate metaslab fragmentation type issues.
Aware of any short-term steps to minimize impact? Freeing space
presumably. :)
One thing you can try is to enable 'metaslab_debug' to keep the metaslabs
loaded. I suspect the long alloc times are cases where we have to read a
relatively large space map off disk, which causes you to stall waiting on
the synchronous read. If you have memory to spare, you can try the
following:

echo "metaslab_debug/W 1" | mdb -kw

Then monitor the alloc times and see if that helps. You can revert the
change by running:

echo "metaslab_debug/W 0" | mdb -kw

BTW, have you measured the latency of your disks?

- George
Ray Van Dolson
2013-08-14 23:56:33 UTC
Permalink
Post by George Wilson
One thing you can try is to enable 'metaslab_debug' to keep the metaslabs
loaded. I suspect the long alloc times are cases where we have to read a
relatively large space map off disk, which causes you to stall waiting on
the synchronous read. If you have memory to spare, you can try the
following:
echo "metaslab_debug/W 1" | mdb -kw
Then monitor the alloc times and see if that helps. You can revert the
change by running:
echo "metaslab_debug/W 0" | mdb -kw
BTW, have you measured the latency of your disks?
- George
Wow. That was dramatic! These boxes each have 128GB of memory, so
presumably plenty for holding the space maps. My allocation times are
all < 64us again and I'm no longer seeing txg_delays. I was also able
to copy a file onto the pool at 350MB/sec (presumably limited by the
source disks).

A couple of questions:

- How can we see how large the space maps are, either per vdev or
overall? More or less curious to see if everything can (easily) fit in
memory.

- Could space maps be cached in L2ARC/ARC via some tunable or by
default?

In any case, this has given us some breathing room while we look to
purge off some data.

Thanks so much.

Ray
Ray Van Dolson
2013-08-15 00:13:17 UTC
Permalink
Post by George Wilson
BTW, have you measured the latency of your disks?
Forgot to answer this.

Yes -- nothing jumps out at me as being too worrisome currently:

# iostat -xen 30 | awk '$15 ~ /c.*/ { if ($8 > 10) { printf "%-25s %s\n", $15, $8 } }'
device asvc_t
device asvc_t
c8t50000394C8D2DDF3d0 12.2
device asvc_t
c8t50000394C8D2DDF3d0 18.9
device asvc_t
c8t50000394C8D2DDF3d0 11.1
device asvc_t
c8t50000394C8D2DDF3d0 15.0
device asvc_t
device asvc_t

Only a short sample, but you can see there's one disk that periodically
jumps over 10ms. We may look at replacing it.

Ray
Richard Elling
2013-08-19 20:04:03 UTC
Permalink
Post by Ray Van Dolson
Only a short sample, but you can see there's one disk that periodically
jumps over 10ms. We may look at replacing it.
Those numbers are fine for HDDs and many SSDs. Actual response times
can range up to nearly a second -- or, as defined in the disk's
configuration, enterprise disks can have a tunable limit. A better gauge
is the disk's error-recovery logs: look for "errors corrected with
possible delays" in the error counter log pages.
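
One way to check that: iostat's error summary, plus (if smartmontools happens
to be installed) the disk's own error counter log pages. The device name below
is the one from the earlier iostat sample; the slice is illustrative.

# Driver-level soft/hard/transport error counts for the suspect disk.
iostat -En c8t50000394C8D2DDF3d0
# SCSI error counter log pages (ECC fast/delayed, rereads/rewrites, totals).
smartctl -a -d scsi /dev/rdsk/c8t50000394C8D2DDF3d0s0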
-- richard

George Wilson
2013-08-15 13:50:40 UTC
Permalink
Post by Ray Van Dolson
Wow. That was dramatic! These boxes each have 128GB of memory, so
presumably plenty for holding the space maps. My allocation times are
all < 64us again and I'm no longer seeing txg_delays. I was also able
to copy a file onto the pool at 350MB/sec (presumably limited by the
source disks).
- How can we see how large the space maps are, either per vdev or
overall? More or less curious to see if everything can (easily) fit in
memory.
You can get a rough estimate by running 'zdb -mm <pool>'. This will dump
out the number of segments in each metaslab, and each segment is 64
bytes. Here's a snippet of the 'zdb' output:

Metaslabs:
    vdev          0
    metaslabs   124   offset                spacemap          free
    ---------------   -------------------   ---------------   -------------
    metaslab      0   offset            0   spacemap     39   free     237M
                      segments      11547   maxsize   3.38M   freepct    11%
    metaslab      1   offset     80000000   spacemap     50   free     109M
                      segments       8565   maxsize    153K   freepct     5%
    metaslab      2   offset    100000000   spacemap     54   free     114M
                      segments      13058   maxsize    249K   freepct     5%
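
To turn that into a rough total, you can sum the segment counts and multiply
by 64 bytes; a quick sketch (the pool name below is just a placeholder):

# Approximate in-core footprint of all the space maps, assuming ~64 bytes
# per segment as described above.
zdb -mm bigpool | awk '$1 == "segments" { total += $2 }
    END { printf("%d segments, ~%.0f MB\n", total, total * 64 / 1048576) }'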
Post by Ray Van Dolson
- Could space maps be cached in L2ARC/ARC via some tunable or by
default?
The space map blocks are cached in the L2ARC/ARC by default.

Thanks,
George
Ray Van Dolson
2013-08-16 06:23:14 UTC
Permalink
Post by George Wilson
You can get a rough estimate by running 'zdb -mm <pool>'. This will dump
out the number of segments in each metaslab, and each segment is 64 bytes.
[...]
Post by Ray Van Dolson
- Could space maps be cached in L2ARC/ARC via some tunable or by
default?
The space map blocks are cached in the L2ARC/ARC by default.
With the amount of memory our system has, I'm guessing that either (a)
pressure from "other" things that also sit in the ARC was pushing some
of the space map blocks out, or (b) the space map blocks weren't
accessed frequently enough to stay warm in the ARC?

It sounds like the addition of an L2ARC device might also help here,
then, correct? Are there any knobs we can tune to make the space maps
more likely to stay in ARC/L2ARC than other data?
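
For what it's worth, one way to see whether ARC metadata is under pressure
(assuming the stock illumos arcstats kstats) is:

# Compare metadata currently held in the ARC against its limit.
kstat -p zfs:0:arcstats:arc_meta_used zfs:0:arcstats:arc_meta_limit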

Is there an Illumos bug where I can track activity for fixes in this
area?
Thanks,
Ray