Discussion: Unexpected write pattern
Ian Collins
2013-08-11 22:23:05 UTC
Permalink
I've been looking at the performance of some ZFS storage used by VMware
guests (all via NFS) that has a reputation for being "slow".  I've
noticed that there are a significant number of small (they look like 4K)
writes going straight to disk.

The pool comprises a stripe of 10 two-way mirrors plus two cache and two
log devices.

A typical pool drive in iostat:

    r/s    w/s   Mr/s   Mw/s wait actv wsvc_t asvc_t  %w  %b device
    0.1   36.3    0.0    0.5  0.0  0.2    0.0    6.6   0   4 c4t5000C50014CCB518d0

The log devices:

    8.2    0.0    0.0    0.0  0.0  0.0    0.0    0.3   0   0 c3t1d0

One mirror from zpool iostat:

pool                        alloc   free   read  write   read  write
  mirror                     731G   197G      7    192   688K   685K
    c4t5000C50014CCAAB7d0       -      -      3     31   302K   685K
    c4t5000C50014CCB518d0       -      -      4     32   386K   685K

As you can see, there is a high ratio of write operations to write
bandwidth.  I wouldn't expect to see this with ZFS unless the writes
were synchronous, and if they were, I would expect to see most of the
IOPS go to the log devices, which isn't the case.

Any ideas?
--
Ian
Saso Kiselkov
2013-08-11 23:09:03 UTC
Permalink
Post by Ian Collins
I've been looking at the performance of some ZFS storage used by VMware
guests (all via NFS) that has a reputation for being "slow". I've
noticed that there are a significant number of small (they look like 4K)
writes going straight to disk.
[...]
Any ideas?
Just out of curiosity, can you try to do an fsstat on your filesystem?

# fsstat /<filesystem> 1

It'd be interesting to see what's actually entering the VFS layer here.
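
If it would also help, the size distribution of the writes as they arrive
over NFS can be seen with a quick DTrace aggregation. Rough sketch, assuming
NFSv3 traffic and the illumos nfsv3 provider (args[2] is the WRITE3args;
data.data_len is the payload size of each request):

# dtrace -n 'nfsv3:::op-write-start { @["NFSv3 write bytes"] = quantize(args[2]->data.data_len); }'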

Cheers,
--
Saso
Saso Kiselkov
2013-08-11 23:14:19 UTC
Permalink
P.S. Also see zilstat for info on how your workload interacts with the ZIL:
http://www.richardelling.com/Home/scripts-and-programs-1/zilstat

Cheers,
--
Saso
Ian Collins
2013-08-12 07:38:45 UTC
Permalink
Hello Saso,
Most of the I/O goes to one filesystem:
 new  name   name  attr  attr lookup rddir  read  read write write
file remov   chng   get   set    ops   ops   ops bytes   ops bytes
   0     0      0    83     0      0     0     8  400K    33 57.4K /export/high
   0     0      0    62     0      0     0     5  216K    26  212K /export/high
   0     0      0    42     0      0     0     5  216K    16 54.1K /export/high
   0     0      0    34     0      0     0     4  200K    13 56.0K /export/high
   0     0      0    98     0      0     0     9  416K    40 38.6K /export/high
   0     0      0    46     0      0     0     6  218K    15 77.0K /export/high
   0     0      0    52     0      0     0     5  216K    21  122K /export/high
   0     0      0    29     0      0     0     4  200K    10  162K /export/high
   0     0      0    40     0      0     0     9  416K    11 38.2K /export/high
   0     0      0   113     0      0     0     9  280K    47  237K /export/high
   0     0      0   112     0      0     0     5  216K    51  133K /export/high
   0     0      0    69     0      0     0     9  416K    25 75.5K /export/high
   0     0      0    66     0      0     0     5  216K    28  202K /export/high
I'll try zilstat later when I get home.
The zilstat results were unremarkable, along the lines of

waiting for txg commit...
     txg    N-Bytes  N-Bytes/s  N-Max-Rate    B-Bytes  B-Bytes/s  B-Max-Rate  ops  <=4kB  4-32kB  >=32kB
15581296    1532928     383232      439096   15699968    3924992     5234688  420    197      39     184
15581297    2700008     540001     1300536   17813504    3562700     5001216  289      0      23     266
15581298    1214536     242907      358216   13254656    2650931     4956160  243      0      82     161
15581299    2121088     424217     1445376   11190272    2238054     4890624  166      0      16     150
15581300     835856     167171      187720    9797632    1959526     3620864  170      0      25     145
15581301    2119416     423883     1165728   15384576    3076915     4259840  167      0       4     163
15581302    2036768     407353      942064   19234816    3846963     6082560  256      0      23     233
15581303    1803904     360780      805696   16871424    3374284     6590464  254      0      28     226

with occasional bursts of higher small-write counts:

15581309    2797112     559422      825232   25776128    5155225     7098368  927    320     306     301

I'm wondering if zpool iostat is simply getting it wrong and reporting
too high a value for the mirror vdevs and pool. The iostat numbers
agree with the individual drive numbers reported by zpool iostat.
--
Ian.
Saso Kiselkov
2013-08-12 08:49:46 UTC
Permalink
Post by Ian Collins
Hello Saso,
Most of the I/O goes to one filesystem:
[...]
I'll try zilstat later when I get home.
The zilstat results were unremarkable, along the lines of
[...]
This might actually be a bit of a problem, because of
zfs_immediate_write_sz (which is 32k by default). This means that any
sync write that is above that size will miss the slog and go straight to
the underlying storage. Can you try doing the following:

# echo zfs_immediate_write_sz/W0t131072 | mdb -kw

And seeing what happens? Just as an experiment, you can then revert it
back using:

# echo zfs_immediate_write_sz/W0t32768 | mdb -kw
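
It may also be worth recording the current value first and, if the experiment
helps, making the change persistent across reboots. Something along these
lines should work (reading with the same 32-bit width the writes above use;
the /etc/system line only takes effect at the next boot):

# echo zfs_immediate_write_sz/D | mdb -k
# echo "set zfs:zfs_immediate_write_sz = 131072" >> /etc/system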
Post by Ian Collins
I'm wondering if zpool iostat is simply getting it wrong and reporting
too high a value for the mirror vdevs and pool. The iostat numbers
agree with the individual drive numbers reported by zpool iostat.
Cheers,
--
Saso
Saso Kiselkov
2013-08-12 08:52:02 UTC
Permalink
Post by Saso Kiselkov
This might actually be a bit of a problem, because of
zfs_immediate_write_sz (which is 32k by default). This means that any
sync write that is above that size will miss the slog and go straight to
the underlying storage.
[...]
And one more thing: can you do a zfs get logbias on your filesystems and
check that the value hasn't been changed to throughput?
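
For example (following the placeholder style above, <pool> stands in for your
actual pool name; logbias=throughput steers sync writes away from the slog,
latency is the default):

# zfs get -r logbias <pool>
# zfs set logbias=latency <pool>/export/high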

Cheers,
--
Saso
Ian Collins
2013-08-12 09:12:12 UTC
Permalink
Post by Saso Kiselkov
This might actually be a bit of a problem, because of
zfs_immediate_write_sz (which is 32k by default). This means that any
sync write that is above that size will miss the slog and go straight to
the underlying storage. Can you try doing the following:
# echo zfs_immediate_write_sz/W0t131072 | mdb -kw
And seeing what happens? Just as an experiment, you can then revert it
back using:
# echo zfs_immediate_write_sz/W0t32768 | mdb -kw
Thanks for the memory jog; I've seen this problem before when testing MS SQL
VMs on SmartOS. I don't think they would let me mess with this box too
much during the week, as there are too many production VMs using it.

Given they have been using a 32k block size, the log devices probably
don't see much use. I'll see if I can increase zfs_immediate_write_sz
or move some VMs to a share with a smaller block size next weekend.
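
Assuming that 32k block size is the dataset recordsize, checking it and
creating a smaller-recordsize share would look something like the following
(dataset names are placeholders and 8K is just an example value; recordsize
only applies to newly written files, so existing VM disks would have to be
copied across to pick it up):

# zfs get recordsize <pool>/export/high
# zfs create -o recordsize=8K <pool>/export/high-8k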

Thanks,
--
Ian.
Boris Protopopov
2013-08-12 13:52:33 UTC
Permalink
Hello Ian,

My information could be out of date, but if this is an NFS datastore we
are talking about then, in fact, all write requests issued by the VMware
NFS v3 client are synchronous. You can verify that in a TCP trace:
typically, write requests include the FILE_SYNC (stable_how) flag, which
instructs the server to sync the write. To my knowledge, this is how VMware
resolves the issue of "crash consistency", among other things, by not
caching anything on the client.
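
That should also be verifiable on the server side without a packet capture.
Rough sketch, assuming the illumos nfsv3 DTrace provider (args[2] is the
WRITE3args; its stable field carries the stable_how value, where 2 means
FILE_SYNC):

# dtrace -n 'nfsv3:::op-write-start { @[args[2]->stable == 2 ? "FILE_SYNC" : args[2]->stable == 1 ? "DATA_SYNC" : "UNSTABLE"] = count(); }'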

Boris.
--
Boris Protopopov
Nexenta Systems

455 El Camino Real, Santa Clara, CA 95050

[d] 408.791.3366 | [c] 978.621.6901
Skype: bprotopopov
Richard Elling
2013-08-12 16:57:47 UTC
Permalink
Hi Ian,
Get nfssvrtop. It shows the latency of servicing NFS on a per-client basis.
-- richard
--

***@RichardElling.com
+1-760-896-4422
Ian Collins
2013-08-13 03:22:38 UTC
Permalink
Post by Richard Elling
Hi Ian,
Get nfssvrtop. It shows the latency of servicing NFS on a per-client basis.
Thanks Richard. So many tools, so little time!
--
Ian.
Ian Collins
2013-08-13 10:57:07 UTC
Permalink
Post by Richard Elling
Hi Ian,
Get nfssvrtop. It shows the latency of servicing NFS on a per-client basis.
A useful tool, thanks.

I was surprised to see that more than 50% of the sync writes were
misaligned, with far higher service times than on my much lower-spec test
box, which suggests the VMs haven't been well configured.
--
Ian.
Richard Elling
2013-08-13 17:54:29 UTC
Permalink
Post by Ian Collins
Post by Richard Elling
Hi Ian,
Get nfssvrtop. It shows the latency of servicing NFS on a per-client basis.
A useful tool, thanks.
I was surprised to see more than 50% of the sync writes were misaligned and way higher service times than my much lower spec test box, which suggests the VMs haven't been well configured.
Actually, probably not. I did a bunch of testing on response times as a function of
misalignment and never saw a statistically significant difference. The following dtrace
helps:

dtrace -qn 'nfsv3:::op-write-start {this->ts=timestamp;this->align=args[2]->offset%4096;this->align=this->align > 2048 ? 4096 - this->align : this->align} nfsv3:::op-write-done /this->ts/ {@t[this->align]=avg((timestamp - this->ts)/1000);@c[this->align]=count();this->ts=0} dtrace:::END {printf("NFSv3 write alignment analysis, %Y\n%20s %20s %20s\n",walltimestamp,"Misalignment (bytes)","Count","Avg_t (microsecs)");printa("%20d %@20d %@20d\n", @c, @t)}'

-- richard

--

***@RichardElling.com
+1-760-896-4422
Ian Collins
2013-08-13 21:33:04 UTC
Permalink
Post by Richard Elling
Actually, probably not. I did a bunch of testing on response times as a function of
misalignment and never saw a statistically significant difference.
Thanks again Richard.

There looks to be some significance here:

NFSv3 write alignment analysis, 2013 Aug 13 21:30:31
Misalignment (bytes)     Count     Avg_t (microsecs)
                2048        89                  2833
                1024       199                 22105
                1536       330                 14292
                 512      3412                  2837
                   0      3458                  1423
--
Ian.
Richard Elling
2013-08-17 21:11:49 UTC
Permalink
Hi Ian,
more below…
Post by Ian Collins
There looks to be some significance here:
NFSv3 write alignment analysis, 2013 Aug 13 21:30:31
Misalignment (bytes)     Count     Avg_t (microsecs)
                2048        89                  2833
                1024       199                 22105
                1536       330                 14292
                 512      3412                  2837
                   0      3458                  1423
I agree, more analysis is needed. This (big) one-liner provides averages, so it is
really just an indicator that you might want to look at more detail. Peeling back
another layer will show whether there is one file or client causing all of the
variance, or whether there is a common, systemic cause. In this slightly smaller
one-liner, we can dig into what is misaligned and what the "cost" of the
misalignment is.

dtrace -n 'nfsv3:::op-write-start {this->ts=timestamp;this->align=args[2]->offset%4096;this->file=args[1]->noi_curpath;this->size=args[2]->data.data_len/1024} nfsv3:::op-write-done /this->ts/ {@[this->file,this->size,this->align]=avg((timestamp - this->ts)/1000);this->ts=0} profile:::tick-10sec {exit(0)}' | sort

-- richard
--
ZFS and performance consulting
http://www.RichardElling.com
Ian Collins
2013-08-18 05:11:15 UTC
Permalink
Post by Richard Elling
I agree, more analysis is needed. This (big) one-liner provides averages, so it is
really just an indicator that you might want to look at more detail. Peeling back
another layer will show whether there is one file or client causing all of the
variance, or whether there is a common, systemic cause. In this slightly smaller
one-liner, we can dig into what is misaligned and what the "cost" of the
misalignment is.
Thanks again Richard,

I'll gather up some data and see if I can spot any patterns. At a
glance, the bigger service times (I've seen over a second for an 8K
write) all occur at a misalignment of 3584, but not all writes at that
misalignment are slow.
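
To see the full latency distribution at each misalignment rather than the
average, I'll probably use a quantize() variant of Richard's one-liner,
something like (same probes and offset calculation as above):

# dtrace -qn 'nfsv3:::op-write-start {this->ts=timestamp;this->align=args[2]->offset%4096} nfsv3:::op-write-done /this->ts/ {@[this->align]=quantize((timestamp - this->ts)/1000);this->ts=0} profile:::tick-10sec {exit(0)}'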
--
Ian.