Trey -
Your setup is similar to mine. I have the exact same 10g card, the Intel
X520 (X520-SR2 in my case), with the same MTU. Have you tried to increase
the MTU to 9000? Not sure if mbuffer would take advantage of that or not.
My server just does NFS server duty.
There's an lz4 command-line utility (hosted on Google Code) that will do
stdin/stdout lz4 compression; you should give that a go:
https://code.google.com/p/lz4/
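If it behaves like the other gzip-style tools, dropping it into the pipe
should look roughly like this (host and dataset names made up here, and
double-check the flags on whatever lz4 build you grab, since they have
changed between releases):
  # receive side, started first: decompress in front of zfs receive
  mbuffer -q -s 128k -m 1G -I 9090 | lz4 -d | zfs receive -F tank/fs
  # send side: compress the stream before it hits the wire
  zfs send tank/fs@snap | lz4 | mbuffer -q -s 128k -m 1G -O recvhost:9090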
My data is all audio files in various codecs that aren't compressible, so
compression would just waste CPU cycles.
Matt,
I was running mbuffer on both sides, but I had a different block size on
the receiving side, which I've fixed. That helped a bit. I've noticed
that the sending side buffer is at 100%, but the receiving side buffer is
always at 0%.
Looks like a network problem, so I played around with different block sizes
in mbuffer...
1M = 200MB/sec
128k = 210MB/sec
64k = 319MB/sec
32k = 412MB/sec
16k = 350MB/sec
So it looks like, at least on Illumos, a 32k block size gives the best
performance. The sweet spot seems to be a 32k block size with a 1G buffer;
both 512M and 2G buffers result in slower performance for whatever reason.
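For reference, the invocations I'm testing look roughly like this (host and
dataset names changed; -s is the block size and -m the buffer size):
  # receive side, started first
  mbuffer -q -s 32k -m 1G -I 9090 | zfs receive -F data/fs
  # send side
  zfs send data/fs@snap | mbuffer -q -s 32k -m 1G -O recvhost:9090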
Initially my zpool is at about 15% busy, with the sender buffer at 100%
and the recv buffer at 0%, but a few minutes into the transfer the busy
jumps to 100% and the recv buffer creeps up to 100%. So it's now waiting
on the disks, and the transfer slows to around 350MB/sec:
device r/s w/s Mr/s Mw/s wait actv svc_t %w %b
data 0.0 4507.9 0.0 384.1 2672.3 172.6 631.1 13 15
data 0.0 4644.8 0.0 391.8 2549.9 165.1 584.5 12 14
data 0.0 4706.7 0.0 397.7 2545.1 170.5 577.0 14 15
data 0.0 4594.4 0.0 392.2 2769.0 172.2 640.2 14 16
data 0.0 4488.8 0.0 379.4 2496.2 167.7 593.5 13 14
data 0.0 4685.9 0.0 390.7 2622.7 173.5 596.7 14 16
data 0.0 4704.3 0.0 406.4 2751.5 179.7 623.1 14 15
data 0.0 4501.0 0.0 389.1 2844.6 172.0 670.2 13 14
data 0.0 4869.8 0.0 410.2 2716.0 170.7 592.8 16 26
data 0.4 4788.7 0.0 422.6 2668.0 169.1 592.4 21 42
data 1.2 4811.2 0.0 428.0 2126.1 170.5 477.2 19 47
data 1.0 4880.7 0.0 419.4 2335.0 170.0 513.1 30 60
data 7.4 4472.1 0.0 391.3 2577.5 183.1 616.3 85 100
data 6.0 4304.9 0.0 385.0 2884.7 181.8 711.3 82 100
data 8.8 4537.0 0.0 393.3 2936.5 183.1 686.3 87 100
data 12.8 4641.4 0.0 392.8 2875.0 178.3 656.0 88 100
data 15.2 4625.7 0.0 395.5 3145.5 178.9 716.3 87 100
data 13.0 4329.9 0.0 382.0 2720.2 179.5 667.7 84 100
data 13.0 4668.2 0.0 418.1 3213.0 194.0 727.8 84 100
data 26.5 4470.0 0.0 389.8 2754.4 184.9 653.7 89 100
data 10.2 4834.5 0.0 395.5 2569.6 182.5 568.1 83 100
data 15.8 4405.2 0.0 398.2 2919.4 190.8 703.5 87 100
data 20.4 4577.7 0.0 388.2 2744.7 180.0 636.1 86 100
Looks like there are a few reads that start at the same time the busy
increases... but I wouldn't think 15 reads/sec would cause that much
usage? I tested it a few times; any time there are some reads, the busy
time increases. Weird.
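I may try to track down what's issuing those reads with something along
these lines on the recv box (just the stock io provider, counting kernel
stacks for read buffers):
  dtrace -n 'io:::start /args[0]->b_flags & B_READ/ { @[stack()] = count(); }'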
Here is the dtrace output on the recv side when the disks are at 15% and
at 100% busy.
When the disks are at 15% busy:
18 76210 :tick-10s
restore_free
value ------------- Distribution ------------- count
1024 | 0
2048 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@ 334
4096 |@@@@@@@@@@ 116
8192 | 0
restore_object
value ------------- Distribution ------------- count
4096 | 0
8192 |@ 8
16384 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@ 301
32768 |@ 11
65536 | 0
restore_read
value ------------- Distribution ------------- count
1024 | 0
2048 | 555
4096 |@ 660
8192 |@@@@@@@@@@@@@@@@@@@ 24231
16384 | 229
32768 | 8
65536 | 0
131072 |@@ 2459
262144 |@@@@@@@@@@@@@@@@@ 21684
524288 | 34
1048576 | 1
2097152 | 0
4194304 | 5
8388608 | 8
16777216 | 11
33554432 | 8
67108864 | 1
134217728 | 0
restore_write
value ------------- Distribution ------------- count
4096 | 0
8192 | 2
16384 | 188
32768 | 1
65536 | 0
131072 | 1
262144 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@ 23962
524288 | 214
1048576 | 1
2097152 | 0
4194304 | 5
8388608 | 8
16777216 | 11
33554432 | 8
67108864 | 1
134217728 | 0
Afterwards, when the disks are at 100% busy:
16 76210 :tick-10s
restore_free
value ------------- Distribution ------------- count
1024 | 0
2048 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@ 2067
4096 |@@@@@@@@@@@ 737
8192 | 3
16384 | 0
restore_object
value ------------- Distribution ------------- count
4096 | 0
8192 | 5
16384 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@ 1764
32768 |@@ 86
65536 | 1
131072 | 0
restore_write
value ------------- Distribution ------------- count
8192 | 0
16384 |@ 348
32768 |@ 641
65536 | 5
131072 | 0
262144 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@ 22477
524288 |@ 306
1048576 | 10
2097152 | 7
4194304 | 9
8388608 | 18
16777216 | 8
33554432 | 12
67108864 | 0
restore_read
value ------------- Distribution ------------- count
1024 | 0
2048 |@@ 2562
4096 |@@ 3166
8192 |@@@@@@@@@@@@@@@@@@@ 25090
16384 | 529
32768 | 6
65536 | 0
131072 | 667
262144 |@@@@@@@@@@@@@@@@ 22051
524288 | 79
1048576 | 5
2097152 | 2
4194304 | 11
8388608 | 14
16777216 | 8
33554432 | 12
67108864 | 0
Liam,
I get about the same performance as you. Our datasets are much smaller --
they tend to be in the tens of TB, and the biggest transfer has been 38T
logicalreferenced with a 128k recordsize. The transport is rsh. I used
mbuffer as the transport for a while, but found it's only about 10-15%
faster than rsh, which is a lot easier to automate reliably for
incrementals (for which I also use zrep).
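The pipeline itself is nothing fancy -- roughly this, with made-up dataset
and host names:
  zfs send -i pool/fs@prev pool/fs@now | rsh recvhost zfs receive -uF pool/fs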
Systems on both sides are OmniOS 151008, dual-Xeon 5645, 96GB RAM, X520,
42x4TB SAS in 7 x 6-disk RAIDZ2's. I have also used mirrored pools, both
with and without L2ARC. Network is Intel X520 with mtu=1500 and tcp tuning
similar to yours, connected through a Cisco Nexus stack. I don't get full
network bandwidth -- more like 6-7 Gb/s the last time I checked.
Transfers start at about 350MB/s and slow down over the course of the
transfer, ending in the low 200s.
http://gtf.org/trey/random/zfs/zfs11_send.png
The sending server in this case serves both COMSTAR FC and NFS; it has a
small constant write load and is occasionally read from over FC, but
overall it is not very heavily loaded. The receiving box is otherwise
idle. There's virtually no NFS read load on the sender, so the network
obytes64 counter is effectively the zfs send rate in MB/s.
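(The graph is just the 10GbE link's obytes64 sampled periodically --
something like
  dladm show-link -s -i 10 ixgbe0
gives the same numbers, with the interface name adjusted for your box.)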
On other datasets I have found that block size affects zfs send/recv
transfer rates dramatically; zvols with the default 8k volblocksize have
significantly lower transfer rates than 128k-recordsize filesystems. I'm
also regularly syncing a zvol with 11T on disk and 8k blocks, and have
found that larger incrementals tend to start at the normal transfer rate
but eventually stall for long periods of time (per zfs send -v output). I
haven't had that problem even on multi-TB incrementals of ~20-40T
128k-block datasets.
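For anyone comparing, the properties in question and the verbose send are
just (made-up dataset names):
  zfs get recordsize pool/somefs
  zfs get volblocksize pool/somezvol
  zfs send -v -i pool/somezvol@prev pool/somezvol@now | ...
with the per-second progress coming from the -v output.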
Compression ratio does not seem to affect transfer rates much, which makes
sense because normal pool throughput is clearly not a limiting factor on
either side. Adding L2ARC and a larger arc_meta_max seems to help somewhat
with the small-block incrementals, and my best guess (as a non-expert on
ZFS internals) is that the stalling on incrementals is due to running out
of cached metadata.
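(If someone wants to poke at that theory, the relevant arcstats are easy to
watch during an incremental, e.g.:
  kstat -p zfs:0:arcstats:arc_meta_used
  kstat -p zfs:0:arcstats:arc_meta_limit
  kstat -p zfs:0:arcstats:arc_meta_max
-- though that's just a guess about where to look.)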
My biggest datasets are lz4-compressed about 2x, so the biggest help to me
would be a pipe-enabled command-line compression utility using a weak, very
fast algorithm like lzjb or lz4. Anything that takes much CPU is a
non-starter on 10GbE. Does anyone have something like that (maybe Saso)
or another suggestion? :-)
-- Trey
------------------------------
Liam, I don't think that the commits I mentioned will help you, because
your restore_free() is not slow.
However, it looks like the receive is the bottleneck for you.
Multiplying out the buckets, in each 10-second period you are spending
about 5 seconds in restore_read() and 5 seconds in restore_write() (the
~22,000 calls sitting in the 262144ns bucket of each distribution work out
to several seconds all by themselves). All of these restore_*() calls are
made from a single thread, so that thread is always busy.
restore_read() reads the data from the file descriptor. You can see that
half the operations take ~8us, but the other half take 128us up to 32ms.
If all the operations took 8us, performance would approximately double.
This could be due to network performance or the sender not keeping up.
Given that we know the sender can do better, network issues seem more
likely. Do you have an mbuffer on the receiving side too? That might be
something to try.
restore_write() writes the data into ZFS. There is less opportunity for
improvement here. You can look at the typical write path issues to see if
there is any headroom, but at best this might buy you 25%. Look at txg
times, the write throttle, the amount of dirty data, writes that wait for
reads, etc.
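For example, txg sync times can be eyeballed along the same lines as Dan's
script with an fbt probe on spa_sync -- a rough sketch:
  dtrace -n 'fbt::spa_sync:entry { self->ts = timestamp; }
             fbt::spa_sync:return /self->ts/
             { @["spa_sync ns"] = quantize(timestamp - self->ts); self->ts = 0; }'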
Overall you could probably get a 2x performance improvement if the above
issues can be fixed. To get a larger performance improvement, we'd need to
look into implementing things like multi-threaded zfs receive.
--matt
Post by Liam Slusser
Matt -
I did a zfs send to pipebench, which resulted in the following. Please
note that my current zfs send is still going, eating up 200-220MB/sec, and
there is a big import job writing about 100MB/sec of data to the disks, so
between those two, adding another zfs send left the disks almost 100%
busy. With that said, I was still able to pull 350MB/sec with all that
other stuff running...
./pipebench -q > /dev/null
Piped 10.69 GB in 00h00m31.10s: 351.94 MB/second
I'm sure I could get more off the disks if things weren't so busy.
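(i.e. something along the lines of
  zfs send data/fs@snap | ./pipebench -q > /dev/null
with the real dataset name swapped out.)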
I tested the zfs receive on the other system. Again, I have the currently
running zfs receive job, which is eating 200-220MB/sec, but even with that
here are the results...
Piped 2.34 GB in 00h00m14.92s: 160.65 MB/second
I ran the test twice with similar results. You can see the disks on the
receiving box aren't very busy (see the iostat below); the ~250 Mw/s is
the current zfs send/recv job that is running now. There is nothing else
running on the machine.
device r/s w/s Mr/s Mw/s wait actv svc_t %w %b
data 0.8 3529.7 0.0 259.1 1078.2 104.5 335.0 11 13
device r/s w/s Mr/s Mw/s wait actv svc_t %w %b
data 0.4 3389.4 0.0 272.3 1088.0 110.6 353.6 11 13
device r/s w/s Mr/s Mw/s wait actv svc_t %w %b
data 0.4 2964.7 0.0 241.1 1101.1 98.6 404.6 10 11
So my guess is that the zfs recv side is what's slow?
I ran the dtrace posted earlier in this thread by Dan; here are the
results:
17 76210 :tick-10s
restore_freeobjects
value ------------- Distribution ------------- count
4096 | 0
65536 | 0
restore_free
value ------------- Distribution ------------- count
1024 | 0
8192 | 3
16384 | 0
restore_object
value ------------- Distribution ------------- count
8192 | 0
65536 | 3
131072 | 0
262144 | 0
524288 | 1
1048576 | 0
restore_read
value ------------- Distribution ------------- count
1024 | 0
16384 | 145
32768 | 11
65536 | 6
4194304 | 41
8388608 | 21
16777216 | 14
33554432 | 4
67108864 | 0
restore_write
value ------------- Distribution ------------- count
8192 | 0
16384 | 113
32768 | 13
65536 | 2
131072 | 0
4194304 | 44
8388608 | 23
16777216 | 13
33554432 | 4
67108864 | 0
17 76210 :tick-10s
restore_freeobjects
value ------------- Distribution ------------- count
4096 | 0
65536 | 0
restore_free
value ------------- Distribution ------------- count
1024 | 0
8192 | 2
16384 | 0
restore_object
value ------------- Distribution ------------- count
8192 | 0
65536 | 0
restore_read
value ------------- Distribution ------------- count
1024 | 0
16384 | 219
32768 | 20
65536 | 7
4194304 | 25
8388608 | 13
16777216 | 14
33554432 | 4
67108864 | 0
restore_write
value ------------- Distribution ------------- count
8192 | 0
16384 | 39
32768 | 118
65536 | 7
131072 | 2
4194304 | 31
8388608 | 14
16777216 | 15
33554432 | 4
67108864 | 0
I will try to grab those source patches and try again. My Illumos build
box is a VM on my desktop at work, which is currently off, so I'll have to
work on it tomorrow. I'll report back with the results.
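(A quick way to check whether my illumos-gate clone already has them is
something like:
  git log --oneline | egrep '4370|4047|3834'
before kicking off the build.)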
thanks all!!
liam
Post by Matthew Ahrens
Found the bugs I was thinking of. If you don't already have these fixes:
Date: Mon Dec 9 10:37:51 2013 -0800
4370 avoid transmitting holes during zfs send
4371 DMU code clean up
Date: Tue Aug 20 20:11:52 2013 -0800
4047 panic from dbuf_free_range() from dmu_free_object() while doing
zfs receive
Date: Mon Jul 29 10:58:53 2013 -0800
3834 incremental replication of 'holey' file systems is slow
Post by Matthew Ahrens
Oh right. I have made several fixes in this area (restore_free() being
slow). Will send details tomorrow.
--matt
Post by Dan McDonald
I'll be very interested to hear how people's experiences are with this.
I've a customer suffering similarly (but with a 2TB incremental
snapshot-send). We had him use mbuffer already, but to no avail. During
the receive we ran the following DTrace script:
#!/usr/sbin/dtrace -s

fbt:zfs:restore_*:entry
{
        self->start = timestamp;
}

fbt:zfs:restore_*:return
/self->start/
{
        @runtime[probefunc] = quantize(timestamp - self->start);
        self->start = 0;
}

profile:::tick-10s
{
        /* print and reset the per-function latency distributions every 10s */
        printa(@runtime);
        trunc(@runtime);
}
And I noticed that restore_free() seems to take a long time. Here's a
sample:
0 80473 :tick-10s
restore_read
value ------------- Distribution ------------- count
512 | 0
1024 | 1
32768 | 1
65536 | 1
131072 | 0
restore_write
value ------------- Distribution ------------- count
8192 | 0
131072 | 3
262144 | 1
524288 | 0
restore_free
value ------------- Distribution ------------- count
268435456 | 0
536870912 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@ 13
1073741824 | 0
That's .5sec per call for 13 calls. Yike?!
Dan