Discussion: zfs send receive performance
Liam Slusser
2014-04-17 01:37:32 UTC
All -

I am trying to replicate a very large, ~410T, zfs volume from one server to
another using zfs send / receive. I'm using zrep to handle snapshot
creation, replication, etc. Both servers have identical hardware, with a
fiber-optic 10g ethernet card directly connected between them (no switch in
the middle).

Quick rundown on the hardware...

OpenIndiana 151
zpool recordsize is 128k
Dell R720xd w/ 14 MD1200 drive shelves (each shelf has 12 x 4T SAS) (14
raidz2 stripes)
64G RAM
4 x LSI 9207-8e 6Gb SAS cards
Intel dual-port 10g fiber optic ethernet
1 x Samsung 840 PRO SSD 512G for L2ARC
2 (mirror) x Samsung 840 PRO SSD 256G for ZIL

I've done some tcp tuning...
/dev/tcp tcp_xmit_hiwat 1048576
/dev/tcp tcp_recv_hiwat 1048576
/dev/tcp tcp_max_buf 16777216
/dev/tcp tcp_cwnd_max 1048576
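
(For reference, these are applied with ndd on illumos, along the lines of:)

ndd -set /dev/tcp tcp_max_buf 16777216
ndd -set /dev/tcp tcp_cwnd_max 1048576
ndd -set /dev/tcp tcp_xmit_hiwat 1048576
ndd -set /dev/tcp tcp_recv_hiwat 1048576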

I started the replication using the built-in ssh, which got me around
33MB/sec. I upgraded to hpn-ssh and, using the NONE cipher, got around
100MB/sec. I modified zrep to use mbuffer with a 4g buffer and got it up
to 225MB/sec.
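
(Roughly, the hpn-ssh step looked like the sketch below - the option names are
hpn-ssh's, the dataset/host names are placeholders, and the NONE cipher also has
to be allowed in the receiving side's sshd_config:)

zfs send data/fs@zrep_000000 | \
  ssh -o NoneEnabled=yes -o NoneSwitch=yes store02 "zfs receive -F data/fs"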

Looking at iostat, nothing is extremely exciting; each disk isn't doing a
lot of work, and the total "busy" for the zpool is ~12%. In testing I am
able to read/write data on the zfs array at over 2GB/sec...

I tested the network link between each server and am able to move bits
across the wire at close to 10gbit.

Any hints on getting better performance? Even at 225MB/sec it's going to
take over 3 weeks(!) to do the initial sync. I am hoping to see numbers 3x
what I'm currently getting.

thanks!!
liam



Matthew Ahrens
2014-04-17 02:29:56 UTC
Couple of things to check:

Is the receiving system overloaded? I would expect not, since you said it
is an identical system, but it doesn't hurt to check.

If you take the network and the receiving system out of the picture by
doing "zfs send >/dev/null" (or equivalent), how fast does that go? If it
is super fast then the problem must be ssh, or network, or mbuffer, or
receiving system, or .... Given that the sending zpool is only 12% busy, I
am guessing that sending >/dev/null should go at least 8x faster,
indicating that the problem is one of the above.
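
Something along the lines of (snapshot name is just an example):

time zfs send data/fs@zrep_000000 > /dev/null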

--matt
Ian Collins
2014-04-17 04:42:08 UTC
In addition to Matt's suggestions, try sending a smallish (a few GB)
snapshot to a file, copy the file to the other system (in /tmp) and cat
the file through zfs receive to test the write speed.
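
Roughly, with made-up dataset/file names (a throughput meter like pipebench in
the middle gives MB/sec directly):

# on the sender
zfs send data/fs@small > /tmp/small.zstream
scp /tmp/small.zstream store02:/tmp/

# on the receiver, into a throwaway dataset
cat /tmp/small.zstream | ./pipebench -q | zfs receive data/recvtest
zfs destroy -r data/recvtest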

I've had trouble getting decent speeds between 720s using Intel copper
10GE cards, but I haven't tried fibre.
--
Ian.
Dan McDonald
2014-04-17 04:52:40 UTC
I'll be very interested to hear about people's experiences with this.

I've a customer suffering similarly (but with a 2TB incremental snapshot-send). We had him use mbuffer already, but to no avail. During the middle of his receive, I ran this DTrace script:

#!/usr/sbin/dtrace -s

/* Timestamp the entry of every zfs restore_*() call on the receiver. */
fbt:zfs:restore_*:entry
{
        self->start = timestamp;
}

/* On return, add the call's duration (ns) to a per-function histogram. */
fbt:zfs:restore_*:return
{
        @runtime[probefunc] = quantize(timestamp - self->start);
}

/* Print and reset the histograms every 10 seconds. */
profile:::tick-10s
{
        printa(@runtime);
        clear(@runtime);
}

And I noticed that restore_free() seems to take a long time. Here's a sample histogram:

0 80473 :tick-10s
restore_read
value ------------- Distribution ------------- count
512 | 0
1024 | 1
2048 |@@ 31
4096 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@ 408
8192 |@@@ 43
16384 |@ 12
32768 | 1
65536 | 1
131072 | 0

restore_write
value ------------- Distribution ------------- count
8192 | 0
16384 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@ 190
32768 |@@@@@@ 34
65536 |@@ 14
131072 | 3
262144 | 1
524288 | 0

restore_free
value ------------- Distribution ------------- count
268435456 | 0
536870912 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@ 13
1073741824 | 0


That's .5sec per call for 13 calls. Yike?!

Dan
Matthew Ahrens
2014-04-17 05:28:14 UTC
Oh right. I have made several fixes in this area (restore_free() being
slow). Will send details tomorrow.

--matt
Matthew Ahrens
2014-04-17 06:35:18 UTC
Found the bugs I was thinking of. If you don't already have these fixes,
upgrading to them should help with the slow restore_free():

Author: Max Grossman <***@delphix.com>
Date: Mon Dec 9 10:37:51 2013 -0800
4370 avoid transmitting holes during zfs send
4371 DMU code clean up
Reviewed by: Matthew Ahrens <***@delphix.com>
Reviewed by: George Wilson <***@delphix.com>
Reviewed by: Christopher Siden <***@delphix.com>
Reviewed by: Josef 'Jeff' Sipek <***@josefsipek.net>
Approved by: Garrett D'Amore <***@damore.org>

Author: Matthew Ahrens <***@delphix.com>
Date: Tue Aug 20 20:11:52 2013 -0800

4047 panic from dbuf_free_range() from dmu_free_object() while doing
zfs receive
Reviewed by: Adam Leventhal <***@delphix.com>
Reviewed by: George Wilson <***@delphix.com>
Approved by: Dan McDonald <***@nexenta.com>

Author: Matthew Ahrens <***@delphix.com>
Date: Mon Jul 29 10:58:53 2013 -0800

3834 incremental replication of 'holey' file systems is slow
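
(For anyone checking whether their bits already include these, the issue numbers
can be grepped for in an illumos-gate clone, e.g.:)

cd illumos-gate && git log --oneline | egrep '4370|4047|3834'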
Liam Slusser
2014-04-17 07:05:34 UTC
Matt -

I did a zfs send to pipebench, which resulted in the following. Please note
that my current zfs send is still going, eating up 200-220MB/sec, and there
is a big import job writing about 100MB/sec of data to the disks. Between
the two, adding another zfs send resulted in the disks being almost 100%
busy. With that said, I was still able to pull 350MB/sec with all that
other stuff running...

***@store01:/# zfs send -p data/***@zrep_000000 2> /dev/null |
./pipebench -q > /dev/null
Summary:

Piped 10.69 GB in 00h00m31.10s: 351.94 MB/second

I'm sure I could get more off the disks if things weren't so busy.

I tested the zfs receive on the other system - again, I have the current
zfs receive job running, which is eating 200-220MB/sec, but even with that
here are the results...

***@store02:/# cat /tmp/test-now | ./pipebench -q | zfs recv -F data/test
Summary:

Piped 2.34 GB in 00h00m14.92s: 160.65 MB/second

I ran the test twice with similar results. You can see the disks on the
receiving box aren't very busy (see iostat below). The ~250MB/sec of writes
(Mw/s) is from the zfs send/recv job that is currently running. There is
nothing else running on the machine.

***@store02:/# iostat -x -M data 5
device r/s w/s Mr/s Mw/s wait actv svc_t %w %b
data 0.8 3529.7 0.0 259.1 1078.2 104.5 335.0 11 13
device r/s w/s Mr/s Mw/s wait actv svc_t %w %b
data 0.4 3389.4 0.0 272.3 1088.0 110.6 353.6 11 13
device r/s w/s Mr/s Mw/s wait actv svc_t %w %b
data 0.4 2964.7 0.0 241.1 1101.1 98.6 404.6 10 11

So my guess is that perhaps the zfs recv is the slow part?

I ran the DTrace script Dan posted earlier in this thread; here are the
results:

17 76210 :tick-10s
restore_freeobjects
value ------------- Distribution ------------- count
4096 | 0
8192 |@@@@@@@@@@@@@@@@@@@@@@@ 19
16384 |@@@@@@@@@@@@@ 11
32768 |@@@@ 3
65536 | 0

restore_free
value ------------- Distribution ------------- count
1024 | 0
2048 |@@@@@@@@@@@@@@@@@@@@@@@@@@ 423
4096 |@@@@@@@@@@@@@@ 224
8192 | 3
16384 | 0

restore_object
value ------------- Distribution ------------- count
8192 | 0
16384 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@ 345
32768 |@@@ 33
65536 | 3
131072 | 0
262144 | 0
524288 | 1
1048576 | 0

restore_read
value ------------- Distribution ------------- count
1024 | 0
2048 |@ 667
4096 |@@ 1595
8192 |@@@@@@@@@@@@@@@@@@ 14692
16384 | 145
32768 | 11
65536 | 6
131072 |@@@@@@@ 5886
262144 |@@@@@@@@@ 7305
524288 |@ 486
1048576 |@ 894
2097152 |@ 786
4194304 | 41
8388608 | 21
16777216 | 14
33554432 | 4
67108864 | 0

restore_write
value ------------- Distribution ------------- count
8192 | 0
16384 | 113
32768 | 13
65536 | 2
131072 | 0
262144 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@ 12903
524288 |@@ 722
1048576 |@@ 862
2097152 |@@ 854
4194304 | 44
8388608 | 23
16777216 | 13
33554432 | 4
67108864 | 0


17 76210 :tick-10s
restore_freeobjects
value ------------- Distribution ------------- count
4096 | 0
8192 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@ 56
16384 |@@@@@@@@ 14
32768 |@@ 3
65536 | 0

restore_free
value ------------- Distribution ------------- count
1024 | 0
2048 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@ 665
4096 |@@@@@@@@@@ 209
8192 | 2
16384 | 0

restore_object
value ------------- Distribution ------------- count
8192 | 0
16384 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@ 446
32768 |@@ 20
65536 | 0

restore_read
value ------------- Distribution ------------- count
1024 | 0
2048 |@ 966
4096 |@@ 1708
8192 |@@@@@@@@@@@@@@@@@@ 15003
16384 | 219
32768 | 20
65536 | 7
131072 |@@@@@@@ 5738
262144 |@@@@@@@@@ 7713
524288 |@ 587
1048576 |@ 872
2097152 |@ 770
4194304 | 25
8388608 | 13
16777216 | 14
33554432 | 4
67108864 | 0

restore_write
value ------------- Distribution ------------- count
8192 | 0
16384 | 39
32768 | 118
65536 | 7
131072 | 2
262144 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@ 13166
524288 |@@ 791
1048576 |@@ 858
2097152 |@@ 844
4194304 | 31
8388608 | 14
16777216 | 15
33554432 | 4
67108864 | 0

I will try to grab those source patches and try again. My Illumos build
box is a VM on my desktop at work, which is currently off, so I'll have to
work on it tomorrow. I'll report back with the results.

thanks all!!
liam
Ian Collins
2014-04-17 08:58:27 UTC
Post by Liam Slusser
So my guess is that perhaps the zfs recv is the slow part?
I ran the DTrace script Dan posted earlier in this thread; here are the results:

restore_free (Liam's receiver)
     value  ------------- Distribution ------------- count
      1024 |                                         0
      2048 |@@@@@@@@@@@@@@@@@@@@@@@@@@               423
      4096 |@@@@@@@@@@@@@@                           224
      8192 |                                         3
     16384 |                                         0

restore_free (Dan's customer, quoted earlier)
     value  ------------- Distribution ------------- count
 268435456 |                                         0
 536870912 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@ 13
1073741824 |                                         0
As you can see, your restore_free results are quite normal compared to
Dan's!
--
Ian.
Matthew Ahrens
2014-04-17 16:10:16 UTC
Liam, I don't think that the commits I mentioned will help you, because
your restore_free() is not slow.

However, it looks like it is the receive that is the bottleneck for you.
Multiplying out the buckets, in each 10 second period you are spending
about 5 seconds in restore_read() and 5 seconds in restore_write(). All of
these restore_*() calls are made from a single thread, so that thread is
always busy.
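
(A rough way to see that directly, rather than eyeballing the quantize buckets,
is to sum() the time per function - a sketch:)

dtrace -n '
fbt:zfs:restore_*:entry { self->ts = timestamp; }
fbt:zfs:restore_*:return /self->ts/ {
        @total[probefunc] = sum(timestamp - self->ts);  /* ns per function */
        self->ts = 0;
}
tick-10s {
        normalize(@total, 1000000);                     /* report as ms per 10s */
        printa(@total);
        clear(@total);
}'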

restore_read() reads the data from the file descriptor. You can see that
half the operations take ~8us, but the other half take 128us up to 32ms.
If all the operations took 8us, performance would approximately double.
This could be due to network performance or the sender not keeping up.
Given that we know the sender can do better, network issues seem more
likely. Do you have an mbuffer on the receiving side too? That might be
something to try.

restore_write() writes the data into ZFS. There is less opportunity for
improvement here. You can look at the typical write path issues to see if
there is room for improvement, but at best this might get a 25%
improvement. Look at txg times, write throttle, amount of dirty data,
writes that wait for reads, etc.
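
(One rough way to look at the txg side on the receiver - a sketch that times
each spa_sync() call:)

dtrace -n '
fbt::spa_sync:entry { self->ts = timestamp; }
fbt::spa_sync:return /self->ts/ {
        @["spa_sync time (ms)"] = quantize((timestamp - self->ts) / 1000000);
        self->ts = 0;
}'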

Overall you could probably get a 2x performance improvement if the above
issues can be fixed. To get a larger performance improvement, we'd need to
look into implementing things like multi-threaded zfs receive.

--matt
Trey Palmer
2014-04-17 20:27:12 UTC
Liam,

I get about the same performance as you. Our datasets are much smaller, they tend to be in the tens of TB, and the biggest transfer has been 38T logicalreferenced with 128k recordsize. The transport is rsh. I used mbuffer as the transport for a while, but found it's only about 10-15% faster than rsh, which is a lot easier to automate reliably for incrementals (for which I also use zrep).

Systems on both sides are OmniOS 151008, dual-Xeon 5645, 96GB RAM, X520, 42x4TB SAS in 7 x 6-disk RAIDZ2's. I have also used mirrored pools, both with and without L2ARC. Network is Intel X520 with mtu=1500 and tcp tuning similar to yours, connected through a Cisco Nexus stack. I don't get full network bandwidth -- more like 6-7 Gb/s the last time I checked.

Transfers start at about 350MB/s and they slow down through the process, ending in the low 200's.

Here are some Graphite graphs from the sending side of the 38T transfer:

http://gtf.org/trey/random/zfs/zfs11_send.png

The sending server in this case serves both COMSTAR FC and NFS, and it has a small constant write load and is occasionally read from on FC. But overall it is not very heavily loaded. The receiving box is otherwise idle. There's virtually no NFS read load on the sender so the network obytes64 is effectively the zfs send rate in MB/s.

On other datasets I have found that blocksize affects zfs send-recv transfer rate dramatically, and zvol's with the default 8k volblocksize have significantly lower transfer rates than 128k recordsize filesystems. I'm also regularly syncing a zvol with 11T on disk and 8k blocks, and found that larger incrementals tended to start at the normal transfer rate but eventually stall for long periods of time (per zfs send -v output). I haven't had that problem even on multi-TB incrementals on ~20-40T 128k-block datasets.

Compression ratio does not seem to affect transfer rates much, which makes sense because the normal pool throughput is clearly not a limiting factor on either side. Adding L2ARC and a larger arc_meta_max seems to help somewhat with the small-block incrementals, and my best guess (as a non-expert on ZFS internals) is that the stalling on incrementals is due to running out of cached metadata.

My biggest datasets are lz4-compressed about 2x, so the biggest help to me would be a pipe-enabled command-line compression utility using a weak, very fast algorithm like lzjb or lz4. Anything that takes much CPU is a non-starter on 10GbE. Does anyone have something like that (maybe Saso) or another suggestion? :-)

-- Trey

Liam Slusser
2014-04-18 00:02:35 UTC
Trey -

Your setup is similar to mine. I have the exact same 10g card, the Intel
X520 (X520-SR2 in my case), with the same MTU. Have you tried to increase
the MTU to 9000? Not sure if mbuffer would take advantage of that or not.

My server just does NFS server duty.

Google Code hosts an lz4 command-line utility that will do stdin/stdout lz4
compression. You should give that a go. https://code.google.com/p/lz4/
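
Something like this should work for your datasets, assuming that utility builds
on OmniOS (the exact binary name and flags depend on the lz4 release, so treat
this as a sketch; host/dataset names are made up):

zfs send -i tank/fs@snap1 tank/fs@snap2 | lz4 | \
  rsh recvhost "lz4 -d | zfs receive -F tank/fs"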

My data is all audio files in various codecs, which aren't compressible, so
using compression would just result in wasted CPU cycles.


Matt,

I was running mbuffer on both sides, but I had a different block size on
the receiving side, which I've fixed. That helped a bit. I've noticed that
the sending-side buffer is at 100%, but the receiving-side buffer is always
at 0%.

Looks like a network problem, so I played around with different block sizes
in mbuffer...

1M = 200MB/sec
128k = 210MB/sec
64k = 319MB/sec
32k = 412MB/sec
16k = 350MB/sec

So it looks like, at least on Illumos, a block size of 32k results in the
best performance. The sweet spot seems to be a 32k block size and a 1G
buffer; both 512M and 2G buffers result in slower performance for whatever
reason.
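
(For reference, the pipeline with those settings looks roughly like this;
mbuffer's -s is the block size, -m the buffer size, and the port/dataset names
are arbitrary:)

# receiving side, started first
mbuffer -q -s 32k -m 1G -I 9090 | zfs receive -F data/fs

# sending side
zfs send -p data/fs@zrep_000000 | mbuffer -q -s 32k -m 1G -O store02:9090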

Initially my zpool is at about 15% busy, with the sender buffer at 100% and
the recv buffer at 0%, but a few minutes into the transfer the busy jumps to
100% and the recv buffer creeps up to 100%. So it's now waiting on the
disks, and the transfer slows to around 350MB/sec.

device r/s w/s Mr/s Mw/s wait actv svc_t %w %b
data 0.0 4507.9 0.0 384.1 2672.3 172.6 631.1 13 15
data 0.0 4644.8 0.0 391.8 2549.9 165.1 584.5 12 14
data 0.0 4706.7 0.0 397.7 2545.1 170.5 577.0 14 15
data 0.0 4594.4 0.0 392.2 2769.0 172.2 640.2 14 16
data 0.0 4488.8 0.0 379.4 2496.2 167.7 593.5 13 14
data 0.0 4685.9 0.0 390.7 2622.7 173.5 596.7 14 16
data 0.0 4704.3 0.0 406.4 2751.5 179.7 623.1 14 15
data 0.0 4501.0 0.0 389.1 2844.6 172.0 670.2 13 14
data 0.0 4869.8 0.0 410.2 2716.0 170.7 592.8 16 26
data 0.4 4788.7 0.0 422.6 2668.0 169.1 592.4 21 42
data 1.2 4811.2 0.0 428.0 2126.1 170.5 477.2 19 47
data 1.0 4880.7 0.0 419.4 2335.0 170.0 513.1 30 60
data 7.4 4472.1 0.0 391.3 2577.5 183.1 616.3 85 100
data 6.0 4304.9 0.0 385.0 2884.7 181.8 711.3 82 100
data 8.8 4537.0 0.0 393.3 2936.5 183.1 686.3 87 100
data 12.8 4641.4 0.0 392.8 2875.0 178.3 656.0 88 100
data 15.2 4625.7 0.0 395.5 3145.5 178.9 716.3 87 100
data 13.0 4329.9 0.0 382.0 2720.2 179.5 667.7 84 100
data 13.0 4668.2 0.0 418.1 3213.0 194.0 727.8 84 100
data 26.5 4470.0 0.0 389.8 2754.4 184.9 653.7 89 100
data 10.2 4834.5 0.0 395.5 2569.6 182.5 568.1 83 100
data 15.8 4405.2 0.0 398.2 2919.4 190.8 703.5 87 100
data 20.4 4577.7 0.0 388.2 2744.7 180.0 636.1 86 100

Looks like there are a small number of reads that start at the same time
the busy increases... but I wouldn't think 15 reads/sec would cause that
much usage? I tested it a few times; any time there are some reads, the
busy time increases. Weird.

Here is the dtrace output of the recv side when the disks are 15% and 100%
busy.

When the disks are at 15% busy:

18 76210 :tick-10s
restore_free
value ------------- Distribution ------------- count
1024 | 0
2048 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@ 334
4096 |@@@@@@@@@@ 116
8192 | 0

restore_object
value ------------- Distribution ------------- count
4096 | 0
8192 |@ 8
16384 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@ 301
32768 |@ 11
65536 | 0

restore_read
value ------------- Distribution ------------- count
1024 | 0
2048 | 555
4096 |@ 660
8192 |@@@@@@@@@@@@@@@@@@@ 24231
16384 | 229
32768 | 8
65536 | 0
131072 |@@ 2459
262144 |@@@@@@@@@@@@@@@@@ 21684
524288 | 34
1048576 | 1
2097152 | 0
4194304 | 5
8388608 | 8
16777216 | 11
33554432 | 8
67108864 | 1
134217728 | 0

restore_write
value ------------- Distribution ------------- count
4096 | 0
8192 | 2
16384 | 188
32768 | 1
65536 | 0
131072 | 1
262144 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@ 23962
524288 | 214
1048576 | 1
2097152 | 0
4194304 | 5
8388608 | 8
16777216 | 11
33554432 | 8
67108864 | 1
134217728 | 0

afterwards when the disks are at 100%:

16 76210 :tick-10s
restore_free
value ------------- Distribution ------------- count
1024 | 0
2048 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@ 2067
4096 |@@@@@@@@@@@ 737
8192 | 3
16384 | 0

restore_object
value ------------- Distribution ------------- count
4096 | 0
8192 | 5
16384 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@ 1764
32768 |@@ 86
65536 | 1
131072 | 0

restore_write
value ------------- Distribution ------------- count
8192 | 0
16384 |@ 348
32768 |@ 641
65536 | 5
131072 | 0
262144 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@ 22477
524288 |@ 306
1048576 | 10
2097152 | 7
4194304 | 9
8388608 | 18
16777216 | 8
33554432 | 12
67108864 | 0

restore_read
value ------------- Distribution ------------- count
1024 | 0
2048 |@@ 2562
4096 |@@ 3166
8192 |@@@@@@@@@@@@@@@@@@@ 25090
16384 | 529
32768 | 6
65536 | 0
131072 | 667
262144 |@@@@@@@@@@@@@@@@ 22051
524288 | 79
1048576 | 5
2097152 | 2
4194304 | 11
8388608 | 14
16777216 | 8
33554432 | 12
67108864 | 0
Liam Slusser
2014-04-18 00:05:22 UTC
Trey -

Here is another version that's multithreaded, although I haven't tried to
compile it on Illumos.

https://github.com/t-mat/lz4mt

Let me/us all know if it works out.

thanks!
liam
Post by Liam Slusser
Trey -
Your setup is similar to mine. I have the exact same 10g card, the Intel
X520 (X520-SR2 in my case), with the same MTU. Have you tried to increase
the MTU to 9000? Not sure if mbuffer would take advantage of that or not.
My server just does NFS server duty.
Google has a lz4 command line utility that will do stdin/stdout lz4
compression. You should give that a go. https://code.google.com/p/lz4/
My data is all audio files in various codecs which aren't compressible.
So using compression would just result in wasted cpu cycles.
Matt,
I was running mbuffer on both sides, but I had a different block size on
the receiving side which i've fixed. That helped a bit. I've noticed that
the sending side buffer is at 100%, but the receiving side buffer is always
at 0%.
Looks like a network problem, so I played around with different block
sizes in mbuffer...
1M = 200MB/sec
128k = 210MB/sec
64k = 319MB/sec
32k = 412MB/sec
16k = 350MB/sec
So it looks like, at least on Illumos, a block size of 32k results in the
best performance. The sweet spot seems to be a 32k block size and a 1G
buffer, both 512M and 2G buffer result in slower performance for whatever
reason.
Initially my zpool is at about 15% busy, with the sender buffer at 100%
and the recv buffer at 0%, but after a few minutes into the transfer the
busy jumps to 100% and the recv buffer creeps up to 100%. So its now
waiting on the disks and the transfer slows to around 350MB/sec
device r/s w/s Mr/s Mw/s wait actv svc_t %w %b
data 0.0 4507.9 0.0 384.1 2672.3 172.6 631.1 13 15
data 0.0 4644.8 0.0 391.8 2549.9 165.1 584.5 12 14
data 0.0 4706.7 0.0 397.7 2545.1 170.5 577.0 14 15
data 0.0 4594.4 0.0 392.2 2769.0 172.2 640.2 14 16
data 0.0 4488.8 0.0 379.4 2496.2 167.7 593.5 13 14
data 0.0 4685.9 0.0 390.7 2622.7 173.5 596.7 14 16
data 0.0 4704.3 0.0 406.4 2751.5 179.7 623.1 14 15
data 0.0 4501.0 0.0 389.1 2844.6 172.0 670.2 13 14
data 0.0 4869.8 0.0 410.2 2716.0 170.7 592.8 16 26
data 0.4 4788.7 0.0 422.6 2668.0 169.1 592.4 21 42
data 1.2 4811.2 0.0 428.0 2126.1 170.5 477.2 19 47
data 1.0 4880.7 0.0 419.4 2335.0 170.0 513.1 30 60
data 7.4 4472.1 0.0 391.3 2577.5 183.1 616.3 85 100
data 6.0 4304.9 0.0 385.0 2884.7 181.8 711.3 82 100
data 8.8 4537.0 0.0 393.3 2936.5 183.1 686.3 87 100
data 12.8 4641.4 0.0 392.8 2875.0 178.3 656.0 88 100
data 15.2 4625.7 0.0 395.5 3145.5 178.9 716.3 87 100
data 13.0 4329.9 0.0 382.0 2720.2 179.5 667.7 84 100
data 13.0 4668.2 0.0 418.1 3213.0 194.0 727.8 84 100
data 26.5 4470.0 0.0 389.8 2754.4 184.9 653.7 89 100
data 10.2 4834.5 0.0 395.5 2569.6 182.5 568.1 83 100
data 15.8 4405.2 0.0 398.2 2919.4 190.8 703.5 87 100
data 20.4 4577.7 0.0 388.2 2744.7 180.0 636.1 86 100
Looks like there are a few number of reads that start at the same time the
busy increases... but I wouldn't think 15 reads/sec would cause that much
usage? I tested it a few times, anytime there is some reads the busy time
increases. weird.
Here is the dtrace output of the recv side when the disks are 15% and 100%
busy.
18 76210 :tick-10s
restore_free
value ------------- Distribution ------------- count
1024 | 0
8192 | 0
restore_object
value ------------- Distribution ------------- count
4096 | 0
65536 | 0
restore_read
value ------------- Distribution ------------- count
1024 | 0
2048 | 555
16384 | 229
32768 | 8
65536 | 0
524288 | 34
1048576 | 1
2097152 | 0
4194304 | 5
8388608 | 8
16777216 | 11
33554432 | 8
67108864 | 1
134217728 | 0
restore_write
value ------------- Distribution ------------- count
4096 | 0
8192 | 2
16384 | 188
32768 | 1
65536 | 0
131072 | 1
524288 | 214
1048576 | 1
2097152 | 0
4194304 | 5
8388608 | 8
16777216 | 11
33554432 | 8
67108864 | 1
134217728 | 0
16 76210 :tick-10s
restore_free
value ------------- Distribution ------------- count
1024 | 0
8192 | 3
16384 | 0
restore_object
value ------------- Distribution ------------- count
4096 | 0
8192 | 5
65536 | 1
131072 | 0
restore_write
value ------------- Distribution ------------- count
8192 | 0
65536 | 5
131072 | 0
1048576 | 10
2097152 | 7
4194304 | 9
8388608 | 18
16777216 | 8
33554432 | 12
67108864 | 0
restore_read
value ------------- Distribution ------------- count
1024 | 0
16384 | 529
32768 | 6
65536 | 0
131072 | 667
524288 | 79
1048576 | 5
2097152 | 2
4194304 | 11
8388608 | 14
16777216 | 8
33554432 | 12
67108864 | 0
Liam,
I get about the same performance as you. Our datasets are much smaller,
they tend to be in the 10's of TB and the biggest transfer has been 38T
logicalreferenced with 128k recordsize. The transport is rsh. I used
mbuffer as the transport for a while, but found it's only about 10-15%
faster than rsh which is a lot easier to automate reliably for incrementals
(for which I also use zrep).
Systems on both sides are OmniOS 151008, dual-Xeon 5645, 96GB RAM, X520,
42x4TB SAS in 7 x 6-disk RAIDZ2's. I have also used mirrored pools, both
with and without L2ARC. Network is Intel X520 with mtu=1500 and tcp tuning
similar to yours, connected through a Cisco Nexus stack. I don't get full
network bandwidth -- more like 6-7 Gb/s the last time I checked.
Transfers start at about 350MB/s and they slow down through the process,
ending in the low 200's.
http://gtf.org/trey/random/zfs/zfs11_send.png
The sending server in this case serves both COMSTAR FC and NFS, and it
has a small constant write load and is occasionally read from on FC. But
overall it is not very heavily loaded. The receiving box is otherwise
idle. There's virtually no NFS read load on the sender so the network
obytes64 is effectively the zfs send rate in MB/s.
On other datasets I have found that blocksize affects zfs send-recv
transfer rate dramatically, and zvol's with the default 8k volblocksize
have significantly lower transfer rates than 128k recordsize filesystems.
I'm also regularly syncing a zvol with 11T on disk and 8k blocks, and found
that larger incrementals tended to start at the normal transfer rate but
eventually stall for long periods of time (per zfs send -v output). I
haven't had that problem even on multi-TB incrementals on ~20-40T
128k-block datasets.
Compression ratio does not seem to affect transfer rates much, which
makes sense because the normal pool throughput is clearly not a limiting
factor on either side. Adding L2ARC and a larger arc_meta_max seems to
help somewhat with the small-block incrementals and my best guess (as a
non-expert on ZFS internals) that the stalling on incrementals is due to
running out of cached metadata.
My biggest datasets are lz4-compressed about 2x, so the biggest help to
me would be a pipe-enabled command-line compression utility using a weak,
very fast algorithm like lzjb or lz4. Anything that takes much CPU is a
non-starter on 10GbE. Does anyone have something like that (maybe Saso)
or another suggestion? :-)
-- Trey
------------------------------
Liam, I don't think that the commits I mentioned will help you, because
your restore_free() is not slow.
However, it looks like it is the receive that is the bottleneck for you.
Multiplying out the buckets, in each 10 second period you are spending
about 5 seconds in restore_read() and 5 seconds in restore_write(). All of
these restore_*() calls are made from a single thread, so that thread is
always busy.
restore_read() reads the data from the file descriptor. You can see that
half the operations take ~8us, but the other half take anywhere from
128us up to 32ms.
If all the operations took 8us, performance would approximately double.
This could be due to network performance or the sender not keeping up.
Given that we know the sender can do better, network issues seem more
likely. Do you have an mbuffer on the receiving side too? That might be
something to try.
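For example (a sketch only; the port, buffer size, hostname, and dataset
names below are placeholders), mbuffer can carry the stream over its own
TCP connection with a buffer on each end, starting the receiver first:

# receiving host
mbuffer -I 10000 -q -m 4G | zfs receive -F data/fs

# sending host
zfs send -i data/fs@snap1 data/fs@snap2 | mbuffer -O recvhost:10000 -q -m 4G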
restore_write() writes the data into ZFS. There is less opportunity for
improvement here. You can look at the typical write path issues to see if
there is room for improvement, but at best this might get a 25%
improvement. Look at txg times, write throttle, amount of dirty data,
writes that wait for reads, etc.
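As a rough starting point on the txg side, something along these lines
(an untested sketch in the spirit of Dan's restore_*() script) would show
how long each txg sync takes on the receiving box:

#!/usr/sbin/dtrace -s

/* Sketch: quantize spa_sync() duration; this covers every pool on the box. */
fbt:zfs:spa_sync:entry
{
        self->start = timestamp;
}

fbt:zfs:spa_sync:return
/self->start/
{
        @synctime = quantize(timestamp - self->start);
        self->start = 0;
}

profile:::tick-10s
{
        printa(@synctime);
        trunc(@synctime);
}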
Overall you could probably get a 2x performance improvement if the above
issues can be fixed. To get a larger performance improvement, we'd need to
look into implementing things like multi-threaded zfs receive.
--matt
Post by Liam Slusser
Matt -
I did a zfs send to pipebench, which resulted in the following. Please
note that my current zfs send is still going, eating up 200-220MB/sec, and
there is a big import job writing about 100MB/sec of data to the disks.
So between the two, adding another zfs send resulted in the disks being
almost 100% busy. However, with that said, I was still able to pull
350MB/sec with all that other stuff running...
./pipebench -q > /dev/null
Piped 10.69 GB in 00h00m31.10s: 351.94 MB/second
I'm sure I could get more off the disks if things weren't so busy.
I tested the zfs receive on the other system - again, I have the current
zfs receive job running, which is eating 200-220MB/sec, but even with that
here are the results...
Piped 2.34 GB in 00h00m14.92s: 160.65 MB/second
I ran the test twice with similar results. You can see the disks on the
receiving box aren't very busy (see iostat below). The 250Mw/sec is the
current zfs send/recv job that is running now. There is nothing else
running on the machine.
device r/s w/s Mr/s Mw/s wait actv svc_t %w %b
data 0.8 3529.7 0.0 259.1 1078.2 104.5 335.0 11 13
device r/s w/s Mr/s Mw/s wait actv svc_t %w %b
data 0.4 3389.4 0.0 272.3 1088.0 110.6 353.6 11 13
device r/s w/s Mr/s Mw/s wait actv svc_t %w %b
data 0.4 2964.7 0.0 241.1 1101.1 98.6 404.6 10 11
So my guess is that the zfs recv side is perhaps the slow part?
I ran the DTrace script Dan posted earlier in this thread; here are the
results:
17 76210 :tick-10s
restore_freeobjects
value ------------- Distribution ------------- count
4096 | 0
65536 | 0
restore_free
value ------------- Distribution ------------- count
1024 | 0
8192 | 3
16384 | 0
restore_object
value ------------- Distribution ------------- count
8192 | 0
65536 | 3
131072 | 0
262144 | 0
524288 | 1
1048576 | 0
restore_read
value ------------- Distribution ------------- count
1024 | 0
16384 | 145
32768 | 11
65536 | 6
4194304 | 41
8388608 | 21
16777216 | 14
33554432 | 4
67108864 | 0
restore_write
value ------------- Distribution ------------- count
8192 | 0
16384 | 113
32768 | 13
65536 | 2
131072 | 0
4194304 | 44
8388608 | 23
16777216 | 13
33554432 | 4
67108864 | 0
17 76210 :tick-10s
restore_freeobjects
value ------------- Distribution ------------- count
4096 | 0
65536 | 0
restore_free
value ------------- Distribution ------------- count
1024 | 0
8192 | 2
16384 | 0
restore_object
value ------------- Distribution ------------- count
8192 | 0
65536 | 0
restore_read
value ------------- Distribution ------------- count
1024 | 0
16384 | 219
32768 | 20
65536 | 7
4194304 | 25
8388608 | 13
16777216 | 14
33554432 | 4
67108864 | 0
restore_write
value ------------- Distribution ------------- count
8192 | 0
16384 | 39
32768 | 118
65536 | 7
131072 | 2
4194304 | 31
8388608 | 14
16777216 | 15
33554432 | 4
67108864 | 0
I will try to grab those source patches and try again. My Illumos build
box is a VM on my desktop at work which is currently off so I'll have to
work on it tomorrow. I'll report back with the results.
thanks all!!
liam
Post by Matthew Ahrens
Found the bugs I was thinking of. If you don't already have these fixes:
Date: Mon Dec 9 10:37:51 2013 -0800
4370 avoid transmitting holes during zfs send
4371 DMU code clean up
Date: Tue Aug 20 20:11:52 2013 -0800
4047 panic from dbuf_free_range() from dmu_free_object() while
doing zfs receive
Date: Mon Jul 29 10:58:53 2013 -0800
3834 incremental replication of 'holey' file systems is slow
Post by Matthew Ahrens
Oh right. I have made several fixes in this area (restore_free()
being slow). Will send details tomorrow.
--matt
Post by Dan McDonald
I'll be very interested to hear how people's experiences are with this.
I've a customer suffering similarly (but with a 2TB incremental
snapshot-send). We had him use mbuffer already, but to no avail. During
the transfer, I ran this DTrace script:
#!/usr/sbin/dtrace -s
fbt:zfs:restore_*:entry
{
self->start = timestamp;
}
fbt:zfs:restore_*:return
{
@runtime[probefunc] = quantize(timestamp - self->start);
}
profile:::tick-10s
{
}
And I noticed that restore_free() seems to take a long time. Here's a
sample:
0 80473 :tick-10s
restore_read
value ------------- Distribution ------------- count
512 | 0
1024 | 1
32768 | 1
65536 | 1
131072 | 0
restore_write
value ------------- Distribution ------------- count
8192 | 0
131072 | 3
262144 | 1
524288 | 0
restore_free
value ------------- Distribution ------------- count
268435456 | 0
1073741824 | 0
That's .5sec per call for 13 calls. Yike?!
Dan
Schweiss, Chip
2014-04-18 13:47:21 UTC
Permalink
Post by Trey Palmer
http://gtf.org/trey/random/zfs/zfs11_send.png
Trey,

Like your Graphite graphs. Do you have any scripts that you can share
that allow you to push the ZFS stats to Graphite?

You can email me offline if you like.

-Chip
chip (at) innovates.com



Richard Elling
2014-04-17 23:47:30 UTC
Permalink
Liam, I don't think that the commits I mentioned will help you, because your restore_free() is not slow.
However, it looks like it is the receive that is the bottleneck for you. Multiplying out the buckets, in each 10 second period you are spending about 5 seconds in restore_read() and 5 seconds in restore_write(). All of these restore_*() calls are made from a single thread, so that thread is always busy.
restore_read() reads the data from the file descriptor. You can see that half the operations take ~8us, but the other half take 128us up to 32ms. If all the operations took 8us, performance would approximately double. This could be due to network performance or the sender not keeping up. Given that we know the sender can do better, network issues seem more likely. Do you have a mbuffer on the receiving side too? That might be something to try.
In general, you should start with mbuffer on the receiver first and only
add one to the sender if the receiver's mbuffer is always empty.
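(mbuffer's status output, which you get when it is not run with -q,
reports how full the buffer is along with the in/out rates, so this is
easy to check.)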
-- richard
Andrew Gabriel
2014-04-19 14:11:17 UTC
Permalink
Post by Richard Elling
In general, you should start with mbuffer on the receiver first and only
add one to the sender if the receiver's mbuffer is always empty.
-- richard
This is what I've been doing for a number of years (although using a
program of my own, rather than mbuffer).
The buffer size should be the network bandwidth times the transaction
commit interval time (typically 5-10 seconds).

It nicely overcomes the receiver's chunky reading, which appears to do a
gigantic read after each transaction commit. This keeps the network
streaming, rather than an alternating sequence of network, then txg
commit, then network, then txg commit, etc. This effect is most
noticeable when the network bandwidth and the receiver's write bandwidth
are similar.
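For example, at full 10GbE line rate (roughly 1.25 GB/s) a 5-10 second
commit interval works out to a 6-12 GB buffer; at the ~225 MB/s Liam is
currently seeing, roughly 1-2.5 GB would be enough to ride out a commit.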
--
Andrew Gabriel
Dan McDonald
2014-04-17 13:49:19 UTC
Permalink
Thank you! My customer is on OmniOS 151008, which does not include these fixes (but OmniOS will soon enough).

Dan

Sent from my iPhone (typos, autocorrect, and all)
Jim Klimov
2014-04-18 06:41:29 UTC
Permalink
Post by Matthew Ahrens
Found the bugs I was thinking of. If you don't already have these fixes,
Date: Mon Dec 9 10:37:51 2013 -0800
4370 avoid transmitting holes during zfs send
4371 DMU code clean up
Date: Tue Aug 20 20:11:52 2013 -0800
4047 panic from dbuf_free_range() from dmu_free_object() while doing
zfs receive
Date: Mon Jul 29 10:58:53 2013 -0800
3834 incremental replication of 'holey' file systems is slow
Should these be applied to the sender, the receiver or both, to get the desired speedup effect? That is, rebooting a test/backup machine which receives the writes might be a lot easier than kicking the central corporate storage server over with a test kernel ;)
Of course, it all depends, but you get the idea of the possible concerns ;)

And on a side note, is there an easy way to know what source commits or bug fixes are present in the running (installed) kernel - perhaps a git/mq history in a text file shipped as standard, or something like that? If not, would it make sense to add such a file to the build products of illumos-gate?
Thanks,
//Jim
--
Typos courtesy of K-9 Mail on my Samsung Android
Matthew Ahrens
2014-04-18 15:57:41 UTC
Permalink
Post by Jim Klimov
And on a side note, is there an easy way to know what source commits or
bug fixes are present in the running (installed) kernel - perhaps, a git/mq
history in a standardly provided text file or something like that? If not,
would it make sense to add such a file to the build-product of illumos-gate?
I think it would be great to do something like this, and include it in
kernel memory so it's part of the crash dump as well.

As a practical suggestion, how about we take the first 1MB of "git log",
gzip it, and compile it into a kernel variable. Then add a mdb dcmd to
ungzip and print it out. I won't have time to implement this in the near
term; if others think this would be useful and have time, I'd be happy to
mentor / review.
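Roughly what I'm picturing for the build side; this is only a sketch with
made-up file and symbol names, and iflag=fullblock is a GNU dd option, so
the truncation step would need adjusting for an illumos build host:

# grab the first 1MB of history and gzip it
git log | dd iflag=fullblock bs=1024 count=1024 2>/dev/null | gzip -9 > gitlog.gz

# emit a C array that can be compiled and linked into the kernel
(
        echo 'const unsigned char kernel_gitlog_gz[] = {'
        od -An -v -tx1 gitlog.gz | sed 's/ \([0-9a-f][0-9a-f]\)/0x\1,/g'
        echo '};'
        echo "const unsigned int kernel_gitlog_gz_len = $(wc -c < gitlog.gz);"
) > gitlog.c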

--matt



Ira Cooper
2014-04-18 16:07:56 UTC
Permalink
Post by Matthew Ahrens
Post by Jim Klimov
And on a side note, is there an easy way to know what source commits or
bug fixes are present in the running (installed) kernel - perhaps, a git/mq
history in a standardly provided text file or something like that? If not,
would it make sense to add such a file to the build-product of illumos-gate?
I think it would be great to do something like this, and include it kernel
memory so it's part of the crash dump as well.
As a practical suggestion, how about we take the first 1MB of "git log",
gzip it, and compile it into a kernel variable. Then add a mdb dcmd to
ungzip and print it out. I won't have time to implement this in the near
term; if others think this would be useful and have time, I'd be happy to
mentor / review.
git log --oneline - the actual commit text doesn't matter as much as which
commits made it. :)

-Ira



Matthew Ahrens
2014-04-18 16:18:01 UTC
Permalink
Post by Ira Cooper
Post by Matthew Ahrens
Post by Jim Klimov
And on a side note, is there an easy way to know what source commits or
bug fixes are present in the running (installed) kernel - perhaps, a git/mq
history in a standardly provided text file or something like that? If not,
would it make sense to add such a file to the build-product of illumos-gate?
I think it would be great to do something like this, and include it
kernel memory so it's part of the crash dump as well.
As a practical suggestion, how about we take the first 1MB of "git log",
gzip it, and compile it into a kernel variable. Then add a mdb dcmd to
ungzip and print it out. I won't have time to implement this in the near
term; if others think this would be useful and have time, I'd be happy to
mentor / review.
git log --oneline - the actual commit text doesn't matter as much as which
commits made it. :)
For illumos-formatted commit messages, "git log --oneline" includes the
entire actual commit text, just in a harder to read format (all lines
smushed into one). I guess it leaves out the date and the author, both of
which are pretty useful.

Which would you rather read:

commit f7dbdfc7b241e42b135dc9118e41b127cb935483
Author: Marcel Telka <***@nexenta.com>
Date: Tue Jan 21 19:27:05 2014 +0100

4512 kclient(1m) should not depend on /usr/xpg4/bin/grep
Reviewed by: Andy Stormont <***@gmail.com>
Reviewed by: Garrett D'Amore <***@damore.org>
Approved by: Robert Mustacchi <***@joyent.com>

commit 19d32b9ab53d17ac6605971e14c45a5281f8d9bb
Author: Robert Mustacchi <***@joyent.com>
Date: Thu Dec 5 01:26:55 2013 +0000

4493 want siginfo
4494 Make dd show progress when you send INFO/USR1 signals
4495 dd could support O_SYNC and O_DSYNC
Reviewed by: Jerry Jelinek <***@joyent.com>
Reviewed by: Joshua M. Clulow <***@sysmgr.org>
Reviewed by: Richard Lowe <***@richlowe.net>
Reviewed by: Garrett D'Amore <***@damore.org>
Approved by: Garrett D'Amore <***@damore.org>

---- or -----

f7dbdfc 4512 kclient(1m) should not depend on /usr/xpg4/bin/grep Reviewed
by: Andy Stormont <***@gmail.com> Reviewed by: Garrett D'Amore <
***@damore.org> Approved by: Robert Mustacchi <***@joyent.com>
19d32b9 4493 want siginfo 4494 Make dd show progress when you send
INFO/USR1 signals 4495 dd could support O_SYNC and O_DSYNC Reviewed by:
Jerry Jelinek <***@joyent.com> Reviewed by: Joshua M. Clulow <
***@sysmgr.org> Reviewed by: Richard Lowe <***@richlowe.net> Reviewed
by: Garrett D'Amore <***@damore.org> Approved by: Garrett D'Amore <
***@damore.org>

--matt



Dale Ghent
2014-04-18 16:53:43 UTC
Permalink
For illumos-formatted commit messages, "git log --oneline" includes the entire actual commit text, just in a harder to read format (all lines smushed into one). I guess it leaves out the date and the author, both of which are pretty useful.


1MB of commit logs turns out to be pretty small when compressed:

***@osdev2:/code/daleg-omnios-151008/illumos-omnios$ git log | dd iflag=fullblock bs=1024 count=1024 | gzip -c > /tmp/commitlog
1024+0 records in
1024+0 records out
1048576 bytes (1.0 MB) copied, 0.0835981 s, 12.5 MB/s

***@osdev2:/code/daleg-omnios-151008/illumos-omnios$ ls -lh /tmp/commitlog
-rw-r--r-- 1 daleg users 330K Apr 18 16:48 /tmp/commitlog

Amusingly, 'git log --oneline' creates a larger product:

***@osdev2:/code/daleg-omnios-151008/illumos-omnios$ git log --oneline | dd iflag=fullblock bs=1024 count=1024 | gzip -c > /tmp/commitlog
1024+0 records in
1024+0 records out
1048576 bytes (1.0 MB) copied, 0.191727 s, 5.5 MB/s

***@osdev2:/code/daleg-omnios-151008/illumos-omnios$ ls -lh /tmp/commitlog
-rw-r--r-- 1 daleg users 398K Apr 18 16:50 /tmp/commitlog

/dale
Josef 'Jeff' Sipek
2014-04-18 18:23:02 UTC
Permalink
Post by Matthew Ahrens
Post by Ira Cooper
Post by Matthew Ahrens
Post by Jim Klimov
And on a side note, is there an easy way to know what source commits or
bug fixes are present in the running (installed) kernel - perhaps, a git/mq
history in a standardly provided text file or something like that? If not,
would it make sense to add such a file to the build-product of illumos-gate?
I think it would be great to do something like this, and include it
kernel memory so it's part of the crash dump as well.
As a practical suggestion, how about we take the first 1MB of "git log",
gzip it, and compile it into a kernel variable. Then add a mdb dcmd to
ungzip and print it out. I won't have time to implement this in the near
term; if others think this would be useful and have time, I'd be happy to
mentor / review.
git log --oneline - the actual commit text doesn't matter as much as which
commits made it. :)
For illumos-formatted commit messages, "git log --oneline" includes the
entire actual commit text, just in a harder to read format (all lines
smushed into one). I guess it leaves out the date and the author, both of
which are pretty useful.
Why such a heavy-handed approach? Why not just stash the output from
`git describe --dirty --tags` into a const global and call it descriptive
enough?

As far as mdb is concerned, you could teach ::status about it.
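Generating that at build time is basically a one-liner (a sketch; pick
whatever symbol and file names you like):

# regenerate on every build and link the resulting object into the kernel
echo "const char kernel_git_describe[] = \"$(git describe --dirty --tags)\";" > git_describe.c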

Jeff.
--
We have joy, we have fun, we have Linux on a Sun...
Matthew Ahrens
2014-04-18 18:31:56 UTC
Permalink
On Fri, Apr 18, 2014 at 11:23 AM, Josef 'Jeff' Sipek
Why such a heavy-handed approach? Why not just stash the output from
`git describe --dirty --tags` into a const global and calling it descriptive
enough?
I'm not a git expert so I definitely appreciate suggestions as to what data
to include. Running that command on my workspace doesn't give me anything
that I recognize. From this how would you figure out if a given bug is
fixed?

$ git describe --dirty --tags
codecomplete/4.0-207-g7774749

(Don't know what 4.0 is? I don't know the release scheme of every illumos
distro either.)

--matt



Josef 'Jeff' Sipek
2014-04-18 18:41:15 UTC
Permalink
Post by Matthew Ahrens
Post by Josef 'Jeff' Sipek
Why such a heavy-handed approach? Why not just stash the output from
`git describe --dirty --tags` into a const global and calling it descriptive
enough?
I'm not a git expert so I definitely appreciate suggestions as to what data
to include. Running that command on my workspace doesn't give me anything
that I recognize. From this how would you figure out if a given bug is
fixed?
$ git describe --dirty --tags
codecomplete/4.0-207-g7774749
(Don't know what 4.0 is? I don't know the release scheme of every illumos
distro either.)
Note the -gXXXXXX blob. Those are the first n chars of the git commit.

Based on what you pasted, I know that the code you have checked out is:

(1) 207 commits past codecomplete/4.0
(2) the commit hash starts with '7774749'

That combined with the knowledge of which repo the code came from (which
your scheme kinda requires too unless *every* fix you care about happens to
fit in the 1MB) points me to a specific revision.

For example given 20140320-15-g83e627e from Joyent's SmartOS repo:

***@meili:~/illumos/illumos-joyent$ git show 20140320-15-g83e627e
commit 83e627e6cb26a200b4fce0b7aad1480202411103
Merge: 89b7e5b dff8ce8
Author: Keith M Wesolowski <***@foobazco.org>
Date: Mon Mar 24 21:02:12 2014 +0000

[illumos-gate merge]

commit dff8ce8858f30b8b43711766bd0f637548b8d700
3379 Duplicate assignment in uts/common/cpr/cpr_stat.c
commit 652fb50dec8e8b074b60a3c82d00248a2aeb5eb9
4653 net hooks registered with HH_BEFORE or HH_AFTER hints create invalid hint_value kstats
commit 4948216cdd0ccee7b9a4fd433ccab571afbb99e9
4679 want workaround for Intel erratum BT81
commit 56b8f71e3a910fbd2820f6841b40bfd85f9673c2
4601 memory leak in ILB daemon on startup
4602 memory leak in ILB daemon on import-config
4668 Memory leak in ilbd' new_req: getpeerucred() allocation isn't released at all
commit 61f9f3e6dc0a66ec5c243562765c1b4a3297e8a4
4688 getlogin_r shouldn't clobber memory


Jeff.
--
Once you have their hardware. Never give it back.
(The First Rule of Hardware Acquisition)
Josef 'Jeff' Sipek
2014-04-18 18:45:28 UTC
Permalink
I suppose if you don't want to deal with branch/tag names, you can always
just get the raw hash:

$ git rev-parse HEAD
83e627e6cb26a200b4fce0b7aad1480202411103

Jeff.
--
Real Programmers consider "what you see is what you get" to be just as bad a
concept in Text Editors as it is in women. No, the Real Programmer wants a
"you asked for it, you got it" text editor -- complicated, cryptic,
powerful, unforgiving, dangerous.
Matthew Ahrens
2014-04-18 18:48:41 UTC
Permalink
On Fri, Apr 18, 2014 at 11:45 AM, Josef 'Jeff' Sipek
Post by Josef 'Jeff' Sipek
I suppose if you don't want to deal with branch/tag names, you can always
$ git rev-parse HEAD
83e627e6cb26a200b4fce0b7aad1480202411103
So I need to know every possible illumos distro and find their git repos
to figure out what bugs are fixed? I guess it's better than nothing, but
it seems inferior to my proposal.

--matt



Garrett D'Amore
2014-04-18 18:52:29 UTC
Permalink
This underscores a need we've had in the illumos community for some time: basic release engineering, including numbered versions.

I will probably be investing a certain amount of effort into this in the coming weeks. It's long past time we did this.

-- 
Garrett D'Amore
Sent with Airmail

On April 18, 2014 at 11:48:55 AM, Matthew Ahrens (***@lists.illumos.org) wrote:

On Fri, Apr 18, 2014 at 11:45 AM, Josef 'Jeff' Sipek <***@josefsipek.net> wrote:

I suppose if you don't want to deal with branch/tag names, you can always
just get the raw hash:

$ git rev-parse HEAD
83e627e6cb26a200b4fce0b7aad1480202411103

So I need to know every possible illumos distro find their git repos to figure out what bugs are fixed?  I guess it's better than nothing but seems inferior to my proposal.

--matt
Josef 'Jeff' Sipek
2014-04-18 18:55:36 UTC
Permalink
Post by Garrett D'Amore
basic release engineering including numbered versions.
Are you talking about just a periodic version bump? Or do you have
something more complicated in mind?
Post by Garrett D'Amore
I will probably be investing a certain amount of effort into this in the
coming weeks. It's long past time we did this.
Agreed, thanks.

Jeff.
--
What is the difference between Mechanical Engineers and Civil Engineers?
Mechanical Engineers build weapons, Civil Engineers build targets.
Matthew Ahrens
2014-04-18 18:58:03 UTC
Permalink
On Fri, Apr 18, 2014 at 11:41 AM, Josef 'Jeff' Sipek
Note the -gXXXXXX blob. Those are the first n chars of the git commit.
(1) 207 commits past codecomplete/4.0
(2) the commit hash starts with '7774749'
That combined with the knowledge of which repo the code came from (which
your scheme kinda requires too unless *every* fix you care about happens to
fit in the 1MB) points me to a specific revision.
That's true.

As far as I can tell, my proposal gives a superset of what anyone else has
suggested, and I haven't seen any arguments why that is bad. I think it's
better because I don't have to know all possible git repos. If I see that
a given build has all the illumos commits from April 2012 - March 2014 (1MB
worth), I can reasonably guess that it has all the commits from before that
too.

But you do have a point, so I'll modify my suggestion: just include the
entire "git log", gzipped, in the running kernel. It's only 1.5MB.

All that said, I suppose the tie goes to whoever actually implements
something like this.

--matt



Josef 'Jeff' Sipek
2014-04-18 19:48:50 UTC
Permalink
Post by Matthew Ahrens
As far as I can tell, my proposal gives a superset of what anyone else has
suggested, and I haven't seen any arguments why that is bad. I think it's
better because I don't have to know all possible git repos. If I see that
a given build has all the illumos commits from April 2012 - March 2014 (1MB
worth), I can reasonably guess that it has all the commits from before that
too.
-rwxr-xr-x 2 root sys 1.8M Apr 16 11:41 /kernel/fs/amd64/zfs

Stuffing in git-log output will grow the binary size non-trivially.
Post by Matthew Ahrens
But you do have a point, so I'll modify my suggestion: just include the
entire "git log", gzipped, in the running kernel. It's only 1.5MB.
Great... what have I done? :)
Post by Matthew Ahrens
All that said, I suppose the tie goes to whoever actually implements
something like this.
Agreed. :)

Jeff.
--
If I have trouble installing Linux, something is wrong. Very wrong.
- Linus Torvalds
Matthew Ahrens
2014-04-18 20:02:21 UTC
Permalink
On Fri, Apr 18, 2014 at 12:48 PM, Josef 'Jeff' Sipek
Post by Josef 'Jeff' Sipek
-rwxr-xr-x 2 root sys 1.8M Apr 16 11:41 /kernel/fs/amd64/zfs
Stuffing in git-log output will grow the binary size non-trivially.
To be fair, git log covers the entire kernel (and more), not just ZFS. I
think it would be more appropriate to compare to the amount of memory used
for all (more or less) permanently allocated kernel memory, which is more
on the scale of dozens of MB (genunix is 4.5MB; /kernel/drv/amd64 is 17MB,
though not all of it is loaded).

--matt


