Discussion:
lzjb improvement
Xin Li
2013-10-11 02:39:08 UTC
Permalink
Hi,

FYI, I think you may be interested in one of our (FreeBSD) recent change
to lzjb decompression.

FreeBSD r256132:

Improve lzjb decompress performance by reorganizing the code
to tighten the copy loop.

Submitted by: Denis Ahrens <denis h3q com>

Should I file a ticket for this for tracking?

Cheers,
--
Xin LI <***@delphij.net> https://www.delphij.net/
FreeBSD - The Power to Serve! Live free or die
Richard Yao
2013-10-11 03:18:53 UTC
Permalink
Thanks for letting us know about this. I have a few comments:

1. We could eliminate a branch entirely by doing this:

mlen = MIN(d_end - dst, mlen);
while (--mlen >= 0)
*dst++ = *cpy++

2. This optimization could become a generic optimization pass in the
system compiler. If anyone wants to be helpful, I would suggest sending
this patch to the developers of both LLVM and GCC.

3. Have any benchmarks been done to quantify the performance
improvement? That should not be required to merge this, but if any
benchmarks have been done, I would like to see them for my own curiosity.
Post by Xin Li
Hi,
FYI, I think you may be interested in one of our (FreeBSD) recent change
to lzjb decompression.
Improve lzjb decompress performance by reorganizing the code
to tighten the copy loop.
Received: from [64.62.153.212] (helo=anubis.delphij.net)
by node002.open-zfs.net
with esmtp (HybridCluster distributed mail proxy)
Submitted by: Denis Ahrens <denis h3q com>
Should I file a ticket for this for tracking?
Cheers,
_______________________________________________
developer mailing list
http://lists.open-zfs.org/mailman/listinfo/developer
-------------------------------------------
illumos-zfs
Archives: https://www.listbox.com/member/archive/182191/=now
RSS Feed: https://www.listbox.com/member/archive/rss/182191/23047029-187a0c8d
Modify Your Subscription: https://www.listbox.com/member/?member_id=23047029&id_secret=23047029-2e85923f
Powered by Listbox: http://www.listbox.com
Xin Li
2013-10-11 03:29:47 UTC
Permalink
mlen = MIN(d_end - dst, mlen); while (--mlen >= 0) *dst++ = *cpy++
I don't think this eliminates the branching as MIN is usually a macro
that expands to a > b ? b : a.
2. This optimization could become a generic optimization pass in
the system compiler. If anyone wants to be helpful, I would
suggest sending this patch to the developers of both LLVM and GCC.
I would ask a LLVM developers for that. If you know any GCC
developer, it would be great to let them know as well.
3. Have any benchmarks been done to quantify the performance
improvement? That should not be required to merge this, but if any
benchmarks have been done, I would like to see them for my own curiosity.
Well to tell the truth I didn't benchmarked the code because looking
at the assembler, the change is quite clearly an improvement (smaller
code inside the loop and the actual working code is quite simple).
The author did some benchmarks and claims to have 6% to 10%
improvement over the previous version of code with various different
types of data, but it is still interesting if someone would do a
scientific benchmark.
Post by Xin Li
Hi,
FYI, I think you may be interested in one of our (FreeBSD)
recent change to lzjb decompression.
Improve lzjb decompress performance by reorganizing the code to
tighten the copy loop.
Received: from [64.62.153.212] (helo=anubis.delphij.net) by
node002.open-zfs.net with esmtp (HybridCluster distributed mail
02:39:10 -0000 Submitted by: Denis Ahrens <denis h3q com>
Should I file a ticket for this for tracking?
Cheers,
_______________________________________________ developer
http://lists.open-zfs.org/mailman/listinfo/developer
- --
Xin LI <***@delphij.net> https://www.delphij.net/
FreeBSD - The Power to Serve! Live free or die
Richard Yao
2013-10-11 03:38:40 UTC
Permalink
Post by Xin Li
mlen = MIN(d_end - dst, mlen); while (--mlen >= 0) *dst++ = *cpy++
I don't think this eliminates the branching as MIN is usually a macro
that expands to a > b ? b : a.
My mistake. I was thinking of generic swap routines. I do think that
using the MIN() macro is more readable though.
Post by Xin Li
2. This optimization could become a generic optimization pass in
the system compiler. If anyone wants to be helpful, I would
suggest sending this patch to the developers of both LLVM and GCC.
I would ask a LLVM developers for that. If you know any GCC
developer, it would be great to let them know as well.
I will ping them on freenode.
Post by Xin Li
3. Have any benchmarks been done to quantify the performance
improvement? That should not be required to merge this, but if any
benchmarks have been done, I would like to see them for my own curiosity.
Well to tell the truth I didn't benchmarked the code because looking
at the assembler, the change is quite clearly an improvement (smaller
code inside the loop and the actual working code is quite simple).
The author did some benchmarks and claims to have 6% to 10%
improvement over the previous version of code with various different
types of data, but it is still interesting if someone would do a
scientific benchmark.
I agree that it is an obvious improvement.
Post by Xin Li
Post by Xin Li
Hi,
FYI, I think you may be interested in one of our (FreeBSD)
recent change to lzjb decompression.
Improve lzjb decompress performance by reorganizing the code to
tighten the copy loop.
Received: from [64.62.153.212] (helo=anubis.delphij.net) by
node002.open-zfs.net with esmtp (HybridCluster distributed mail
02:39:10 -0000 Submitted by: Denis Ahrens <denis h3q com>
Should I file a ticket for this for tracking?
Cheers,
_______________________________________________ developer
http://lists.open-zfs.org/mailman/listinfo/developer
-------------------------------------------
illumos-zfs
Archives: https://www.listbox.com/member/archive/182191/=now
RSS Feed: https://www.listbox.com/member/archive/rss/182191/24010604-91e32bd2
Modify Your Subscription: https://www.listbox.com/member/?&
Powered by Listbox: http://www.listbox.com
-------------------------------------------
illumos-zfs
Archives: https://www.listbox.com/member/archive/182191/=now
RSS Feed: https://www.listbox.com/member/archive/rss/182191/23047029-187a0c8d
Modify Your Subscription: https://www.listbox.com/member/?member_id=23047029&id_secret=23047029-2e85923f
Powered by Listbox: http://www.listbox.com
Richard Yao
2013-10-11 03:44:58 UTC
Permalink
Post by Richard Yao
Post by Xin Li
mlen = MIN(d_end - dst, mlen); while (--mlen >= 0) *dst++ = *cpy++
I don't think this eliminates the branching as MIN is usually a macro
that expands to a > b ? b : a.
My mistake. I was thinking of generic swap routines. I do think that
using the MIN() macro is more readable though.
On second thought, I was right the first time. It is possible to do this
without branching:

#define MIN(x, y) ((y) ^ (((x) ^ (y)) & -((x) < (y))))
#define MIN(x, y) ((x) ^ (((x) ^ (y)) & -((x) < (y))))

http://graphics.stanford.edu/~seander/bithacks.html#IntegerMinOrMax

This makes MIN(d_end - dst, mlen) look inefficient, but a proper
optimizing compiler should store the result of d_end - dst in a register
to avoid doing the subtraction 3 times.




-------------------------------------------
illumos-zfs
Archives: https://www.listbox.com/member/archive/182191/=now
RSS Feed: https://www.listbox.com/member/archive/rss/182191/23047029-187a0c8d
Modify Your Subscription: https://www.listbox.com/member/?member_id=23047029&id_secret=23047029-2e85923f
Powered by Listbox: http://www.listbox.com
Richard Yao
2013-10-11 03:51:28 UTC
Permalink
Post by Richard Yao
Post by Richard Yao
Post by Xin Li
mlen = MIN(d_end - dst, mlen); while (--mlen >= 0) *dst++ = *cpy++
I don't think this eliminates the branching as MIN is usually a macro
that expands to a > b ? b : a.
My mistake. I was thinking of generic swap routines. I do think that
using the MIN() macro is more readable though.
On second thought, I was right the first time. It is possible to do this
#define MIN(x, y) ((y) ^ (((x) ^ (y)) & -((x) < (y))))
#define MIN(x, y) ((x) ^ (((x) ^ (y)) & -((x) < (y))))
http://graphics.stanford.edu/~seander/bithacks.html#IntegerMinOrMax
This makes MIN(d_end - dst, mlen) look inefficient, but a proper
optimizing compiler should store the result of d_end - dst in a register
to avoid doing the subtraction 3 times.
In hindsight, I should have left this for tomorrow before I replied. It
is close to midnight and it shows in the number of errors in what I write.

In the above email, the second MIN macro was actually a MAX macro and
that this trick only works with signed integers. Also, a good optimizing
compiler should be able to recognize MIN/MAX and generate something like
this when it is beneficial; that makes it unnecessary to worry too much
about this.

With that said, using the MIN macro as I suggested makes the code more
readable.




-------------------------------------------
illumos-zfs
Archives: https://www.listbox.com/member/archive/182191/=now
RSS Feed: https://www.listbox.com/member/archive/rss/182191/23047029-187a0c8d
Modify Your Subscription: https://www.listbox.com/member/?member_id=23047029&id_secret=23047029-2e85923f
Powered by Listbox: http://www.listbox.com
Matthew Ahrens
2013-10-11 03:53:34 UTC
Permalink
Post by Richard Yao
Post by Richard Yao
Post by Xin Li
mlen = MIN(d_end - dst, mlen); while (--mlen >= 0) *dst++ = *cpy++
I don't think this eliminates the branching as MIN is usually a macro
that expands to a > b ? b : a.
My mistake. I was thinking of generic swap routines. I do think that
using the MIN() macro is more readable though.
On second thought, I was right the first time. It is possible to do this
#define MIN(x, y) ((y) ^ (((x) ^ (y)) & -((x) < (y))))
#define MIN(x, y) ((x) ^ (((x) ^ (y)) & -((x) < (y))))
http://graphics.stanford.edu/~seander/bithacks.html#IntegerMinOrMax
This makes MIN(d_end - dst, mlen) look inefficient, but a proper
optimizing compiler should store the result of d_end - dst in a register
to avoid doing the subtraction 3 times.
This is complicated enough that I would definitely want to see the
performance analysis before introducing this trick. Even according to the
webpage: "On some rare machines ... [it] might be faster than the obvious
approach ... On some machines ... there may be no advantage. ... gcc
produced the same code on a Pentium as the obvious solution "

--matt



-------------------------------------------
illumos-zfs
Archives: https://www.listbox.com/member/archive/182191/=now
RSS Feed: https://www.listbox.com/member/archive/rss/182191/23047029-187a0c8d
Modify Your Subscription: https://www.listbox.com/member/?member_id=23047029&id_secret=23047029-2e85923f
Powered by Listbox: http://www.listbox.com
Eitan Adler
2013-10-11 04:32:45 UTC
Permalink
Post by Matthew Ahrens
Post by Richard Yao
Post by Richard Yao
Post by Xin Li
mlen = MIN(d_end - dst, mlen); while (--mlen >= 0) *dst++ = *cpy++
I don't think this eliminates the branching as MIN is usually a macro
that expands to a > b ? b : a.
My mistake. I was thinking of generic swap routines. I do think that
using the MIN() macro is more readable though.
On second thought, I was right the first time. It is possible to do this
#define MIN(x, y) ((y) ^ (((x) ^ (y)) & -((x) < (y))))
#define MIN(x, y) ((x) ^ (((x) ^ (y)) & -((x) < (y))))
http://graphics.stanford.edu/~seander/bithacks.html#IntegerMinOrMax
This makes MIN(d_end - dst, mlen) look inefficient, but a proper
optimizing compiler should store the result of d_end - dst in a register
to avoid doing the subtraction 3 times.
This is complicated enough that I would definitely want to see the
performance analysis before introducing this trick. Even according to the
webpage: "On some rare machines ... [it] might be faster than the obvious
approach ... On some machines ... there may be no advantage. ... gcc
produced the same code on a Pentium as the obvious solution "
Measured on a Intel(R) Core(TM) i7-3630QM CPU @ 2.40GHz"

Units are rdtsc diffs of 10000000 iterations across random arrays
(different in each attempt).

#define MIN_A(x, y) ((y) ^ (((x) ^ (y)) & -((x) < (y))))
#define MIN_B(a, b) a > b ? b : a

x mina
+ minb
N Min Max Median Avg Stddev
x 656 58 760 87 112.24543 67.992472
+ 656 58 254 87 84.39939 12.143
Difference at 95.0% confidence
-27.846 +/- 5.28546
-24.8082% +/- 4.70884%
(Student's t, pooled s = 48.8387)

I probably made some mistake with my benchmarking but I don't think
the numbers lean toward the bit twiddly one.
--
Eitan Adler
Eitan Adler
2013-10-11 03:55:14 UTC
Permalink
Post by Richard Yao
Post by Richard Yao
Post by Xin Li
mlen = MIN(d_end - dst, mlen); while (--mlen >= 0) *dst++ = *cpy++
I don't think this eliminates the branching as MIN is usually a macro
that expands to a > b ? b : a.
My mistake. I was thinking of generic swap routines. I do think that
using the MIN() macro is more readable though.
On second thought, I was right the first time. It is possible to do this
#define MIN(x, y) ((y) ^ (((x) ^ (y)) & -((x) < (y))))
#define MIN(x, y) ((x) ^ (((x) ^ (y)) & -((x) < (y))))
This does not do what you say it does.
Post by Richard Yao
http://graphics.stanford.edu/~seander/bithacks.html#IntegerMinOrMax
This makes MIN(d_end - dst, mlen) look inefficient, but a proper
optimizing compiler should store the result of d_end - dst in a register
to avoid doing the subtraction 3 times.
Please let the compiler do this work! It is possible that the icache
pollution from multiple xors, ands and a cmp will be worse than a
single, well predicted branch. The compiler may also have other
optimizations up its sleeve.

Do you have any evidence that this "optimization" is useful?
--
Eitan Adler
Richard Yao
2013-10-11 04:15:05 UTC
Permalink
It is late here and I think that I will refrain from sending emails this
late without some sort of stimulant in my system in the future. At this
point, I think it is safe to say that the stream of corrections has had
a similar effect and I am fully awake now.

I think using MIN() makes the code more readable. The point about
eliminating the branch was an error on my part. My reference to bit
twiddling hacks was meant more to say that it could be done than that it
was a good idea.

With that said, the compiler should be smart enough to do this trick on
its own when the CPU benefits from it and I am content to let it do that.
Post by Eitan Adler
Post by Richard Yao
Post by Richard Yao
Post by Xin Li
mlen = MIN(d_end - dst, mlen); while (--mlen >= 0) *dst++ = *cpy++
I don't think this eliminates the branching as MIN is usually a macro
that expands to a > b ? b : a.
My mistake. I was thinking of generic swap routines. I do think that
using the MIN() macro is more readable though.
On second thought, I was right the first time. It is possible to do this
#define MIN(x, y) ((y) ^ (((x) ^ (y)) & -((x) < (y))))
#define MIN(x, y) ((x) ^ (((x) ^ (y)) & -((x) < (y))))
This does not do what you say it does.
Post by Richard Yao
http://graphics.stanford.edu/~seander/bithacks.html#IntegerMinOrMax
This makes MIN(d_end - dst, mlen) look inefficient, but a proper
optimizing compiler should store the result of d_end - dst in a register
to avoid doing the subtraction 3 times.
Please let the compiler do this work! It is possible that the icache
pollution from multiple xors, ands and a cmp will be worse than a
single, well predicted branch. The compiler may also have other
optimizations up its sleeve.
Do you have any evidence that this "optimization" is useful?
-------------------------------------------
illumos-zfs
Archives: https://www.listbox.com/member/archive/182191/=now
RSS Feed: https://www.listbox.com/member/archive/rss/182191/23047029-187a0c8d
Modify Your Subscription: https://www.listbox.com/member/?member_id=23047029&id_secret=23047029-2e85923f
Powered by Listbox: http://www.listbox.com
Joerg Schilling
2013-10-11 10:12:11 UTC
Permalink
Post by Richard Yao
On second thought, I was right the first time. It is possible to do this
#define MIN(x, y) ((y) ^ (((x) ^ (y)) & -((x) < (y))))
#define MIN(x, y) ((x) ^ (((x) ^ (y)) & -((x) < (y))))
I am not convinced becase ((x) < (y)) is a comparison followed by a conditional
1 or 0 assignement.

Jörg
--
EMail:joerg-3Qm2Liu6aU2sY6utFDHCwYAplN+***@public.gmane.org (home) Jörg Schilling D-13353 Berlin
js-CFLBMwTPW48UNGrzBIF7/***@public.gmane.org (uni)
joerg.schilling-8LS2qeF34IpklNlQbfROjRvVK+***@public.gmane.org (work) Blog: http://schily.blogspot.com/
URL: http://cdrecord.berlios.de/private/ ftp://ftp.berlios.de/pub/schily
Richard Yao
2013-10-13 18:03:52 UTC
Permalink
Post by Xin Li
Post by Richard Yao
2. This optimization could become a generic optimization pass in
the system compiler. If anyone wants to be helpful, I would
suggest sending this patch to the developers of both LLVM and GCC.
I would ask a LLVM developers for that. If you know any GCC
developer, it would be great to let them know as well.
I have filed a bug in the GCC bug tracker:

http://gcc.gnu.org/bugzilla/show_bug.cgi?id=58715

I also sent information about this to the Ekopath compiler developers at
Pathscale. I have attached the sample files that I sent to Pathscale.



-------------------------------------------
illumos-zfs
Archives: https://www.listbox.com/member/archive/182191/=now
RSS Feed: https://www.listbox.com/member/archive/rss/182191/23047029-187a0c8d
Modify Your Subscription: https://www.listbox.com/member/?member_id=23047029&id_secret=23047029-2e85923f
Powered by Listbox: http://www.listbox.com

Loading...