lzjb improvement

Thanks for letting us know about this. I have a few comments:

1. We could eliminate a branch entirely by doing this:

mlen = MIN(d_end - dst, mlen);
while (--mlen >= 0)
*dst++ = *cpy++

2. This optimization could become a generic optimization pass in the
system compiler. If anyone wants to be helpful, I would suggest sending
this patch to the developers of both LLVM and GCC.

3. Have any benchmarks been done to quantify the performance
improvement? That should not be required to merge this, but if any
benchmarks have been done, I would like to see them for my own curiosity.

Post by Xin Li
Hi,
FYI, I think you may be interested in one of our (FreeBSD) recent change
to lzjb decompression.
Improve lzjb decompress performance by reorganizing the code
to tighten the copy loop.
Received: from [64.62.153.212] (helo=anubis.delphij.net)
by node002.open-zfs.net
with esmtp (HybridCluster distributed mail proxy)
Submitted by: Denis Ahrens <denis h3q com>
Should I file a ticket for this for tracking?
Cheers,
_______________________________________________
developer mailing list
http://lists.open-zfs.org/mailman/listinfo/developer

-------------------------------------------
illumos-zfs
Archives: https://www.listbox.com/member/archive/182191/=now
RSS Feed: https://www.listbox.com/member/archive/rss/182191/23047029-187a0c8d
Modify Your Subscription: https://www.listbox.com/member/?member_id=23047029&id_secret=23047029-2e85923f
Powered by Listbox: http://www.listbox.com

Xin Li

2013-10-11 03:29:47 UTC

mlen = MIN(d_end - dst, mlen); while (--mlen >= 0) *dst++ = *cpy++

I don't think this eliminates the branching as MIN is usually a macro
that expands to a > b ? b : a.

2. This optimization could become a generic optimization pass in
the system compiler. If anyone wants to be helpful, I would
suggest sending this patch to the developers of both LLVM and GCC.

I would ask a LLVM developers for that. If you know any GCC
developer, it would be great to let them know as well.

3. Have any benchmarks been done to quantify the performance
improvement? That should not be required to merge this, but if any
benchmarks have been done, I would like to see them for my own curiosity.

Well to tell the truth I didn't benchmarked the code because looking
at the assembler, the change is quite clearly an improvement (smaller
code inside the loop and the actual working code is quite simple).
The author did some benchmarks and claims to have 6% to 10%
improvement over the previous version of code with various different
types of data, but it is still interesting if someone would do a
scientific benchmark.

Post by Xin Li
Hi,
FYI, I think you may be interested in one of our (FreeBSD)
recent change to lzjb decompression.
Improve lzjb decompress performance by reorganizing the code to
tighten the copy loop.
Received: from [64.62.153.212] (helo=anubis.delphij.net) by
node002.open-zfs.net with esmtp (HybridCluster distributed mail
02:39:10 -0000 Submitted by: Denis Ahrens <denis h3q com>
Should I file a ticket for this for tracking?
Cheers,
_______________________________________________ developer
http://lists.open-zfs.org/mailman/listinfo/developer

- --
Xin LI <***@delphij.net> https://www.delphij.net/
FreeBSD - The Power to Serve! Live free or die

Richard Yao

2013-10-11 03:38:40 UTC

mlen = MIN(d_end - dst, mlen); while (--mlen >= 0) *dst++ = *cpy++

I don't think this eliminates the branching as MIN is usually a macro
that expands to a > b ? b : a.

My mistake. I was thinking of generic swap routines. I do think that
using the MIN() macro is more readable though.

2. This optimization could become a generic optimization pass in
the system compiler. If anyone wants to be helpful, I would
suggest sending this patch to the developers of both LLVM and GCC.

I would ask a LLVM developers for that. If you know any GCC
developer, it would be great to let them know as well.

I will ping them on freenode.

I agree that it is an obvious improvement.

Post by Xin Li
Hi,
FYI, I think you may be interested in one of our (FreeBSD)
recent change to lzjb decompression.
Improve lzjb decompress performance by reorganizing the code to
tighten the copy loop.
Received: from [64.62.153.212] (helo=anubis.delphij.net) by
node002.open-zfs.net with esmtp (HybridCluster distributed mail
02:39:10 -0000 Submitted by: Denis Ahrens <denis h3q com>
Should I file a ticket for this for tracking?
Cheers,
_______________________________________________ developer
http://lists.open-zfs.org/mailman/listinfo/developer

-------------------------------------------
illumos-zfs
Archives: https://www.listbox.com/member/archive/182191/=now
RSS Feed: https://www.listbox.com/member/archive/rss/182191/24010604-91e32bd2
Modify Your Subscription: https://www.listbox.com/member/?&
Powered by Listbox: http://www.listbox.com

Richard Yao

2013-10-11 03:44:58 UTC

mlen = MIN(d_end - dst, mlen); while (--mlen >= 0) *dst++ = *cpy++

I don't think this eliminates the branching as MIN is usually a macro
that expands to a > b ? b : a.

My mistake. I was thinking of generic swap routines. I do think that
using the MIN() macro is more readable though.

On second thought, I was right the first time. It is possible to do this
without branching:

#define MIN(x, y) ((y) ^ (((x) ^ (y)) & -((x) < (y))))
#define MIN(x, y) ((x) ^ (((x) ^ (y)) & -((x) < (y))))

http://graphics.stanford.edu/~seander/bithacks.html#IntegerMinOrMax

This makes MIN(d_end - dst, mlen) look inefficient, but a proper
optimizing compiler should store the result of d_end - dst in a register
to avoid doing the subtraction 3 times.

-------------------------------------------
illumos-zfs
Archives: https://www.listbox.com/member/archive/182191/=now
RSS Feed: https://www.listbox.com/member/archive/rss/182191/23047029-187a0c8d
Modify Your Subscription: https://www.listbox.com/member/?member_id=23047029&id_secret=23047029-2e85923f
Powered by Listbox: http://www.listbox.com

Richard Yao

2013-10-11 03:51:28 UTC

mlen = MIN(d_end - dst, mlen); while (--mlen >= 0) *dst++ = *cpy++

I don't think this eliminates the branching as MIN is usually a macro
that expands to a > b ? b : a.

My mistake. I was thinking of generic swap routines. I do think that
using the MIN() macro is more readable though.

On second thought, I was right the first time. It is possible to do this
#define MIN(x, y) ((y) ^ (((x) ^ (y)) & -((x) < (y))))
#define MIN(x, y) ((x) ^ (((x) ^ (y)) & -((x) < (y))))
http://graphics.stanford.edu/~seander/bithacks.html#IntegerMinOrMax
This makes MIN(d_end - dst, mlen) look inefficient, but a proper
optimizing compiler should store the result of d_end - dst in a register
to avoid doing the subtraction 3 times.

In hindsight, I should have left this for tomorrow before I replied. It
is close to midnight and it shows in the number of errors in what I write.

In the above email, the second MIN macro was actually a MAX macro and
that this trick only works with signed integers. Also, a good optimizing
compiler should be able to recognize MIN/MAX and generate something like
this when it is beneficial; that makes it unnecessary to worry too much
about this.

With that said, using the MIN macro as I suggested makes the code more
readable.

-------------------------------------------
illumos-zfs
Archives: https://www.listbox.com/member/archive/182191/=now
RSS Feed: https://www.listbox.com/member/archive/rss/182191/23047029-187a0c8d
Modify Your Subscription: https://www.listbox.com/member/?member_id=23047029&id_secret=23047029-2e85923f
Powered by Listbox: http://www.listbox.com

Matthew Ahrens

2013-10-11 03:53:34 UTC

mlen = MIN(d_end - dst, mlen); while (--mlen >= 0) *dst++ = *cpy++

I don't think this eliminates the branching as MIN is usually a macro
that expands to a > b ? b : a.

My mistake. I was thinking of generic swap routines. I do think that
using the MIN() macro is more readable though.

This is complicated enough that I would definitely want to see the
performance analysis before introducing this trick. Even according to the
webpage: "On some rare machines ... [it] might be faster than the obvious
approach ... On some machines ... there may be no advantage. ... gcc
produced the same code on a Pentium as the obvious solution "

--matt

-------------------------------------------
illumos-zfs
Archives: https://www.listbox.com/member/archive/182191/=now
RSS Feed: https://www.listbox.com/member/archive/rss/182191/23047029-187a0c8d
Modify Your Subscription: https://www.listbox.com/member/?member_id=23047029&id_secret=23047029-2e85923f
Powered by Listbox: http://www.listbox.com

Eitan Adler

2013-10-11 04:32:45 UTC

Post by Matthew Ahrens

mlen = MIN(d_end - dst, mlen); while (--mlen >= 0) *dst++ = *cpy++

I don't think this eliminates the branching as MIN is usually a macro
that expands to a > b ? b : a.

My mistake. I was thinking of generic swap routines. I do think that
using the MIN() macro is more readable though.

Measured on a Intel(R) Core(TM) i7-3630QM CPU @ 2.40GHz"

Units are rdtsc diffs of 10000000 iterations across random arrays
(different in each attempt).

#define MIN_A(x, y) ((y) ^ (((x) ^ (y)) & -((x) < (y))))
#define MIN_B(a, b) a > b ? b : a

x mina
+ minb
N Min Max Median Avg Stddev
x 656 58 760 87 112.24543 67.992472
+ 656 58 254 87 84.39939 12.143
Difference at 95.0% confidence
-27.846 +/- 5.28546
-24.8082% +/- 4.70884%
(Student's t, pooled s = 48.8387)

I probably made some mistake with my benchmarking but I don't think
the numbers lean toward the bit twiddly one.

--
Eitan Adler

Eitan Adler

2013-10-11 03:55:14 UTC

mlen = MIN(d_end - dst, mlen); while (--mlen >= 0) *dst++ = *cpy++

I don't think this eliminates the branching as MIN is usually a macro
that expands to a > b ? b : a.

My mistake. I was thinking of generic swap routines. I do think that
using the MIN() macro is more readable though.

On second thought, I was right the first time. It is possible to do this
#define MIN(x, y) ((y) ^ (((x) ^ (y)) & -((x) < (y))))
#define MIN(x, y) ((x) ^ (((x) ^ (y)) & -((x) < (y))))

This does not do what you say it does.

Post by Richard Yao
http://graphics.stanford.edu/~seander/bithacks.html#IntegerMinOrMax
This makes MIN(d_end - dst, mlen) look inefficient, but a proper
optimizing compiler should store the result of d_end - dst in a register
to avoid doing the subtraction 3 times.

Please let the compiler do this work! It is possible that the icache
pollution from multiple xors, ands and a cmp will be worse than a
single, well predicted branch. The compiler may also have other
optimizations up its sleeve.

Do you have any evidence that this "optimization" is useful?

--
Eitan Adler

Richard Yao

2013-10-11 04:15:05 UTC

It is late here and I think that I will refrain from sending emails this
late without some sort of stimulant in my system in the future. At this
point, I think it is safe to say that the stream of corrections has had
a similar effect and I am fully awake now.

I think using MIN() makes the code more readable. The point about
eliminating the branch was an error on my part. My reference to bit
twiddling hacks was meant more to say that it could be done than that it
was a good idea.

With that said, the compiler should be smart enough to do this trick on
its own when the CPU benefits from it and I am content to let it do that.

Post by Eitan Adler

mlen = MIN(d_end - dst, mlen); while (--mlen >= 0) *dst++ = *cpy++

I don't think this eliminates the branching as MIN is usually a macro
that expands to a > b ? b : a.

My mistake. I was thinking of generic swap routines. I do think that
using the MIN() macro is more readable though.

On second thought, I was right the first time. It is possible to do this
#define MIN(x, y) ((y) ^ (((x) ^ (y)) & -((x) < (y))))
#define MIN(x, y) ((x) ^ (((x) ^ (y)) & -((x) < (y))))

This does not do what you say it does.

Please let the compiler do this work! It is possible that the icache
pollution from multiple xors, ands and a cmp will be worse than a
single, well predicted branch. The compiler may also have other
optimizations up its sleeve.
Do you have any evidence that this "optimization" is useful?

Joerg Schilling

2013-10-11 10:12:11 UTC

Post by Richard Yao
On second thought, I was right the first time. It is possible to do this
#define MIN(x, y) ((y) ^ (((x) ^ (y)) & -((x) < (y))))
#define MIN(x, y) ((x) ^ (((x) ^ (y)) & -((x) < (y))))

I am not convinced becase ((x) < (y)) is a comparison followed by a conditional
1 or 0 assignement.

Jörg

--
EMail:joerg-3Qm2Liu6aU2sY6utFDHCwYAplN+***@public.gmane.org (home) Jörg Schilling D-13353 Berlin
js-CFLBMwTPW48UNGrzBIF7/***@public.gmane.org (uni)
joerg.schilling-8LS2qeF34IpklNlQbfROjRvVK+***@public.gmane.org (work) Blog: http://schily.blogspot.com/
URL: http://cdrecord.berlios.de/private/ ftp://ftp.berlios.de/pub/schily

Richard Yao

2013-10-13 18:03:52 UTC