From: Greg Ungerer <gerg@linux-m68k.org>
To: George Spelvin <linux@horizon.com>,
geert@linux-m68k.org, linux-m68k@lists.linux-m68k.org
Subject: Re: [RFC PATCH 1/2] arch/m68k/lib/mulsi3.S: Optimize]
Date: Thu, 12 May 2016 23:14:10 +1000 [thread overview]
Message-ID: <573481A2.4040504@linux-m68k.org> (raw)
In-Reply-To: <20160512080429.31536.qmail@ns.horizon.com>
Hi George,
On 12/05/16 18:04, George Spelvin wrote:
> Thanks for the response!
>
>> That mulsi3.S code was taken directly from gcc's lb1sf68.S (and that
>> was lb1sf68.asm in older versions of gcc). It has always had that
>> coldfire conditional code in it. Not sure why it was in there though.
>
> I went digging through a lot of ColdFire documentation to see if
> there were ever any that needed it, and apparently it's been in there
> since the beginning.
>
> It did occur to me that perhaps someone wanted to support tne
> intersection of all 68k family CPUs, so that meant only MULU.W
> and ADD.L. But I can't see a gcc option to ask for that.
>
>
> Mostly I just thought that saving a few cycles was a nice win.
Yep, that is always nice.
After a few test builds I am pretty sure that we never need or
use mulsi3 on ColdFire. So I think we can move to not even bothering
to compile it for that case.
> This is tangential to some work I'm doing on the hash functions.
> in <linux/hash.h>, there are hash_32(x, bits) and hash_64(x, bits)
> which basically do "return (x * CONSTANT) >> (32 - bits)", or
> 64 - bits for the 64-bit version.
>
> The CONSTANTs were chosen to be efficient when computed using
> shift-and-add sequences, but that made them less effective hashes.
>
> The 64-bit constant was recently revised to avoid a bad case found when
> hashing page-aligned pointers, since it's a pointless optimization;
> all 64-bit processors have fast multipliers. But 32 bits was left alone
> for fear of breaking performance.
>
> After some research (partly by searching for __mulsi3), I found that m68k
> is the only platform that doesn't have a fast multiplier. SPARC used
> to be in that boat too, but SPARCv7 syupport was dropped a few years
> ago.
>
> That led to me optimizing the multiply-by-constant code for m68k.
>
>
>
> Do you have any interest in the matter? Basically, the options are
> to optimize the implementation of the multiplication, or come up with
> a different hash function that works better in that environment.
It looks like only the original m68000 targets will benefit from
a faster mulsi3. But there is nothing wrong with making it faster
for that case in general.
I don't have any feel for how much of a performance problem the
hash functions are on m68000 though. Keep in mind that those are
all non-MMU platforms.
Regards
Greg
> The current multiply by 0x9e370001 compiles to, depending on compiler
> flags, either a call to __mulsi3 or (omitting prologue & epilogue,
> since it's designed as an in-line function):
>
> move.l %d0,%d1 4
> lsl.l #6,%d1 20 24
> move.l %d1,%d2 4 28
> lsl.l #3,%d2 14 42
> sub.l %d1,%d2 8 50
> sub.l %d0,%d2 8 58
> move.l %d2,%d1 4 62
> lsl.l #3,%d1 14 76
> sub.l %d2,%d1 8 84
> lsl.l #3,%d1 14 98
> add.l %d0,%d1 8 106
> swap %d1 4 110
> clr.w %d1 4 114
> sub.l %d1,%d0 8 122
>
> (The columns on the right are cycle counts.)
>
> Tangent: I'm surprised I can't coax it into the equivalent:
> move.w %d0,%d1 4
> mulu.w #0x9e37,%d1 58 62
> add.l %d1,%d0 8 70
>
> Anyway, the problem is the low Hamming weight of the low 16 bits,
> which has to be solved somehow. The "more random" high 16 bits of the
> multiplier are only multiplied by the low 16 bits of the argument.
> The high 16 bits of the argument are just multiplied by 1. If the
> argument is a page-aligned pointer, with 12 low-order bits zero, we have
> very poor mixing.
>
> Using the better constant 0x61C88647 (both this and the previous one are
> chosen to be close to phi, the golden ratio (sqrt(5)-1)/2, but this is
> the negative), gcc struggles.
>
> I searched for an efficient shift-and-add sequence,
> and came up with the 68000-optimized:
>
> /* Multiply by 0x61C88647 */
> unsigned hash_32(unsigned x)
> {
> unsigned a, b, c;
>
> a = b = c = x;
> c <<= 19;
> a += c; /* a = (x << 19) + x */
> b <<= 9;
> b += a; /* b = (x << 9) + a */
> c <<= 4;
> c += b; /* c = (x << 23) + b */
> b <<= 5;
> b += c; /* (b << 5) + c */
> b <<= 3;
> b += a; /* ((b << 5) + c << 3) + a */
> b <<= 3;
> b -= c; /* (((b << 5) + c << 3) + a << 3) - c */
> return x;
> }
>
> which gcc turns into:
> move.l %d1,%d2 4
> lsl.w #3,%d2 12 16
> swap %d2 4 20
> clr.w %d2 4 24
> move.l %d1,%a1 4 28
> add.l %d2,%a1 8 36
> moveq #9,%d0 4 40
> lsl.l %d0,%d1 26 66
> add.l %a1,%d1 8 74
> lsl.l #4,%d2 16 90
> move.l %d1,%a0 4 94
> add.l %d2,%a0 8 102
> lsl.l #5,%d1 18 120
> move.l %a0,%d0 4 124
> add.l %d1,%d0 8 132
> move.l %d0,%d1 4 136
> lsl.l #3,%d1 14 150
> move.l %a1,%d0 4 154
> add.l %d1,%d0 8 162
> lsl.l #3,%d0 14 176
> sub.l %a0,%d0 8 184
>
> I can shave a few cycles with hand-optimized code, but the result is
> barely faster than the much smaller multiply-based:
>
> ; y = #k * x, with z as a temp
> move.w #klow, y ; 8
> move.w x, z ; 4 12
> swap x ; 4 16
> mulu.w y, x ; 52 68 (Using klow = 0x8647)
> mulu.w z, y ; 54 122 (Assuming 50% ones)
> mulu.w #khigh, z ; 54 176 (Using khigh = 0x61C8)
> swap y ; 4 180
> add.w x, y ; 4 184
> add.w z, y ; 4 188
> swap y ; 4 192
> ;; 12 words
>
> Another way to divide it up is to compute xlow * khigh with a multiply,
> and x * klow with shifts and adds. x * 0x8647 can be evauated as:
>
> unsigned a = (x << 9) + x;
> unsigned b = (x << 1) + a;
> unsigned c = (b << 1) + (a << 6) + a;
> return c + ((uint16_t)(x * 0x61C8) << 16);
>
> If I write it in the following form:
> register unsigned a, b;
>
> b = x << 1;
> asm("" : "=g" (a) : "0" (b)); /* "a = b, DAMN IT" */
> a <<= 8;
> a += x;
> b += a;
> b += b;
> b += a;
> a <<= 6;
> return a + b + ((uint16_t)(x * 0x61c8) << 16);
>
> The gcc can be coaxed into generating the desired code.
> (The "asm" is the only way I found to stop GCC from collapsing
> "(x << 1) << 8" to "x << 9", which is slower.)
>
> baz:
> move.l 4(%sp),%d1
> move.l %d1,%a0 4
> add.l %d1,%a0 8 12
> move.l %a0,%d0 4 16
> lsl.l #8,%d0 24 40
> add.l %d1,%d0 8 48
> add.l %d0,%a0 8 56
> muls.w #25032,%d1 56 112
> swap %d1 4 116
> clr.w %d1 4 120
> add.l %d0,%d1 8 128
> add.l %a0,%d1 8 136
> add.l %a0,%d1 8 144
> lsl.l #6,%d0 20 164
> add.l %d1,%d0 8 172
> rts
>
> Well, that's a *little* bit faster.
>
>
>
> Alternatively, I could just invent a whole different hash function for
> the case. One idea is to use two, rather than three, 16-bit multiplies.
> Split the input into two 16-bit halves, multiply them by separate
> constants, sum the results, and then swap the halves.
>
> move.w x, y 4
> swap x 4 8
> mulu.w #K1, y 58 66
> mulu.w #K2, x 58 124
> add.l y, x 8 132
> swap x 4 136
>
> The swap is important; hash_32() is documented as putting the "most mixed"
> bits in the most significant bits, as hash tables are indexed by the most
> significant bits.
>
> If you have any thoughts on the matter, they'd definitely be appreciated.
>
next prev parent reply other threads:[~2016-05-12 13:14 UTC|newest]
Thread overview: 11+ messages / expand[flat|nested] mbox.gz Atom feed top
2016-05-11 10:24 [RFC PATCH 1/2] arch/m68k/lib/mulsi3.S: Optimize] George Spelvin
2016-05-11 12:38 ` Greg Ungerer
2016-05-12 8:04 ` George Spelvin
2016-05-12 8:35 ` Andreas Schwab
2016-05-12 13:14 ` Greg Ungerer [this message]
2016-05-12 12:46 ` Greg Ungerer
2016-05-12 20:52 ` George Spelvin
2016-05-13 1:07 ` Greg Ungerer
2016-05-13 2:36 ` George Spelvin
2016-05-13 6:45 ` Greg Ungerer
2016-05-13 9:02 ` George Spelvin
Reply instructions:
You may reply publicly to this message via plain-text email
using any one of the following methods:
* Save the following mbox file, import it into your mail client,
and reply-to-all from there: mbox
Avoid top-posting and favor interleaved quoting:
https://en.wikipedia.org/wiki/Posting_style#Interleaved_style
* Reply using the --to, --cc, and --in-reply-to
switches of git-send-email(1):
git send-email \
--in-reply-to=573481A2.4040504@linux-m68k.org \
--to=gerg@linux-m68k.org \
--cc=geert@linux-m68k.org \
--cc=linux-m68k@lists.linux-m68k.org \
--cc=linux@horizon.com \
/path/to/YOUR_REPLY
https://kernel.org/pub/software/scm/git/docs/git-send-email.html
* If your mail client supports setting the In-Reply-To header
via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line
before the message body.
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox