public inbox for linux-m68k@lists.linux-m68k.org
From: Greg Ungerer <gerg@linux-m68k.org>
To: George Spelvin <linux@horizon.com>,
	geert@linux-m68k.org, linux-m68k@lists.linux-m68k.org
Subject: Re: [RFC PATCH 1/2] arch/m68k/lib/mulsi3.S: Optimize]
Date: Thu, 12 May 2016 23:14:10 +1000	[thread overview]
Message-ID: <573481A2.4040504@linux-m68k.org> (raw)
In-Reply-To: <20160512080429.31536.qmail@ns.horizon.com>

Hi George,

On 12/05/16 18:04, George Spelvin wrote:
> Thanks for the response!
>
>> That mulsi3.S code was taken directly from gcc's lb1sf68.S (and that
>> was lb1sf68.asm in older versions of gcc). It has always had that
>> coldfire conditional code in it. Not sure why it was in there though.
>
> I went digging through a lot of ColdFire documentation to see if
> there were ever any that needed it, and apparently it's been in there
> since the beginning.
>
> It did occur to me that perhaps someone wanted to support the
> intersection of all 68k family CPUs, so that meant only MULU.W
> and ADD.L.  But I can't see a gcc option to ask for that.
>
>
> Mostly I just thought that saving a few cycles was a nice win.

Yep, that is always nice.

After a few test builds I am pretty sure that we never need or
use mulsi3 on ColdFire. So I think we can simply stop compiling
it for that case.


> This is tangential to some work I'm doing on the hash functions
> in <linux/hash.h>: there are hash_32(x, bits) and hash_64(x, bits),
> which basically do "return (x * CONSTANT) >> (32 - bits)" (or
> 64 - bits for the 64-bit version).
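>
> For reference, the pattern reduces to something like the following
> sketch (hash_32_sketch is an assumed name, using the 0x9e370001
> multiplier that comes up later in this mail; not the exact
> <linux/hash.h> code):

```c
#include <stdint.h>

/* Sketch of the hash_32() pattern described above; hash_32_sketch is
 * an assumed name, not the exact <linux/hash.h> code. */
static inline uint32_t hash_32_sketch(uint32_t val, unsigned bits)
{
	/* Multiply by the constant, keep the top 'bits' bits. */
	return (uint32_t)(val * 0x9e370001u) >> (32 - bits);
}
```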
>
> The CONSTANTs were chosen to be efficient when computed using
> shift-and-add sequences, but that made them less effective hashes.
>
> The 64-bit constant was recently revised to avoid a bad case found
> when hashing page-aligned pointers; keeping it cheap for shift-and-add
> was a pointless optimization anyway, since all 64-bit processors have
> fast multipliers.  But the 32-bit constant was left alone for fear of
> hurting performance.
>
> After some research (partly by searching for __mulsi3), I found that m68k
> is the only platform that doesn't have a fast multiplier.  SPARC used
> to be in that boat too, but SPARCv7 support was dropped a few years
> ago.
>
> That led to me optimizing the multiply-by-constant code for m68k.
>
>
>
> Do you have any interest in the matter?  Basically, the options are
> to optimize the implementation of the multiplication, or come up with
> a different hash function that works better in that environment.

It looks like only the original m68000 targets will benefit from
a faster mulsi3. But there is nothing wrong with making it faster
for that case in general.

I don't have any feel for how much of a performance problem the
hash functions are on m68000 though. Keep in mind that those are
all non-MMU platforms.

Regards
Greg


> The current multiply by 0x9e370001 compiles to, depending on compiler
> flags, either a call to __mulsi3 or (omitting prologue & epilogue,
> since it's designed as an in-line function):
>
> 	move.l	%d0,%d1		 4
> 	lsl.l	#6,%d1		20  24
> 	move.l	%d1,%d2		 4  28
> 	lsl.l	#3,%d2		14  42
> 	sub.l	%d1,%d2		 8  50
> 	sub.l	%d0,%d2		 8  58
> 	move.l	%d2,%d1		 4  62
> 	lsl.l	#3,%d1		14  76
> 	sub.l	%d2,%d1		 8  84
> 	lsl.l	#3,%d1		14  98
> 	add.l	%d0,%d1		 8 106
> 	swap	%d1		 4 110
> 	clr.w	%d1		 4 114
> 	sub.l	%d1,%d0		 8 122
>
> (The columns on the right are cycle counts.)
>
> Tangent: I'm surprised I can't coax it into the equivalent
> (the swap/clr.w keep only the low word of the product, moved into
> the high half):
> 	move.w	%d0,%d1		 4
> 	mulu.w	#0x9e37,%d1	58  62
> 	swap	%d1		 4  66
> 	clr.w	%d1		 4  70
> 	add.l	%d1,%d0		 8  78
>
> Anyway, the problem is the low Hamming weight of the low 16 bits,
> which has to be solved somehow.  The "more random" high 16 bits of the
> multiplier are only multiplied by the low 16 bits of the argument.
> The high 16 bits of the argument are just multiplied by 1.  If the
> argument is a page-aligned pointer, with 12 low-order bits zero, we have
> very poor mixing.
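>
> As an extreme illustration of this (my numbers, easy to reproduce on
> a host):

```c
#include <stdint.h>

/* Extreme case of the weakness described above: the multiplier
 * 0x9e370001 has low 16 bits 0x0001, so any input whose low half is
 * zero is mapped to itself mod 2^32, i.e. no mixing at all. */
static uint32_t mul_phi32(uint32_t x)
{
	return x * 0x9e370001u;
}
```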
>
> Using the better constant 0x61C88647 (both this and the previous one are
> chosen to be close to phi, the golden ratio (sqrt(5)-1)/2, but this is
> the negative), gcc struggles.
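>
> (The relationship between the two constants is easy to verify on any
> host: 0x9E3779B9 is floor(2^32 * (sqrt(5)-1)/2), and 0x61C88647 is its
> negation mod 2^32. A quick sanity check, not part of the argument:)

```c
#include <stdint.h>

/* Check on the constants above: 0x9E3779B9 = floor(2^32 * (sqrt(5)-1)/2),
 * the usual 32-bit golden-ratio constant, and 0x61C88647 is its
 * two's-complement negation mod 2^32. */
static uint32_t golden_ratio_32(void)
{
	/* 0.6180339887498949 = (sqrt(5)-1)/2, written as a literal to
	 * avoid pulling in libm. */
	return (uint32_t)(4294967296.0 * 0.6180339887498949);
}
```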
>
> I searched for an efficient shift-and-add sequence,
> and came up with the 68000-optimized:
>
> /* Multiply by 0x61C88647 */
> unsigned hash_32(unsigned x)
> {
> 	unsigned a, b, c;
>
> 	a = b = c = x;
> 	c <<= 19;
> 	a += c;		/* a = x * 0x00080001 */
> 	b <<= 9;
> 	b += a;		/* b = x * 0x00080201 */
> 	c <<= 4;
> 	c += b;		/* c = x * 0x00880201 */
> 	b <<= 5;
> 	b += c;		/* b = x * 0x01884221 */
> 	b <<= 3;
> 	b += a;		/* b = x * 0x0C4A1109 */
> 	b <<= 3;
> 	b -= c;		/* b = x * 0x61C88647 */
> 	return b;
> }
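>
> (As a sanity check, added here rather than in the original mail, the
> sequence can be replayed on a host and compared against a plain
> 32-bit multiply:)

```c
#include <stdint.h>

/* Re-run of the shift-and-add sequence above, to confirm it matches a
 * plain 32-bit multiply by 0x61C88647 for any input. */
static uint32_t hash_32_shift_add(uint32_t x)
{
	uint32_t a, b, c;

	a = b = c = x;
	c <<= 19; a += c;	/* a = x * 0x00080001 */
	b <<= 9;  b += a;	/* b = x * 0x00080201 */
	c <<= 4;  c += b;	/* c = x * 0x00880201 */
	b <<= 5;  b += c;	/* b = x * 0x01884221 */
	b <<= 3;  b += a;	/* b = x * 0x0C4A1109 */
	b <<= 3;  b -= c;	/* b = x * 0x61C88647 */
	return b;
}
```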
>
> which gcc turns into:
> 	move.l	%d1,%d2		 4
> 	lsl.w	#3,%d2		12  16
> 	swap	%d2		 4  20
> 	clr.w	%d2		 4  24
> 	move.l	%d1,%a1		 4  28
> 	add.l	%d2,%a1		 8  36
> 	moveq	#9,%d0		 4  40
> 	lsl.l	%d0,%d1		26  66
> 	add.l	%a1,%d1		 8  74
> 	lsl.l	#4,%d2		16  90
> 	move.l	%d1,%a0		 4  94
> 	add.l	%d2,%a0		 8 102
> 	lsl.l	#5,%d1		18 120
> 	move.l	%a0,%d0		 4 124
> 	add.l	%d1,%d0		 8 132
> 	move.l	%d0,%d1		 4 136
> 	lsl.l	#3,%d1		14 150
> 	move.l	%a1,%d0		 4 154
> 	add.l	%d1,%d0		 8 162
> 	lsl.l	#3,%d0		14 176
> 	sub.l	%a0,%d0		 8 184
>
> I can shave a few cycles with hand-optimized code, but the result is
> barely faster than the much smaller multiply-based:
>
> 	; y = #k * x, with z as a temp
> 	move.w  #klow, y        ;  8
> 	move.w  x, z            ;  4     12
> 	swap    x               ;  4     16
> 	mulu.w  y, x            ; 52     68	(Using klow = 0x8647)
> 	mulu.w  z, y            ; 54    122	(Assuming 50% ones)
> 	mulu.w  #khigh, z       ; 54    176	(Using khigh = 0x61C8)
> 	swap    y               ;  4    180
> 	add.w   x, y            ;  4    184
> 	add.w   z, y            ;  4    188
> 	swap    y               ;  4    192
> 	;; 12 words
>
> Another way to divide it up is to compute xlow * khigh with a multiply,
> and x * klow with shifts and adds.  x * 0x8647 can be evaluated as:
>
> 	unsigned a = (x << 9) + x;
> 	unsigned b = (x << 1) + a;
> 	unsigned c = (b << 1) + (a << 6) + a;
> 	return c + ((uint16_t)(x * 0x61C8) << 16);
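>
> (A quick host-side check of that identity; hash_split is an assumed
> name:)

```c
#include <stdint.h>

/* Check of the split above: shift-and-add gives x * 0x8647 exactly, and
 * the truncated 16x16 product supplies the x * 0x61C80000 part mod 2^32. */
static uint32_t hash_split(uint32_t x)
{
	uint32_t a = (x << 9) + x;		/* x * 0x0201 */
	uint32_t b = (x << 1) + a;		/* x * 0x0203 */
	uint32_t c = (b << 1) + (a << 6) + a;	/* x * 0x8647 */

	return c + ((uint32_t)(uint16_t)(x * 0x61C8) << 16);
}
```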
>
> If I write it in the following form:
> 	register unsigned a, b;
>
> 	b = x << 1;
> 	asm("" : "=g" (a) : "0" (b));	/* "a = b, DAMN IT" */
> 	a <<= 8;
> 	a += x;
> 	b += a;
> 	b += b;
> 	b += a;
> 	a <<= 6;
> 	return a + b + ((uint16_t)(x * 0x61c8) << 16);
>
> then gcc can be coaxed into generating the desired code.
> (The "asm" is the only way I found to stop GCC from collapsing
> "(x << 1) << 8" to "x << 9", which is slower.)
>
> baz:
> 	move.l 4(%sp),%d1
> 	move.l %d1,%a0		 4
> 	add.l %d1,%a0		 8  12
> 	move.l %a0,%d0		 4  16
> 	lsl.l #8,%d0		24  40
> 	add.l %d1,%d0		 8  48
> 	add.l %d0,%a0		 8  56
> 	muls.w #25032,%d1	56 112
> 	swap %d1		 4 116
> 	clr.w %d1		 4 120
> 	add.l %d0,%d1		 8 128
> 	add.l %a0,%d1		 8 136
> 	add.l %a0,%d1		 8 144
> 	lsl.l #6,%d0		20 164
> 	add.l %d1,%d0		 8 172
> 	rts
>
> Well, that's a *little* bit faster.
>
>
>
> Alternatively, I could just invent a whole different hash function for
> the case.  One idea is to use two, rather than three, 16-bit multiplies.
> Split the input into two 16-bit halves, multiply them by separate
> constants, sum the results, and then swap the halves.
>
> 	move.w	x, y		 4
> 	swap	x		 4   8
> 	mulu.w	#K1, y		58  66
> 	mulu.w	#K2, x		58 124
> 	add.l	y, x		 8 132
> 	swap	x		 4 136
>
> The swap is important; hash_32() is documented as putting the "most mixed"
> bits in the most significant bits, as hash tables are indexed by the most
> significant bits.
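>
> In C the idea sketches out as follows; hash_two_mults, k1, and k2 are
> placeholders, since I haven't chosen concrete constants:

```c
#include <stdint.h>

/* Sketch of the two-multiply idea above: hash each 16-bit half with its
 * own constant, sum, then swap halves so the best-mixed bits end up in
 * the most significant positions.  k1/k2 are placeholder constants. */
static uint32_t hash_two_mults(uint32_t x, uint16_t k1, uint16_t k2)
{
	uint32_t lo = (uint32_t)(uint16_t)x * k1;	/* 16x16 -> 32, like mulu.w */
	uint32_t hi = (uint32_t)(x >> 16) * k2;
	uint32_t sum = lo + hi;

	return (sum >> 16) | (sum << 16);		/* swap halves */
}
```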
>
> If you have any thoughts on the matter, they'd definitely be appreciated.
>

Thread overview: 11+ messages
2016-05-11 10:24 [RFC PATCH 1/2] arch/m68k/lib/mulsi3.S: Optimize] George Spelvin
2016-05-11 12:38 ` Greg Ungerer
2016-05-12  8:04   ` George Spelvin
2016-05-12  8:35     ` Andreas Schwab
2016-05-12 13:14     ` Greg Ungerer [this message]
2016-05-12 12:46 ` Greg Ungerer
2016-05-12 20:52   ` George Spelvin
2016-05-13  1:07     ` Greg Ungerer
2016-05-13  2:36       ` George Spelvin
2016-05-13  6:45         ` Greg Ungerer
2016-05-13  9:02           ` George Spelvin