Re: [RFC PATCH 1/2] arch/m68k/lib/mulsi3.S: Optimize]

All of lore.kernel.org
 help / color / mirror / Atom feed

From: Greg Ungerer <gerg@linux-m68k.org>
To: George Spelvin <linux@horizon.com>,
	geert@linux-m68k.org, linux-m68k@lists.linux-m68k.org
Subject: Re: [RFC PATCH 1/2] arch/m68k/lib/mulsi3.S: Optimize]
Date: Thu, 12 May 2016 23:14:10 +1000	[thread overview]
Message-ID: <573481A2.4040504@linux-m68k.org> (raw)
In-Reply-To: <20160512080429.31536.qmail@ns.horizon.com>

Hi George,

On 12/05/16 18:04, George Spelvin wrote:
> Thanks for the response!
>
>> That mulsi3.S code was taken directly from gcc's lb1sf68.S (and that
>> was lb1sf68.asm in older versions of gcc). It has always had that
>> coldfire conditional code in it. Not sure why it was in there though.
>
> I went digging through a lot of ColdFire documentation to see if
> there were ever any that needed it, and apparently it's been in there
> since the beginning.
>
> It did occur to me that perhaps someone wanted to support tne
> intersection of all 68k family CPUs, so that meant only MULU.W
> and ADD.L.  But I can't see a gcc option to ask for that.
>
>
> Mostly I just thought that saving a few cycles was a nice win.

Yep, that is always nice.

After a few test builds I am pretty sure that we never need or
use mulsi3 on ColdFire. So I think we can move to not even bothering
to compile it for that case.


> This is tangential to some work I'm doing on the hash functions.
> in <linux/hash.h>, there are hash_32(x, bits) and hash_64(x, bits)
> which basically do "return (x * CONSTANT) >> (32 - bits)", or
> 64 - bits for the 64-bit version.
>
> The CONSTANTs were chosen to be efficient when computed using
> shift-and-add sequences, but that made them less effective hashes.
>
> The 64-bit constant was recently revised to avoid a bad case found when
> hashing page-aligned pointers, since it's a pointless optimization;
> all 64-bit processors have fast multipliers.  But 32 bits was left alone
> for fear of breaking performance.
>
> After some research (partly by searching for __mulsi3), I found that m68k
> is the only platform that doesn't have a fast multiplier.  SPARC used
> to be in that boat too, but SPARCv7 syupport was dropped a few years
> ago.
>
> That led to me optimizing the multiply-by-constant code for m68k.
>
>
>
> Do you have any interest in the matter?  Basically, the options are
> to optimize the implementation of the multiplication, or come up with
> a different hash function that works better in that environment.

It looks like only the original m68000 targets will benefit from
a faster mulsi3. But there is nothing wrong with making it faster
for that case in general.

I don't have any feel for how much of a performance problem the
hash functions are on m68000 though. Keep in mind that those are
all non-MMU platforms.

Regards
Greg


> The current multiply by 0x9e370001 compiles to, depending on compiler
> flags, either a call to __mulsi3 or (omitting prologue & epilogue,
> since it's designed as an in-line function):
>
> 	move.l	%d0,%d1		 4
> 	lsl.l	#6,%d1		20  24
> 	move.l	%d1,%d2		 4  28
> 	lsl.l	#3,%d2		14  42
> 	sub.l	%d1,%d2		 8  50
> 	sub.l	%d0,%d2		 8  58
> 	move.l	%d2,%d1		 4  62
> 	lsl.l	#3,%d1		14  76
> 	sub.l	%d2,%d1		 8  84
> 	lsl.l	#3,%d1		14  98
> 	add.l	%d0,%d1		 8 106
> 	swap	%d1		 4 110
> 	clr.w	%d1		 4 114
> 	sub.l	%d1,%d0		 8 122
>
> (The columns on the right are cycle counts.)
>
> Tangent: I'm surprised I can't coax it into the equivalent:
> 	move.w %d0,%d1		 4
> 	mulu.w #0x9e37,%d1	58  62
> 	add.l %d1,%d0		 8  70
>
> Anyway, the problem is the low Hamming weight of the low 16 bits,
> which has to be solved somehow.  The "more random" high 16 bits of the
> multiplier are only multiplied by the low 16 bits of the argument.
> The high 16 bits of the argument are just multiplied by 1.  If the
> argument is a page-aligned pointer, with 12 low-order bits zero, we have
> very poor mixing.
>
> Using the better constant 0x61C88647 (both this and the previous one are
> chosen to be close to phi, the golden ratio (sqrt(5)-1)/2, but this is
> the negative), gcc struggles.
>
> I searched for an efficient shift-and-add sequence,
> and came up with the 68000-optimized:
>
> /* Multiply by 0x61C88647 */
> unsigned hash_32(unsigned x)
> {
> 	unsigned a, b, c;
>
> 	a = b = c = x;	
> 	c <<= 19;
> 	a += c;		/* a = (x << 19) + x */
> 	b <<= 9;
> 	b += a;		/* b = (x << 9) + a */
> 	c <<= 4;
> 	c += b;		/* c = (x << 23) + b */
> 	b <<= 5;
> 	b += c;		/* (b << 5) + c */
> 	b <<= 3;
> 	b += a;		/* ((b << 5) + c << 3) + a */
> 	b <<= 3;
> 	b -= c;		/* (((b << 5) + c << 3) + a << 3) - c */
> 	return x;
> }
>
> which gcc turns into:
> 	move.l	%d1,%d2		 4
> 	lsl.w	#3,%d2		12  16
> 	swap	%d2		 4  20
> 	clr.w	%d2		 4  24
> 	move.l	%d1,%a1		 4  28
> 	add.l	%d2,%a1		 8  36
> 	moveq	#9,%d0		 4  40
> 	lsl.l	%d0,%d1		26  66
> 	add.l	%a1,%d1		 8  74
> 	lsl.l	#4,%d2		16  90
> 	move.l	%d1,%a0		 4  94
> 	add.l	%d2,%a0		 8 102
> 	lsl.l	#5,%d1		18 120
> 	move.l	%a0,%d0		 4 124
> 	add.l	%d1,%d0		 8 132
> 	move.l	%d0,%d1		 4 136
> 	lsl.l	#3,%d1		14 150
> 	move.l	%a1,%d0		 4 154
> 	add.l	%d1,%d0		 8 162
> 	lsl.l	#3,%d0		14 176
> 	sub.l	%a0,%d0		 8 184
>
> I can shave a few cycles with hand-optimized code, but the result is
> barely faster than the much smaller multiply-based:
>
> 	; y = #k * x, with z as a temp
> 	move.w  #klow, y        ;  8
> 	move.w  x, z            ;  4     12
> 	swap    x               ;  4     16
> 	mulu.w  y, x            ; 52     68	(Using klow = 0x8647)
> 	mulu.w  z, y            ; 54    122	(Assuming 50% ones)
> 	mulu.w  #khigh, z       ; 54    176	(Using khigh = 0x61C8)
> 	swap    y               ;  4    180
> 	add.w   x, y            ;  4    184
> 	add.w   z, y            ;  4    188
> 	swap    y               ;  4    192
> 	;; 12 words
>
> Another way to divide it up is to compute xlow * khigh with a multiply,
> and x * klow with shifts and adds.  x * 0x8647 can be evauated as:
>
> 	unsigned a = (x << 9) + x;
> 	unsigned b = (x << 1) + a;
> 	unsigned c = (b << 1) + (a << 6) + a;
> 	return c + ((uint16_t)(x * 0x61C8) << 16);
>
> If I write it in the following form:
> 	register unsigned a, b;
>
> 	b = x << 1;
> 	asm("" : "=g" (a) : "0" (b));	/* "a = b, DAMN IT" */
> 	a <<= 8;
> 	a += x;
> 	b += a;
> 	b += b;
> 	b += a;
> 	a <<= 6;
> 	return a + b + ((uint16_t)(x * 0x61c8) << 16);
>
> The gcc can be coaxed into generating the desired code.
> (The "asm" is the only way I found to stop GCC from collapsing
> "(x << 1) << 8" to "x << 9", which is slower.)
>
> baz:
> 	move.l 4(%sp),%d1
> 	move.l %d1,%a0		 4
> 	add.l %d1,%a0		 8  12
> 	move.l %a0,%d0		 4  16
> 	lsl.l #8,%d0		24  40
> 	add.l %d1,%d0		 8  48
> 	add.l %d0,%a0		 8  56
> 	muls.w #25032,%d1	56 112
> 	swap %d1		 4 116
> 	clr.w %d1		 4 120
> 	add.l %d0,%d1		 8 128
> 	add.l %a0,%d1		 8 136
> 	add.l %a0,%d1		 8 144
> 	lsl.l #6,%d0		20 164
> 	add.l %d1,%d0		 8 172
> 	rts
>
> Well, that's a *little* bit faster.
>
>
>
> Alternatively, I could just invent a whole different hash function for
> the case.  One idea is to use two, rather than three, 16-bit multiplies.
> Split the input into two 16-bit halves, multiply them by separate
> constants, sum the results, and then swap the halves.
>
> 	move.w	x, y		 4
> 	swap	x		 4   8
> 	mulu.w	#K1, y		58  66
> 	mulu.w	#K2, x		58 124
> 	add.l	y, x		 8 132
> 	swap	x		 4 136
>
> The swap is important; hash_32() is documented as putting the "most mixed"
> bits in the most significant bits, as hash tables are indexed by the most
> significant bits.
>
> If you have any thoughts on the matter, they'd definitely be appreciated.
>

next prev parent reply	other threads:[~2016-05-12 13:14 UTC|newest]

Thread overview: 11+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2016-05-11 10:24 [RFC PATCH 1/2] arch/m68k/lib/mulsi3.S: Optimize] George Spelvin
2016-05-11 12:38 ` Greg Ungerer
2016-05-12  8:04   ` George Spelvin
2016-05-12  8:35     ` Andreas Schwab
2016-05-12 13:14     ` Greg Ungerer [this message]
2016-05-12 12:46 ` Greg Ungerer
2016-05-12 20:52   ` George Spelvin
2016-05-13  1:07     ` Greg Ungerer
2016-05-13  2:36       ` George Spelvin
2016-05-13  6:45         ` Greg Ungerer
2016-05-13  9:02           ` George Spelvin

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=573481A2.4040504@linux-m68k.org \
    --to=gerg@linux-m68k.org \
    --cc=geert@linux-m68k.org \
    --cc=linux-m68k@lists.linux-m68k.org \
    --cc=linux@horizon.com \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link

Be sure your reply has a Subject: header at the top and a blank line before the message body.

This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.