Re: [PATCH] __div64_32: implement division by multiplication for 32-bit arches

public inbox for linux-kernel@vger.kernel.org
 help / color / mirror / Atom feed

From: Alexey Brodkin <Alexey.Brodkin@synopsys.com>
To: "nicolas.pitre@linaro.org" <nicolas.pitre@linaro.org>
Cc: "rmk+kernel@arm.linux.org.uk" <rmk+kernel@arm.linux.org.uk>,
	"linux-kernel@vger.kernel.org" <linux-kernel@vger.kernel.org>,
	"Vineet.Gupta1@synopsys.com" <Vineet.Gupta1@synopsys.com>,
	"shemminger@linux-foundation.org"
	<shemminger@linux-foundation.org>,
	"mingo@elte.hu" <mingo@elte.hu>,
	"linux-snps-arc@lists.infradead.org" 
	<linux-snps-arc@lists.infradead.org>,
	"davem@davemloft.net" <davem@davemloft.net>
Subject: Re: [PATCH] __div64_32: implement division by multiplication for 32-bit arches
Date: Fri, 30 Oct 2015 14:28:39 +0000	[thread overview]
Message-ID: <1446215318.4394.32.camel@synopsys.com> (raw)
In-Reply-To: <alpine.LFD.2.20.1510292112590.630@knanqh.ubzr>

[-- Warning: decoded text below may be mangled, UTF-8 assumed --]
[-- Attachment #1: Type: text/plain; charset="utf-8", Size: 4738 bytes --]

Hi Nicolas,

On Thu, 2015-10-29 at 21:26 -0400, Nicolas Pitre wrote:
> On Wed, 28 Oct 2015, Nicolas Pitre wrote:
> 
> > On Thu, 29 Oct 2015, Alexey Brodkin wrote:
> > 
> > > Fortunately we already have much better __div64_32() for 32-bit ARM.
> > > There in case of division by constant preprocessor calculates so-called
> > > "magic number" which is later used in multiplications instead of divisions.
> > 
> > It's not magic, it is science.  :-)
> > 
> > > It's really nice and very optimal but obviously works only for ARM
> > > because ARM assembly is involved.
> > > 
> > > Now why don't we extend the same approach to all other 32-bit arches
> > > with multiplication part implemented in pure C. With good compiler
> > > resulting assembly will be quite close to manually written assembly.
> 
> Well... not as close at least on ARM.  Maybe 2x to 3x more costly than 
> the one with assembly.  Still better than 100x or so without this 
> optimization.

Indeed even having that function 25 times faster instead of 100 times is
already quite an achievement. For example that will already cure my iperf
performance degradation: I'll see do_div() taking < 1% instead of > 10% now.

My test source was:
--------------------->8------------------------
int myfunc(u64 data)
{
	return do_div(data, 1000);
}
--------------------->8------------------------

Now take a look at disassembly that I'm getting:
--------------------->8------------------------
0000062c <myfunc>:
 62c:	19 28 86 0f 4f 8d 3b df 	mpydu      r6,r0,0x8d4fdf3b
 634:	00 26 86 8f 4f 8d 3b df 	add.f      r6,r6,0x8d4fdf3b
 63c:	02 27 87 0f ed 7c 69 91 	sub        r7,r7,0x7ced9169
 644:	c0 27 65 00             	add.c      r7,r7,1
 648:	85 0e c4 71 12 83 97 6e 	brlo       0x83126e97,r7,6cc <myfunc+0xa0>

 650:	75 0f 80 0f 12 83 97 6e 	breq       r7,0x83126e97,6c4 <myfunc+0x98>

 658:	0d 70                   	mov_s      r8,0
 65a:	2d 71                   	mov_s      r9,1
 65c:	19 29 8a 0f 4f 8d 3b df 	mpydu      r10,r1,0x8d4fdf3b
 664:	00 27 84 82             	add.f      r4,r7,r10
 668:	ac 70                   	mov_s      r5,0
 66a:	19 28 86 0f 12 83 97 6e 	mpydu      r6,r0,0x83126e97
 672:	01 25 c5 02             	adc        r5,r5,r11
 676:	00 24 04 82             	add.f      r4,r4,r8
 67a:	00 25 45 02             	add        r5,r5,r9
 67e:	c0 25 65 00             	add.c      r5,r5,1
 682:	00 26 06 81             	add.f      r6,r6,r4
 686:	01 27 47 01             	adc        r7,r7,r5
 68a:	51 0f 44 01             	brlo       r7,r5,6d8 <myfunc+0xac>

 68e:	49 0d c0 01             	breq       r5,r7,6d4 <myfunc+0xa8>

 692:	8c 70                   	mov_s      r4,0
 694:	ac 70                   	mov_s      r5,0
 696:	e0 42                   	mov_s      r2,r7
 698:	19 29 86 0f 12 83 97 6e 	mpydu      r6,r1,0x83126e97
 6a0:	00 22 82 81             	add.f      r2,r2,r6
 6a4:	6c 70                   	mov_s      r3,0
 6a6:	01 23 c3 01             	adc        r3,r3,r7
 6aa:	00 22 02 81             	add.f      r2,r2,r4
 6ae:	a0 73                   	add_s      r3,r3,r5
 6b0:	c0 23 65 00             	add.c      r3,r3,1
 6b4:	29 ba                   	lsr_s      r2,r2,9
 6b6:	17 bb                   	asl_s      r3,r3,23
 6b8:	65 7a                   	or_s       r2,r2,r3
 6ba:	9a 22 0f 0a             	mpy        r2,r2,0x3e8
 6be:	e0 7f                   	j_s.d      [blink]
 6c0:	42 78                   	sub_s      r0,r0,r2
 6c2:	e0 78                   	nop_s      
 6c4:	95 0e 85 f1 4f 8d 3a df 	brhs.nt    0x8d4fdf3a,r6,658 <myfunc+0x2c>

 6cc:	0d 70                   	mov_s      r8,0
 6ce:	91 07 ef ff             	b.d        65c <myfunc+0x30>

 6d2:	2d 70                   	mov_s      r9,0
 6d4:	bf 0e 05 81             	brhs.nt    r6,r4,692 <myfunc+0x66>

 6d8:	8c 70                   	mov_s      r4,0
 6da:	bf 07 ef ff             	b.d        696 <myfunc+0x6a>

 6de:	ac 71                   	mov_s      r5,1
--------------------->8------------------------

What you see here is pretty straight-forward conversion to assembly of "run-time calculations".
Things to note:
 [1] Only 5 multiplications are used. That's because we have 32x32 multiplication unit
     that returns 64-bit result in register pair.

 [2] Indeed lots of moves and additions happen here.

So my conclusion would be:
 [1] Proposed implementation makes perfect sense because already speeds-up do_div()
     significantly.

 [2] Ability to substitute "run-time calculations" on per-arch basis would be awsome
     because with few lines of assembly another 2-4 times of improvement could be
     achieved.

-Alexeyÿôèº{.nÇ+‰·Ÿ®‰†+%ŠËÿ±éÝ¶\x17¥Šwÿº{.nÇ+‰·¥Š{±þG«éÿŠ{ayº\x1dÊ‡Ú™ë,j\a¢f£¢·hšïêÿ‘êçz_è®\x03(éšŽŠÝ¢j"ú\x1a¶^[m§ÿÿ¾\a«þG«éÿ¢¸?™¨èÚ&£ø§~á¶iO•æ¬z·švØ^\x14\x04\x1a¶^[m§ÿÿÃ\fÿ¶ìÿ¢¸?–I¥

next prev parent reply	other threads:[~2015-10-30 14:28 UTC|newest]

Thread overview: 24+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2015-10-28 22:47 [PATCH] __div64_32: implement division by multiplication for 32-bit arches Alexey Brodkin
2015-10-28 23:32 ` Nicolas Pitre
2015-10-29  7:34   ` Alexey Brodkin
2015-10-30  1:26   ` Nicolas Pitre
2015-10-30  5:41     ` Vineet Gupta
2015-10-30 12:41       ` Måns Rullgård
2015-10-30 12:40     ` Måns Rullgård
2015-10-30 15:17       ` Nicolas Pitre
2015-10-30 15:54         ` Alexey Brodkin
2015-10-30 16:55           ` Nicolas Pitre
2015-10-30 17:45             ` Måns Rullgård
2015-11-04 23:46               ` Nicolas Pitre
2015-11-04 23:48                 ` Nicolas Pitre
2015-11-05  3:13                   ` Vineet Gupta
2015-11-05  5:06                     ` Nicolas Pitre
2015-11-04 23:49                 ` Måns Rullgård
2015-10-30 14:28     ` Alexey Brodkin [this message]
2015-10-29  0:36 ` kbuild test robot
2015-10-29 12:52 ` Måns Rullgård
2015-10-29 13:05   ` Alexey Brodkin
2015-10-29 13:37     ` Måns Rullgård
2015-10-29 13:31 ` Russell King - ARM Linux
2015-10-29 14:32   ` Alexey Brodkin
2015-10-29 17:09 ` Randy Dunlap

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=1446215318.4394.32.camel@synopsys.com \
    --to=alexey.brodkin@synopsys.com \
    --cc=Vineet.Gupta1@synopsys.com \
    --cc=davem@davemloft.net \
    --cc=linux-kernel@vger.kernel.org \
    --cc=linux-snps-arc@lists.infradead.org \
    --cc=mingo@elte.hu \
    --cc=nicolas.pitre@linaro.org \
    --cc=rmk+kernel@arm.linux.org.uk \
    --cc=shemminger@linux-foundation.org \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link

Be sure your reply has a Subject: header at the top and a blank line before the message body.

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox