All of lore.kernel.org
 help / color / mirror / Atom feed
From: Alexey.Brodkin@synopsys.com (Alexey Brodkin)
To: linux-snps-arc@lists.infradead.org
Subject: [PATCH] __div64_32: implement division by multiplication for 32-bit arches
Date: Fri, 30 Oct 2015 14:28:39 +0000	[thread overview]
Message-ID: <1446215318.4394.32.camel@synopsys.com> (raw)
In-Reply-To: <alpine.LFD.2.20.1510292112590.630@knanqh.ubzr>

Hi Nicolas,

On Thu, 2015-10-29@21:26 -0400, Nicolas Pitre wrote:
> On Wed, 28 Oct 2015, Nicolas Pitre wrote:
> 
> > On Thu, 29 Oct 2015, Alexey Brodkin wrote:
> > 
> > > Fortunately we already have much better __div64_32() for 32-bit ARM.
> > > There in case of division by constant preprocessor calculates so-called
> > > "magic number" which is later used in multiplications instead of divisions.
> > 
> > It's not magic, it is science.  :-)
> > 
> > > It's really nice and very optimal but obviously works only for ARM
> > > because ARM assembly is involved.
> > > 
> > > Now why don't we extend the same approach to all other 32-bit arches
> > > with multiplication part implemented in pure C. With good compiler
> > > resulting assembly will be quite close to manually written assembly.
> 
> Well... not as close at least on ARM.  Maybe 2x to 3x more costly than 
> the one with assembly.  Still better than 100x or so without this 
> optimization.

Indeed even having that function 25 times faster instead of 100 times is
already quite an achievement. For example that will already cure my iperf
performance degradation: I'll see do_div() taking < 1% instead of > 10% now.

My test source was:
--------------------->8------------------------
int myfunc(u64 data)
{
	return do_div(data, 1000);
}
--------------------->8------------------------

Now take a look at disassembly that I'm getting:
--------------------->8------------------------
0000062c <myfunc>:
 62c:	19 28 86 0f 4f 8d 3b df 	mpydu      r6,r0,0x8d4fdf3b
 634:	00 26 86 8f 4f 8d 3b df 	add.f      r6,r6,0x8d4fdf3b
 63c:	02 27 87 0f ed 7c 69 91 	sub        r7,r7,0x7ced9169
 644:	c0 27 65 00             	add.c      r7,r7,1
 648:	85 0e c4 71 12 83 97 6e 	brlo       0x83126e97,r7,6cc <myfunc+0xa0>

 650:	75 0f 80 0f 12 83 97 6e 	breq       r7,0x83126e97,6c4 <myfunc+0x98>

 658:	0d 70                   	mov_s      r8,0
 65a:	2d 71                   	mov_s      r9,1
 65c:	19 29 8a 0f 4f 8d 3b df 	mpydu      r10,r1,0x8d4fdf3b
 664:	00 27 84 82             	add.f      r4,r7,r10
 668:	ac 70                   	mov_s      r5,0
 66a:	19 28 86 0f 12 83 97 6e 	mpydu      r6,r0,0x83126e97
 672:	01 25 c5 02             	adc        r5,r5,r11
 676:	00 24 04 82             	add.f      r4,r4,r8
 67a:	00 25 45 02             	add        r5,r5,r9
 67e:	c0 25 65 00             	add.c      r5,r5,1
 682:	00 26 06 81             	add.f      r6,r6,r4
 686:	01 27 47 01             	adc        r7,r7,r5
 68a:	51 0f 44 01             	brlo       r7,r5,6d8 <myfunc+0xac>

 68e:	49 0d c0 01             	breq       r5,r7,6d4 <myfunc+0xa8>

 692:	8c 70                   	mov_s      r4,0
 694:	ac 70                   	mov_s      r5,0
 696:	e0 42                   	mov_s      r2,r7
 698:	19 29 86 0f 12 83 97 6e 	mpydu      r6,r1,0x83126e97
 6a0:	00 22 82 81             	add.f      r2,r2,r6
 6a4:	6c 70                   	mov_s      r3,0
 6a6:	01 23 c3 01             	adc        r3,r3,r7
 6aa:	00 22 02 81             	add.f      r2,r2,r4
 6ae:	a0 73                   	add_s      r3,r3,r5
 6b0:	c0 23 65 00             	add.c      r3,r3,1
 6b4:	29 ba                   	lsr_s      r2,r2,9
 6b6:	17 bb                   	asl_s      r3,r3,23
 6b8:	65 7a                   	or_s       r2,r2,r3
 6ba:	9a 22 0f 0a             	mpy        r2,r2,0x3e8
 6be:	e0 7f                   	j_s.d      [blink]
 6c0:	42 78                   	sub_s      r0,r0,r2
 6c2:	e0 78                   	nop_s      
 6c4:	95 0e 85 f1 4f 8d 3a df 	brhs.nt    0x8d4fdf3a,r6,658 <myfunc+0x2c>

 6cc:	0d 70                   	mov_s      r8,0
 6ce:	91 07 ef ff             	b.d        65c <myfunc+0x30>

 6d2:	2d 70                   	mov_s      r9,0
 6d4:	bf 0e 05 81             	brhs.nt    r6,r4,692 <myfunc+0x66>

 6d8:	8c 70                   	mov_s      r4,0
 6da:	bf 07 ef ff             	b.d        696 <myfunc+0x6a>

 6de:	ac 71                   	mov_s      r5,1
--------------------->8------------------------

What you see here is pretty straight-forward conversion to assembly of "run-time calculations".
Things to note:
 [1] Only 5 multiplications are used. That's because we have 32x32 multiplication unit
     that returns 64-bit result in register pair.

 [2] Indeed lots of moves and additions happen here.

So my conclusion would be:
 [1] Proposed implementation makes perfect sense because already speeds-up do_div()
     significantly.

 [2] Ability to substitute "run-time calculations" on per-arch basis would be awsome
     because with few lines of assembly another 2-4 times of improvement could be
     achieved.

-Alexey

WARNING: multiple messages have this Message-ID (diff)
From: Alexey Brodkin <Alexey.Brodkin@synopsys.com>
To: "nicolas.pitre@linaro.org" <nicolas.pitre@linaro.org>
Cc: "rmk+kernel@arm.linux.org.uk" <rmk+kernel@arm.linux.org.uk>,
	"linux-kernel@vger.kernel.org" <linux-kernel@vger.kernel.org>,
	"Vineet.Gupta1@synopsys.com" <Vineet.Gupta1@synopsys.com>,
	"shemminger@linux-foundation.org"
	<shemminger@linux-foundation.org>,
	"mingo@elte.hu" <mingo@elte.hu>,
	"linux-snps-arc@lists.infradead.org" 
	<linux-snps-arc@lists.infradead.org>,
	"davem@davemloft.net" <davem@davemloft.net>
Subject: Re: [PATCH] __div64_32: implement division by multiplication for 32-bit arches
Date: Fri, 30 Oct 2015 14:28:39 +0000	[thread overview]
Message-ID: <1446215318.4394.32.camel@synopsys.com> (raw)
In-Reply-To: <alpine.LFD.2.20.1510292112590.630@knanqh.ubzr>

[-- Warning: decoded text below may be mangled, UTF-8 assumed --]
[-- Attachment #1: Type: text/plain; charset="utf-8", Size: 4738 bytes --]

Hi Nicolas,

On Thu, 2015-10-29 at 21:26 -0400, Nicolas Pitre wrote:
> On Wed, 28 Oct 2015, Nicolas Pitre wrote:
> 
> > On Thu, 29 Oct 2015, Alexey Brodkin wrote:
> > 
> > > Fortunately we already have much better __div64_32() for 32-bit ARM.
> > > There in case of division by constant preprocessor calculates so-called
> > > "magic number" which is later used in multiplications instead of divisions.
> > 
> > It's not magic, it is science.  :-)
> > 
> > > It's really nice and very optimal but obviously works only for ARM
> > > because ARM assembly is involved.
> > > 
> > > Now why don't we extend the same approach to all other 32-bit arches
> > > with multiplication part implemented in pure C. With good compiler
> > > resulting assembly will be quite close to manually written assembly.
> 
> Well... not as close at least on ARM.  Maybe 2x to 3x more costly than 
> the one with assembly.  Still better than 100x or so without this 
> optimization.

Indeed even having that function 25 times faster instead of 100 times is
already quite an achievement. For example that will already cure my iperf
performance degradation: I'll see do_div() taking < 1% instead of > 10% now.

My test source was:
--------------------->8------------------------
int myfunc(u64 data)
{
	return do_div(data, 1000);
}
--------------------->8------------------------

Now take a look at disassembly that I'm getting:
--------------------->8------------------------
0000062c <myfunc>:
 62c:	19 28 86 0f 4f 8d 3b df 	mpydu      r6,r0,0x8d4fdf3b
 634:	00 26 86 8f 4f 8d 3b df 	add.f      r6,r6,0x8d4fdf3b
 63c:	02 27 87 0f ed 7c 69 91 	sub        r7,r7,0x7ced9169
 644:	c0 27 65 00             	add.c      r7,r7,1
 648:	85 0e c4 71 12 83 97 6e 	brlo       0x83126e97,r7,6cc <myfunc+0xa0>

 650:	75 0f 80 0f 12 83 97 6e 	breq       r7,0x83126e97,6c4 <myfunc+0x98>

 658:	0d 70                   	mov_s      r8,0
 65a:	2d 71                   	mov_s      r9,1
 65c:	19 29 8a 0f 4f 8d 3b df 	mpydu      r10,r1,0x8d4fdf3b
 664:	00 27 84 82             	add.f      r4,r7,r10
 668:	ac 70                   	mov_s      r5,0
 66a:	19 28 86 0f 12 83 97 6e 	mpydu      r6,r0,0x83126e97
 672:	01 25 c5 02             	adc        r5,r5,r11
 676:	00 24 04 82             	add.f      r4,r4,r8
 67a:	00 25 45 02             	add        r5,r5,r9
 67e:	c0 25 65 00             	add.c      r5,r5,1
 682:	00 26 06 81             	add.f      r6,r6,r4
 686:	01 27 47 01             	adc        r7,r7,r5
 68a:	51 0f 44 01             	brlo       r7,r5,6d8 <myfunc+0xac>

 68e:	49 0d c0 01             	breq       r5,r7,6d4 <myfunc+0xa8>

 692:	8c 70                   	mov_s      r4,0
 694:	ac 70                   	mov_s      r5,0
 696:	e0 42                   	mov_s      r2,r7
 698:	19 29 86 0f 12 83 97 6e 	mpydu      r6,r1,0x83126e97
 6a0:	00 22 82 81             	add.f      r2,r2,r6
 6a4:	6c 70                   	mov_s      r3,0
 6a6:	01 23 c3 01             	adc        r3,r3,r7
 6aa:	00 22 02 81             	add.f      r2,r2,r4
 6ae:	a0 73                   	add_s      r3,r3,r5
 6b0:	c0 23 65 00             	add.c      r3,r3,1
 6b4:	29 ba                   	lsr_s      r2,r2,9
 6b6:	17 bb                   	asl_s      r3,r3,23
 6b8:	65 7a                   	or_s       r2,r2,r3
 6ba:	9a 22 0f 0a             	mpy        r2,r2,0x3e8
 6be:	e0 7f                   	j_s.d      [blink]
 6c0:	42 78                   	sub_s      r0,r0,r2
 6c2:	e0 78                   	nop_s      
 6c4:	95 0e 85 f1 4f 8d 3a df 	brhs.nt    0x8d4fdf3a,r6,658 <myfunc+0x2c>

 6cc:	0d 70                   	mov_s      r8,0
 6ce:	91 07 ef ff             	b.d        65c <myfunc+0x30>

 6d2:	2d 70                   	mov_s      r9,0
 6d4:	bf 0e 05 81             	brhs.nt    r6,r4,692 <myfunc+0x66>

 6d8:	8c 70                   	mov_s      r4,0
 6da:	bf 07 ef ff             	b.d        696 <myfunc+0x6a>

 6de:	ac 71                   	mov_s      r5,1
--------------------->8------------------------

What you see here is pretty straight-forward conversion to assembly of "run-time calculations".
Things to note:
 [1] Only 5 multiplications are used. That's because we have 32x32 multiplication unit
     that returns 64-bit result in register pair.

 [2] Indeed lots of moves and additions happen here.

So my conclusion would be:
 [1] Proposed implementation makes perfect sense because already speeds-up do_div()
     significantly.

 [2] Ability to substitute "run-time calculations" on per-arch basis would be awsome
     because with few lines of assembly another 2-4 times of improvement could be
     achieved.

-Alexeyÿôèº{.nÇ+‰·Ÿ®‰­†+%ŠËÿ±éݶ\x17¥Šwÿº{.nÇ+‰·¥Š{±þG«éÿŠ{ayº\x1dʇڙë,j\a­¢f£¢·hšïêÿ‘êçz_è®\x03(­éšŽŠÝ¢j"ú\x1a¶^[m§ÿÿ¾\a«þG«éÿ¢¸?™¨è­Ú&£ø§~á¶iO•æ¬z·švØ^\x14\x04\x1a¶^[m§ÿÿÃ\fÿ¶ìÿ¢¸?–I¥

  parent reply	other threads:[~2015-10-30 14:28 UTC|newest]

Thread overview: 48+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2015-10-28 22:47 [PATCH] __div64_32: implement division by multiplication for 32-bit arches Alexey Brodkin
2015-10-28 22:47 ` Alexey Brodkin
2015-10-28 23:32 ` Nicolas Pitre
2015-10-28 23:32   ` Nicolas Pitre
2015-10-29  7:34   ` Alexey Brodkin
2015-10-29  7:34     ` Alexey Brodkin
2015-10-30  1:26   ` Nicolas Pitre
2015-10-30  1:26     ` Nicolas Pitre
2015-10-30  5:41     ` Vineet Gupta
2015-10-30  5:41       ` Vineet Gupta
2015-10-30 12:41       ` Måns Rullgård
2015-10-30 12:41         ` Måns Rullgård
2015-10-30 12:40     ` Måns Rullgård
2015-10-30 12:40       ` Måns Rullgård
2015-10-30 15:17       ` Nicolas Pitre
2015-10-30 15:17         ` Nicolas Pitre
2015-10-30 15:54         ` Alexey Brodkin
2015-10-30 15:54           ` Alexey Brodkin
2015-10-30 16:55           ` Nicolas Pitre
2015-10-30 16:55             ` Nicolas Pitre
2015-10-30 17:45             ` Måns Rullgård
2015-10-30 17:45               ` Måns Rullgård
2015-11-04 23:46               ` Nicolas Pitre
2015-11-04 23:46                 ` Nicolas Pitre
2015-11-04 23:48                 ` Nicolas Pitre
2015-11-04 23:48                   ` Nicolas Pitre
2015-11-05  3:13                   ` Vineet Gupta
2015-11-05  3:13                     ` Vineet Gupta
2015-11-05  5:06                     ` Nicolas Pitre
2015-11-05  5:06                       ` Nicolas Pitre
2015-11-04 23:49                 ` Måns Rullgård
2015-11-04 23:49                   ` Måns Rullgård
2015-10-30 14:28     ` Alexey Brodkin [this message]
2015-10-30 14:28       ` Alexey Brodkin
2015-10-29  0:36 ` kbuild test robot
2015-10-29  0:36   ` kbuild test robot
2015-10-29 12:52 ` Måns Rullgård
2015-10-29 12:52   ` Måns Rullgård
2015-10-29 13:05   ` Alexey Brodkin
2015-10-29 13:05     ` Alexey Brodkin
2015-10-29 13:37     ` Måns Rullgård
2015-10-29 13:37       ` Måns Rullgård
2015-10-29 13:31 ` Russell King - ARM Linux
2015-10-29 13:31   ` Russell King - ARM Linux
2015-10-29 14:32   ` Alexey Brodkin
2015-10-29 14:32     ` Alexey Brodkin
2015-10-29 17:09 ` Randy Dunlap
2015-10-29 17:09   ` Randy Dunlap

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=1446215318.4394.32.camel@synopsys.com \
    --to=alexey.brodkin@synopsys.com \
    --cc=linux-snps-arc@lists.infradead.org \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line before the message body.
This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.