From: Alexey.Brodkin@synopsys.com (Alexey Brodkin)
To: linux-snps-arc@lists.infradead.org
Subject: [PATCH] __div64_32: implement division by multiplication for 32-bit arches
Date: Fri, 30 Oct 2015 14:28:39 +0000 [thread overview]
Message-ID: <1446215318.4394.32.camel@synopsys.com> (raw)
In-Reply-To: <alpine.LFD.2.20.1510292112590.630@knanqh.ubzr>
Hi Nicolas,
On Thu, 2015-10-29@21:26 -0400, Nicolas Pitre wrote:
> On Wed, 28 Oct 2015, Nicolas Pitre wrote:
>
> > On Thu, 29 Oct 2015, Alexey Brodkin wrote:
> >
> > > Fortunately we already have much better __div64_32() for 32-bit ARM.
> > > There in case of division by constant preprocessor calculates so-called
> > > "magic number" which is later used in multiplications instead of divisions.
> >
> > It's not magic, it is science. :-)
> >
> > > It's really nice and very optimal but obviously works only for ARM
> > > because ARM assembly is involved.
> > >
> > > Now why don't we extend the same approach to all other 32-bit arches
> > > with multiplication part implemented in pure C. With good compiler
> > > resulting assembly will be quite close to manually written assembly.
>
> Well... not as close at least on ARM. Maybe 2x to 3x more costly than
> the one with assembly. Still better than 100x or so without this
> optimization.
Indeed even having that function 25 times faster instead of 100 times is
already quite an achievement. For example that will already cure my iperf
performance degradation: I'll see do_div() taking < 1% instead of > 10% now.
My test source was:
--------------------->8------------------------
int myfunc(u64 data)
{
return do_div(data, 1000);
}
--------------------->8------------------------
Now take a look at disassembly that I'm getting:
--------------------->8------------------------
0000062c <myfunc>:
62c: 19 28 86 0f 4f 8d 3b df mpydu r6,r0,0x8d4fdf3b
634: 00 26 86 8f 4f 8d 3b df add.f r6,r6,0x8d4fdf3b
63c: 02 27 87 0f ed 7c 69 91 sub r7,r7,0x7ced9169
644: c0 27 65 00 add.c r7,r7,1
648: 85 0e c4 71 12 83 97 6e brlo 0x83126e97,r7,6cc <myfunc+0xa0>
650: 75 0f 80 0f 12 83 97 6e breq r7,0x83126e97,6c4 <myfunc+0x98>
658: 0d 70 mov_s r8,0
65a: 2d 71 mov_s r9,1
65c: 19 29 8a 0f 4f 8d 3b df mpydu r10,r1,0x8d4fdf3b
664: 00 27 84 82 add.f r4,r7,r10
668: ac 70 mov_s r5,0
66a: 19 28 86 0f 12 83 97 6e mpydu r6,r0,0x83126e97
672: 01 25 c5 02 adc r5,r5,r11
676: 00 24 04 82 add.f r4,r4,r8
67a: 00 25 45 02 add r5,r5,r9
67e: c0 25 65 00 add.c r5,r5,1
682: 00 26 06 81 add.f r6,r6,r4
686: 01 27 47 01 adc r7,r7,r5
68a: 51 0f 44 01 brlo r7,r5,6d8 <myfunc+0xac>
68e: 49 0d c0 01 breq r5,r7,6d4 <myfunc+0xa8>
692: 8c 70 mov_s r4,0
694: ac 70 mov_s r5,0
696: e0 42 mov_s r2,r7
698: 19 29 86 0f 12 83 97 6e mpydu r6,r1,0x83126e97
6a0: 00 22 82 81 add.f r2,r2,r6
6a4: 6c 70 mov_s r3,0
6a6: 01 23 c3 01 adc r3,r3,r7
6aa: 00 22 02 81 add.f r2,r2,r4
6ae: a0 73 add_s r3,r3,r5
6b0: c0 23 65 00 add.c r3,r3,1
6b4: 29 ba lsr_s r2,r2,9
6b6: 17 bb asl_s r3,r3,23
6b8: 65 7a or_s r2,r2,r3
6ba: 9a 22 0f 0a mpy r2,r2,0x3e8
6be: e0 7f j_s.d [blink]
6c0: 42 78 sub_s r0,r0,r2
6c2: e0 78 nop_s
6c4: 95 0e 85 f1 4f 8d 3a df brhs.nt 0x8d4fdf3a,r6,658 <myfunc+0x2c>
6cc: 0d 70 mov_s r8,0
6ce: 91 07 ef ff b.d 65c <myfunc+0x30>
6d2: 2d 70 mov_s r9,0
6d4: bf 0e 05 81 brhs.nt r6,r4,692 <myfunc+0x66>
6d8: 8c 70 mov_s r4,0
6da: bf 07 ef ff b.d 696 <myfunc+0x6a>
6de: ac 71 mov_s r5,1
--------------------->8------------------------
What you see here is pretty straight-forward conversion to assembly of "run-time calculations".
Things to note:
[1] Only 5 multiplications are used. That's because we have 32x32 multiplication unit
that returns 64-bit result in register pair.
[2] Indeed lots of moves and additions happen here.
So my conclusion would be:
[1] Proposed implementation makes perfect sense because already speeds-up do_div()
significantly.
[2] Ability to substitute "run-time calculations" on per-arch basis would be awsome
because with few lines of assembly another 2-4 times of improvement could be
achieved.
-Alexey
WARNING: multiple messages have this Message-ID (diff)
From: Alexey Brodkin <Alexey.Brodkin@synopsys.com>
To: "nicolas.pitre@linaro.org" <nicolas.pitre@linaro.org>
Cc: "rmk+kernel@arm.linux.org.uk" <rmk+kernel@arm.linux.org.uk>,
"linux-kernel@vger.kernel.org" <linux-kernel@vger.kernel.org>,
"Vineet.Gupta1@synopsys.com" <Vineet.Gupta1@synopsys.com>,
"shemminger@linux-foundation.org"
<shemminger@linux-foundation.org>,
"mingo@elte.hu" <mingo@elte.hu>,
"linux-snps-arc@lists.infradead.org"
<linux-snps-arc@lists.infradead.org>,
"davem@davemloft.net" <davem@davemloft.net>
Subject: Re: [PATCH] __div64_32: implement division by multiplication for 32-bit arches
Date: Fri, 30 Oct 2015 14:28:39 +0000 [thread overview]
Message-ID: <1446215318.4394.32.camel@synopsys.com> (raw)
In-Reply-To: <alpine.LFD.2.20.1510292112590.630@knanqh.ubzr>
[-- Warning: decoded text below may be mangled, UTF-8 assumed --]
[-- Attachment #1: Type: text/plain; charset="utf-8", Size: 4738 bytes --]
Hi Nicolas,
On Thu, 2015-10-29 at 21:26 -0400, Nicolas Pitre wrote:
> On Wed, 28 Oct 2015, Nicolas Pitre wrote:
>
> > On Thu, 29 Oct 2015, Alexey Brodkin wrote:
> >
> > > Fortunately we already have much better __div64_32() for 32-bit ARM.
> > > There in case of division by constant preprocessor calculates so-called
> > > "magic number" which is later used in multiplications instead of divisions.
> >
> > It's not magic, it is science. :-)
> >
> > > It's really nice and very optimal but obviously works only for ARM
> > > because ARM assembly is involved.
> > >
> > > Now why don't we extend the same approach to all other 32-bit arches
> > > with multiplication part implemented in pure C. With good compiler
> > > resulting assembly will be quite close to manually written assembly.
>
> Well... not as close at least on ARM. Maybe 2x to 3x more costly than
> the one with assembly. Still better than 100x or so without this
> optimization.
Indeed even having that function 25 times faster instead of 100 times is
already quite an achievement. For example that will already cure my iperf
performance degradation: I'll see do_div() taking < 1% instead of > 10% now.
My test source was:
--------------------->8------------------------
int myfunc(u64 data)
{
return do_div(data, 1000);
}
--------------------->8------------------------
Now take a look at disassembly that I'm getting:
--------------------->8------------------------
0000062c <myfunc>:
62c: 19 28 86 0f 4f 8d 3b df mpydu r6,r0,0x8d4fdf3b
634: 00 26 86 8f 4f 8d 3b df add.f r6,r6,0x8d4fdf3b
63c: 02 27 87 0f ed 7c 69 91 sub r7,r7,0x7ced9169
644: c0 27 65 00 add.c r7,r7,1
648: 85 0e c4 71 12 83 97 6e brlo 0x83126e97,r7,6cc <myfunc+0xa0>
650: 75 0f 80 0f 12 83 97 6e breq r7,0x83126e97,6c4 <myfunc+0x98>
658: 0d 70 mov_s r8,0
65a: 2d 71 mov_s r9,1
65c: 19 29 8a 0f 4f 8d 3b df mpydu r10,r1,0x8d4fdf3b
664: 00 27 84 82 add.f r4,r7,r10
668: ac 70 mov_s r5,0
66a: 19 28 86 0f 12 83 97 6e mpydu r6,r0,0x83126e97
672: 01 25 c5 02 adc r5,r5,r11
676: 00 24 04 82 add.f r4,r4,r8
67a: 00 25 45 02 add r5,r5,r9
67e: c0 25 65 00 add.c r5,r5,1
682: 00 26 06 81 add.f r6,r6,r4
686: 01 27 47 01 adc r7,r7,r5
68a: 51 0f 44 01 brlo r7,r5,6d8 <myfunc+0xac>
68e: 49 0d c0 01 breq r5,r7,6d4 <myfunc+0xa8>
692: 8c 70 mov_s r4,0
694: ac 70 mov_s r5,0
696: e0 42 mov_s r2,r7
698: 19 29 86 0f 12 83 97 6e mpydu r6,r1,0x83126e97
6a0: 00 22 82 81 add.f r2,r2,r6
6a4: 6c 70 mov_s r3,0
6a6: 01 23 c3 01 adc r3,r3,r7
6aa: 00 22 02 81 add.f r2,r2,r4
6ae: a0 73 add_s r3,r3,r5
6b0: c0 23 65 00 add.c r3,r3,1
6b4: 29 ba lsr_s r2,r2,9
6b6: 17 bb asl_s r3,r3,23
6b8: 65 7a or_s r2,r2,r3
6ba: 9a 22 0f 0a mpy r2,r2,0x3e8
6be: e0 7f j_s.d [blink]
6c0: 42 78 sub_s r0,r0,r2
6c2: e0 78 nop_s
6c4: 95 0e 85 f1 4f 8d 3a df brhs.nt 0x8d4fdf3a,r6,658 <myfunc+0x2c>
6cc: 0d 70 mov_s r8,0
6ce: 91 07 ef ff b.d 65c <myfunc+0x30>
6d2: 2d 70 mov_s r9,0
6d4: bf 0e 05 81 brhs.nt r6,r4,692 <myfunc+0x66>
6d8: 8c 70 mov_s r4,0
6da: bf 07 ef ff b.d 696 <myfunc+0x6a>
6de: ac 71 mov_s r5,1
--------------------->8------------------------
What you see here is pretty straight-forward conversion to assembly of "run-time calculations".
Things to note:
[1] Only 5 multiplications are used. That's because we have 32x32 multiplication unit
that returns 64-bit result in register pair.
[2] Indeed lots of moves and additions happen here.
So my conclusion would be:
[1] Proposed implementation makes perfect sense because already speeds-up do_div()
significantly.
[2] Ability to substitute "run-time calculations" on per-arch basis would be awsome
because with few lines of assembly another 2-4 times of improvement could be
achieved.
-Alexeyÿôèº{.nÇ+·®+%Ëÿ±éݶ\x17¥wÿº{.nÇ+·¥{±þG«éÿ{ayº\x1dÊÚë,j\a¢f£¢·hïêÿêçz_è®\x03(éÝ¢j"ú\x1a¶^[m§ÿÿ¾\a«þG«éÿ¢¸?¨èÚ&£ø§~á¶iOæ¬z·vØ^\x14\x04\x1a¶^[m§ÿÿÃ\fÿ¶ìÿ¢¸?I¥
next prev parent reply other threads:[~2015-10-30 14:28 UTC|newest]
Thread overview: 48+ messages / expand[flat|nested] mbox.gz Atom feed top
2015-10-28 22:47 [PATCH] __div64_32: implement division by multiplication for 32-bit arches Alexey Brodkin
2015-10-28 22:47 ` Alexey Brodkin
2015-10-28 23:32 ` Nicolas Pitre
2015-10-28 23:32 ` Nicolas Pitre
2015-10-29 7:34 ` Alexey Brodkin
2015-10-29 7:34 ` Alexey Brodkin
2015-10-30 1:26 ` Nicolas Pitre
2015-10-30 1:26 ` Nicolas Pitre
2015-10-30 5:41 ` Vineet Gupta
2015-10-30 5:41 ` Vineet Gupta
2015-10-30 12:41 ` Måns Rullgård
2015-10-30 12:41 ` Måns Rullgård
2015-10-30 12:40 ` Måns Rullgård
2015-10-30 12:40 ` Måns Rullgård
2015-10-30 15:17 ` Nicolas Pitre
2015-10-30 15:17 ` Nicolas Pitre
2015-10-30 15:54 ` Alexey Brodkin
2015-10-30 15:54 ` Alexey Brodkin
2015-10-30 16:55 ` Nicolas Pitre
2015-10-30 16:55 ` Nicolas Pitre
2015-10-30 17:45 ` Måns Rullgård
2015-10-30 17:45 ` Måns Rullgård
2015-11-04 23:46 ` Nicolas Pitre
2015-11-04 23:46 ` Nicolas Pitre
2015-11-04 23:48 ` Nicolas Pitre
2015-11-04 23:48 ` Nicolas Pitre
2015-11-05 3:13 ` Vineet Gupta
2015-11-05 3:13 ` Vineet Gupta
2015-11-05 5:06 ` Nicolas Pitre
2015-11-05 5:06 ` Nicolas Pitre
2015-11-04 23:49 ` Måns Rullgård
2015-11-04 23:49 ` Måns Rullgård
2015-10-30 14:28 ` Alexey Brodkin [this message]
2015-10-30 14:28 ` Alexey Brodkin
2015-10-29 0:36 ` kbuild test robot
2015-10-29 0:36 ` kbuild test robot
2015-10-29 12:52 ` Måns Rullgård
2015-10-29 12:52 ` Måns Rullgård
2015-10-29 13:05 ` Alexey Brodkin
2015-10-29 13:05 ` Alexey Brodkin
2015-10-29 13:37 ` Måns Rullgård
2015-10-29 13:37 ` Måns Rullgård
2015-10-29 13:31 ` Russell King - ARM Linux
2015-10-29 13:31 ` Russell King - ARM Linux
2015-10-29 14:32 ` Alexey Brodkin
2015-10-29 14:32 ` Alexey Brodkin
2015-10-29 17:09 ` Randy Dunlap
2015-10-29 17:09 ` Randy Dunlap
Reply instructions:
You may reply publicly to this message via plain-text email
using any one of the following methods:
* Save the following mbox file, import it into your mail client,
and reply-to-all from there: mbox
Avoid top-posting and favor interleaved quoting:
https://en.wikipedia.org/wiki/Posting_style#Interleaved_style
* Reply using the --to, --cc, and --in-reply-to
switches of git-send-email(1):
git send-email \
--in-reply-to=1446215318.4394.32.camel@synopsys.com \
--to=alexey.brodkin@synopsys.com \
--cc=linux-snps-arc@lists.infradead.org \
/path/to/YOUR_REPLY
https://kernel.org/pub/software/scm/git/docs/git-send-email.html
* If your mail client supports setting the In-Reply-To header
via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line
before the message body.
This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.