From mboxrd@z Thu Jan 1 00:00:00 1970
Return-Path:
Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand
	id S1758652AbbJ3O2s (ORCPT); Fri, 30 Oct 2015 10:28:48 -0400
Received: from us01smtprelay-2.synopsys.com ([198.182.60.111]:40239
	"EHLO smtprelay.synopsys.com" rhost-flags-OK-OK-OK-OK)
	by vger.kernel.org with ESMTP id S1751103AbbJ3O2q (ORCPT);
	Fri, 30 Oct 2015 10:28:46 -0400
From: Alexey Brodkin
To: "nicolas.pitre@linaro.org"
CC: "rmk+kernel@arm.linux.org.uk",
	"linux-kernel@vger.kernel.org",
	"Vineet.Gupta1@synopsys.com",
	"shemminger@linux-foundation.org",
	"mingo@elte.hu",
	"linux-snps-arc@lists.infradead.org",
	"davem@davemloft.net"
Subject: Re: [PATCH] __div64_32: implement division by multiplication for 32-bit arches
Thread-Topic: [PATCH] __div64_32: implement division by multiplication for 32-bit arches
Thread-Index: AQHREdKqJ9hBuINsYEu79ImC2YEdpZ6BfXmAgAGyGoCAANqMAA==
Date: Fri, 30 Oct 2015 14:28:39 +0000
Message-ID: <1446215318.4394.32.camel@synopsys.com>
References: <1446072455-16074-1-git-send-email-abrodkin@synopsys.com>
In-Reply-To:
Accept-Language: en-US, ru-RU
Content-Language: en-US
X-MS-Has-Attach:
X-MS-TNEF-Correlator:
x-originating-ip: [10.225.15.89]
Content-Type: text/plain; charset="utf-8"
Content-ID:
MIME-Version: 1.0
Sender: linux-kernel-owner@vger.kernel.org
List-ID:
X-Mailing-List: linux-kernel@vger.kernel.org
Content-Transfer-Encoding: 8bit
X-MIME-Autoconverted: from base64 to 8bit by mail.home.local id t9UESrLS020886

Hi Nicolas,

On Thu, 2015-10-29 at 21:26 -0400, Nicolas Pitre wrote:
> On Wed, 28 Oct 2015, Nicolas Pitre wrote:
> 
> > On Thu, 29 Oct 2015, Alexey Brodkin wrote:
> > 
> > > Fortunately we already have much better __div64_32() for 32-bit ARM.
> > > There in case of division by constant preprocessor calculates so-called
> > > "magic number" which is later used in multiplications instead of divisions.
> > 
> > It's not magic, it is science.
:-)

> > > It's really nice and very optimal but obviously works only for ARM
> > > because ARM assembly is involved.
> > > 
> > > Now why don't we extend the same approach to all other 32-bit arches
> > > with multiplication part implemented in pure C. With good compiler
> > > resulting assembly will be quite close to manually written assembly.
> 
> Well... not as close at least on ARM. Maybe 2x to 3x more costly than
> the one with assembly. Still better than 100x or so without this
> optimization.

Indeed, even making that function 25 times faster instead of 100 times
faster is already quite an achievement. For example, that alone will
cure my iperf performance degradation: I'll see do_div() taking < 1%
of CPU time instead of > 10%.

My test source was:

--------------------->8------------------------
int myfunc(u64 data)
{
	return do_div(data, 1000);
}
--------------------->8------------------------

Now take a look at the disassembly that I'm getting:

--------------------->8------------------------
0000062c <myfunc>:
 62c:	19 28 86 0f 4f 8d 3b df 	mpydu	r6,r0,0x8d4fdf3b
 634:	00 26 86 8f 4f 8d 3b df 	add.f	r6,r6,0x8d4fdf3b
 63c:	02 27 87 0f ed 7c 69 91 	sub	r7,r7,0x7ced9169
 644:	c0 27 65 00         	add.c	r7,r7,1
 648:	85 0e c4 71 12 83 97 6e 	brlo	0x83126e97,r7,6cc
 650:	75 0f 80 0f 12 83 97 6e 	breq	r7,0x83126e97,6c4
 658:	0d 70               	mov_s	r8,0
 65a:	2d 71               	mov_s	r9,1
 65c:	19 29 8a 0f 4f 8d 3b df 	mpydu	r10,r1,0x8d4fdf3b
 664:	00 27 84 82         	add.f	r4,r7,r10
 668:	ac 70               	mov_s	r5,0
 66a:	19 28 86 0f 12 83 97 6e 	mpydu	r6,r0,0x83126e97
 672:	01 25 c5 02         	adc	r5,r5,r11
 676:	00 24 04 82         	add.f	r4,r4,r8
 67a:	00 25 45 02         	add	r5,r5,r9
 67e:	c0 25 65 00         	add.c	r5,r5,1
 682:	00 26 06 81         	add.f	r6,r6,r4
 686:	01 27 47 01         	adc	r7,r7,r5
 68a:	51 0f 44 01         	brlo	r7,r5,6d8
 68e:	49 0d c0 01         	breq	r5,r7,6d4
 692:	8c 70               	mov_s	r4,0
 694:	ac 70               	mov_s	r5,0
 696:	e0 42               	mov_s	r2,r7
 698:	19 29 86 0f 12 83 97 6e 	mpydu	r6,r1,0x83126e97
 6a0:	00 22 82 81         	add.f	r2,r2,r6
 6a4:	6c 70               	mov_s	r3,0
 6a6:	01 23 c3 01         	adc	r3,r3,r7
 6aa:	00 22 02 81         	add.f	r2,r2,r4
 6ae:	a0 73               	add_s	r3,r3,r5
 6b0:	c0 23 65 00         	add.c	r3,r3,1
 6b4:	29 ba               	lsr_s	r2,r2,9
 6b6:	17 bb               	asl_s	r3,r3,23
 6b8:	65 7a               	or_s	r2,r2,r3
 6ba:	9a 22 0f 0a         	mpy	r2,r2,0x3e8
 6be:	e0 7f               	j_s.d	[blink]
 6c0:	42 78               	sub_s	r0,r0,r2
 6c2:	e0 78               	nop_s
 6c4:	95 0e 85 f1 4f 8d 3a df 	brhs.nt	0x8d4fdf3a,r6,658
 6cc:	0d 70               	mov_s	r8,0
 6ce:	91 07 ef ff         	b.d	65c
 6d2:	2d 70               	mov_s	r9,0
 6d4:	bf 0e 05 81         	brhs.nt	r6,r4,692
 6d8:	8c 70               	mov_s	r4,0
 6da:	bf 07 ef ff         	b.d	696
 6de:	ac 71               	mov_s	r5,1
--------------------->8------------------------

What you see here is a pretty straightforward conversion of the
"run-time calculations" to assembly.

Things to note:
 [1] Only 5 multiplications are used. That's because we have a 32x32
     multiplication unit that returns the 64-bit result in a register pair.
 [2] Indeed, lots of moves and additions happen here.

So my conclusion would be:
 [1] The proposed implementation makes perfect sense because it already
     speeds up do_div() significantly.
 [2] The ability to substitute the "run-time calculations" on a per-arch
     basis would be awesome because, with a few lines of assembly, another
     2-4x improvement could be achieved.

-Alexey
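
[Editor's note: the multiply-by-reciprocal technique discussed in this
thread can be sketched in plain C for the simpler 32-bit-dividend case.
This is a hypothetical user-space illustration, not the kernel's
__div64_32(): the helper name div1000 and the constant 0x10624dd3
(= ceil(2^38 / 1000)) are chosen for the example. The full 64-bit
do_div() case needs the longer multi-word multiply visible in the
disassembly above, but the shape is the same: one widening multiply
plus a shift in place of a divide.]

--------------------->8------------------------
#include <assert.h>
#include <stdint.h>
#include <stdio.h>

/*
 * Sketch of division by a constant via multiplication, 32-bit
 * dividend only. With M = ceil(2^38 / 1000) = 0x10624dd3 we have
 * M*1000 - 2^38 = 56, and 56 * n < 2^38 for every 32-bit n, so
 * floor(n * M / 2^38) == n / 1000 exactly across the whole range.
 */
static uint32_t div1000(uint32_t n)
{
	/* 32x32->64 widening multiply, then keep bits [63:38]. */
	return (uint32_t)(((uint64_t)n * 0x10624dd3u) >> 38);
}

int main(void)
{
	/* Spot-check against the hardware divide across the range. */
	for (uint64_t n = 0; n <= 0xffffffffu; n += 99991)
		assert(div1000((uint32_t)n) == (uint32_t)n / 1000);

	printf("%u\n", div1000(123456789));	/* prints 123456 */
	return 0;
}
--------------------->8------------------------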