From mboxrd@z Thu Jan 1 00:00:00 1970
Return-Path:
Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand
	id S1751844AbbJ2NFx (ORCPT );
	Thu, 29 Oct 2015 09:05:53 -0400
Received: from us01smtprelay-2.synopsys.com ([198.182.60.111]:41533 "EHLO
	smtprelay.synopsys.com" rhost-flags-OK-OK-OK-OK) by vger.kernel.org
	with ESMTP id S1750935AbbJ2NFw (ORCPT );
	Thu, 29 Oct 2015 09:05:52 -0400
From: Alexey Brodkin
To: "mans@mansr.com"
CC: "shemminger@linux-foundation.org", "linux-kernel@vger.kernel.org",
	"Vineet.Gupta1@synopsys.com", "linux-snps-arc@lists.infradead.org",
	"rmk+kernel@arm.linux.org.uk", "davem@davemloft.net",
	"mingo@elte.hu", "nico@cam.org"
Subject: Re: [PATCH] __div64_32: implement division by multiplication for 32-bit arches
Thread-Topic: Re: [PATCH] __div64_32: implement division by multiplication for 32-bit arches
Thread-Index: AQHREdKqJ9hBuINsYEu79ImC2YEdpZ6Cbaug///y9gA=
Date: Thu, 29 Oct 2015 13:05:44 +0000
Message-ID: <1446123944.3203.8.camel@synopsys.com>
References: <1446072455-16074-1-git-send-email-abrodkin@synopsys.com>
In-Reply-To:
Accept-Language: en-US, ru-RU
Content-Language: en-US
X-MS-Has-Attach:
X-MS-TNEF-Correlator:
x-originating-ip: [10.121.3.41]
Content-Type: text/plain; charset="utf-8"
Content-ID:
MIME-Version: 1.0
Sender: linux-kernel-owner@vger.kernel.org
List-ID:
X-Mailing-List: linux-kernel@vger.kernel.org
Content-Transfer-Encoding: 8bit
X-MIME-Autoconverted: from base64 to 8bit by mail.home.local id t9TD5wf4014188

Hi Måns,

On Thu, 2015-10-29 at 12:52 +0000, Måns Rullgård wrote:
> Alexey Brodkin writes:
> 
> > Existing default implementation of __div64_32() for 32-bit arches unfolds
> > into huge routine with tons of arithmetics like +, -, * and all of them
> > in loops. That leads to obvious performance degradation if do_div() is
> > frequently used.
> > 
> > Good example is extensive TCP/IP traffic.
> > That's what I'm getting with perf out of iperf3:
> > -------------->8--------------
> >     30.05%  iperf3  [kernel.kallsyms]  [k] copy_from_iter
> >     11.77%  iperf3  [kernel.kallsyms]  [k] __div64_32
> >      5.44%  iperf3  [kernel.kallsyms]  [k] memset
> >      5.32%  iperf3  [kernel.kallsyms]  [k] stmmac_xmit
> >      2.70%  iperf3  [kernel.kallsyms]  [k] skb_segment
> >      2.56%  iperf3  [kernel.kallsyms]  [k] tcp_ack
> > -------------->8--------------
> > 
> > do_div() here is mostly used in skb_mstamp_get() to convert nanoseconds
> > received from local_clock() to microseconds used in timestamp.
> > BTW conversion itself is as simple as "/=1000".
> > 
> > Fortunately we already have much better __div64_32() for 32-bit ARM.
> > There in case of division by constant preprocessor calculates so-called
> > "magic number" which is later used in multiplications instead of divisions.
> > It's really nice and very optimal but obviously works only for ARM
> > because ARM assembly is involved.
> > 
> > Now why don't we extend the same approach to all other 32-bit arches
> > with multiplication part implemented in pure C. With good compiler
> > resulting assembly will be quite close to manually written assembly.
> > 
> > And that change implements that.
> > 
> > But there's at least 1 problem which I don't know how to solve.
> > Preprocessor magic only happens if __div64_32() is inlined (that's
> > obvious - preprocessor has to know if divider is constant or not).
> > 
> > But __div64_32() is already marked as weak function (which in its turn
> > is required to allow some architectures to provide its own optimal
> > implementations). I.e. addition of "inline" for __div64_32() is not an
> > option.
> > 
> > So I do want to hear opinions on how to proceed with that patch.
> > Indeed there's the simplest solution - use this implementation only in
> > my architecture of preference (read ARC) but IMHO this change may
> > benefit other architectures as well.
> I tried something similar for MIPS a while ago after noticing a similar
> perf report. Adapting Nico's ARM code gave some nice speedups, but only
> when I used MIPS assembly for the long multiplies. Apparently gcc is
> still too stupid to do the sane thing.

Could you please elaborate a little on what the problem with gcc was
compared to the hand-written asm?

The point is that if the preprocessor does proper constant propagation,
the compiler only has to implement the calculations marked as "run-time
calculations", and those are pretty straightforward 32-bit additions and
multiplications.

At least on ARC, with that change perf no longer captures __div64_32()
during an iperf run, and the iperf results themselves improved by about
10%. So I'd say the advantage is quite noticeable.

-Alexey