All of lore.kernel.org
 help / color / mirror / Atom feed
From: Alexey.Brodkin@synopsys.com (Alexey Brodkin)
To: linux-snps-arc@lists.infradead.org
Subject: [PATCH] __div64_32: implement division by multiplication for 32-bit arches
Date: Thu, 29 Oct 2015 13:05:44 +0000	[thread overview]
Message-ID: <1446123944.3203.8.camel@synopsys.com> (raw)
In-Reply-To: <yw1xr3kddegu.fsf@unicorn.mansr.com>

Hi mans,

On Thu, 2015-10-29@12:52 +0000, M?ns Rullg?rd wrote:
> Alexey Brodkin <Alexey.Brodkin at synopsys.com> writes:
> 
> > Existing default implementation of __div64_32() for 32-bit arches unfolds
> > into huge routine with tons of arithmetics like +, -, * and all of them
> > in loops. That leads to obvious performance degradation if do_div() is
> > frequently used.
> > 
> > Good example is extensive TCP/IP traffic.
> > That's what I'm getting with perf out of iperf3:
> >  -------------->8--------------
> >     30.05%  iperf3   [kernel.kallsyms]        [k] copy_from_iter
> >     11.77%  iperf3   [kernel.kallsyms]        [k] __div64_32
> >      5.44%  iperf3   [kernel.kallsyms]        [k] memset
> >      5.32%  iperf3   [kernel.kallsyms]        [k] stmmac_xmit
> >      2.70%  iperf3   [kernel.kallsyms]        [k] skb_segment
> >      2.56%  iperf3   [kernel.kallsyms]        [k] tcp_ack
> >  -------------->8--------------
> > 
> > do_div() here is mostly used in skb_mstamp_get() to convert nanoseconds
> > received from local_clock() to microseconds used in timestamp.
> > BTW conversion itself is as simple as "/=1000".
> > 
> > Fortunately we already have much better __div64_32() for 32-bit ARM.
> > There in case of division by constant preprocessor calculates so-called
> > "magic number" which is later used in multiplications instead of divisions.
> > It's really nice and very optimal but obviously works only for ARM
> > because ARM assembly is involved.
> > 
> > Now why don't we extend the same approach to all other 32-bit arches
> > with multiplication part implemented in pure C. With good compiler
> > resulting assembly will be quite close to manually written assembly.
> > 
> > And that change implements that.
> > 
> > But there's at least 1 problem which I don't know how to solve.
> > Preprocessor magic only happens if __div64_32() is inlined (that's
> > obvious - preprocessor has to know if divider is constant or not).
> > 
> > But __div64_32() is already marked as weak function (which in its turn
> > is required to allow some architectures to provide its own optimal
> > implementations). I.e. addition of "inline" for __div64_32() is not an
> > option.
> > 
> > So I do want to hear opinions on how to proceed with that patch.
> > Indeed there's the simplest solution - use this implementation only in
> > my architecture of preference (read ARC) but IMHO this change may
> > benefit other architectures as well.
> 
> I tried something similar for MIPS a while ago after noticing a similar
> perf report.  Adapting Nico's ARM code gave some nice speedups, but only
> when I used MIPS assembly for the long multiplies.  Apparently gcc is
> still too stupid to do the sane thing.

Could you please elaborate a little bit on what was a problem with gcc
compared to hand-written asm?

The point is if preprocessor does proper constant propagation then compiler
will need to implement only calculations marked "run-time calculations".
And in its turn those are pretty straight-forward 32-bit + and *.

And at least on ARC I saw with that change perf no longer captures
__div64_32() during iperf and iperf results itself improved for about 10%.
So I'd say advantage is quite noticeable.

-Alexey

WARNING: multiple messages have this Message-ID (diff)
From: Alexey Brodkin <Alexey.Brodkin@synopsys.com>
To: "mans@mansr.com" <mans@mansr.com>
Cc: "shemminger@linux-foundation.org"
	<shemminger@linux-foundation.org>,
	"linux-kernel@vger.kernel.org" <linux-kernel@vger.kernel.org>,
	"Vineet.Gupta1@synopsys.com" <Vineet.Gupta1@synopsys.com>,
	"linux-snps-arc@lists.infradead.org" 
	<linux-snps-arc@lists.infradead.org>,
	"rmk+kernel@arm.linux.org.uk" <rmk+kernel@arm.linux.org.uk>,
	"davem@davemloft.net" <davem@davemloft.net>,
	"mingo@elte.hu" <mingo@elte.hu>, "nico@cam.org" <nico@cam.org>
Subject: Re: [PATCH] __div64_32: implement division by multiplication for 32-bit arches
Date: Thu, 29 Oct 2015 13:05:44 +0000	[thread overview]
Message-ID: <1446123944.3203.8.camel@synopsys.com> (raw)
In-Reply-To: <yw1xr3kddegu.fsf@unicorn.mansr.com>

[-- Warning: decoded text below may be mangled, UTF-8 assumed --]
[-- Attachment #1: Type: text/plain; charset="utf-8", Size: 3398 bytes --]

Hi mans,

On Thu, 2015-10-29 at 12:52 +0000, Måns Rullgård wrote:
> Alexey Brodkin <Alexey.Brodkin@synopsys.com> writes:
> 
> > Existing default implementation of __div64_32() for 32-bit arches unfolds
> > into huge routine with tons of arithmetics like +, -, * and all of them
> > in loops. That leads to obvious performance degradation if do_div() is
> > frequently used.
> > 
> > Good example is extensive TCP/IP traffic.
> > That's what I'm getting with perf out of iperf3:
> >  -------------->8--------------
> >     30.05%  iperf3   [kernel.kallsyms]        [k] copy_from_iter
> >     11.77%  iperf3   [kernel.kallsyms]        [k] __div64_32
> >      5.44%  iperf3   [kernel.kallsyms]        [k] memset
> >      5.32%  iperf3   [kernel.kallsyms]        [k] stmmac_xmit
> >      2.70%  iperf3   [kernel.kallsyms]        [k] skb_segment
> >      2.56%  iperf3   [kernel.kallsyms]        [k] tcp_ack
> >  -------------->8--------------
> > 
> > do_div() here is mostly used in skb_mstamp_get() to convert nanoseconds
> > received from local_clock() to microseconds used in timestamp.
> > BTW conversion itself is as simple as "/=1000".
> > 
> > Fortunately we already have much better __div64_32() for 32-bit ARM.
> > There in case of division by constant preprocessor calculates so-called
> > "magic number" which is later used in multiplications instead of divisions.
> > It's really nice and very optimal but obviously works only for ARM
> > because ARM assembly is involved.
> > 
> > Now why don't we extend the same approach to all other 32-bit arches
> > with multiplication part implemented in pure C. With good compiler
> > resulting assembly will be quite close to manually written assembly.
> > 
> > And that change implements that.
> > 
> > But there's at least 1 problem which I don't know how to solve.
> > Preprocessor magic only happens if __div64_32() is inlined (that's
> > obvious - preprocessor has to know if divider is constant or not).
> > 
> > But __div64_32() is already marked as weak function (which in its turn
> > is required to allow some architectures to provide its own optimal
> > implementations). I.e. addition of "inline" for __div64_32() is not an
> > option.
> > 
> > So I do want to hear opinions on how to proceed with that patch.
> > Indeed there's the simplest solution - use this implementation only in
> > my architecture of preference (read ARC) but IMHO this change may
> > benefit other architectures as well.
> 
> I tried something similar for MIPS a while ago after noticing a similar
> perf report.  Adapting Nico's ARM code gave some nice speedups, but only
> when I used MIPS assembly for the long multiplies.  Apparently gcc is
> still too stupid to do the sane thing.

Could you please elaborate a little bit on what was a problem with gcc
compared to hand-written asm?

The point is if preprocessor does proper constant propagation then compiler
will need to implement only calculations marked "run-time calculations".
And in its turn those are pretty straight-forward 32-bit + and *.

And at least on ARC I saw with that change perf no longer captures
__div64_32() during iperf and iperf results itself improved for about 10%.
So I'd say advantage is quite noticeable.

-Alexey
ÿôèº{.nÇ+‰·Ÿ®‰­†+%ŠËÿ±éݶ\x17¥Šwÿº{.nÇ+‰·¥Š{±þG«éÿŠ{ayº\x1dʇڙë,j\a­¢f£¢·hšïêÿ‘êçz_è®\x03(­éšŽŠÝ¢j"ú\x1a¶^[m§ÿÿ¾\a«þG«éÿ¢¸?™¨è­Ú&£ø§~á¶iO•æ¬z·švØ^\x14\x04\x1a¶^[m§ÿÿÃ\fÿ¶ìÿ¢¸?–I¥

  reply	other threads:[~2015-10-29 13:05 UTC|newest]

Thread overview: 48+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2015-10-28 22:47 [PATCH] __div64_32: implement division by multiplication for 32-bit arches Alexey Brodkin
2015-10-28 22:47 ` Alexey Brodkin
2015-10-28 23:32 ` Nicolas Pitre
2015-10-28 23:32   ` Nicolas Pitre
2015-10-29  7:34   ` Alexey Brodkin
2015-10-29  7:34     ` Alexey Brodkin
2015-10-30  1:26   ` Nicolas Pitre
2015-10-30  1:26     ` Nicolas Pitre
2015-10-30  5:41     ` Vineet Gupta
2015-10-30  5:41       ` Vineet Gupta
2015-10-30 12:41       ` Måns Rullgård
2015-10-30 12:41         ` Måns Rullgård
2015-10-30 12:40     ` Måns Rullgård
2015-10-30 12:40       ` Måns Rullgård
2015-10-30 15:17       ` Nicolas Pitre
2015-10-30 15:17         ` Nicolas Pitre
2015-10-30 15:54         ` Alexey Brodkin
2015-10-30 15:54           ` Alexey Brodkin
2015-10-30 16:55           ` Nicolas Pitre
2015-10-30 16:55             ` Nicolas Pitre
2015-10-30 17:45             ` Måns Rullgård
2015-10-30 17:45               ` Måns Rullgård
2015-11-04 23:46               ` Nicolas Pitre
2015-11-04 23:46                 ` Nicolas Pitre
2015-11-04 23:48                 ` Nicolas Pitre
2015-11-04 23:48                   ` Nicolas Pitre
2015-11-05  3:13                   ` Vineet Gupta
2015-11-05  3:13                     ` Vineet Gupta
2015-11-05  5:06                     ` Nicolas Pitre
2015-11-05  5:06                       ` Nicolas Pitre
2015-11-04 23:49                 ` Måns Rullgård
2015-11-04 23:49                   ` Måns Rullgård
2015-10-30 14:28     ` Alexey Brodkin
2015-10-30 14:28       ` Alexey Brodkin
2015-10-29  0:36 ` kbuild test robot
2015-10-29  0:36   ` kbuild test robot
2015-10-29 12:52 ` Måns Rullgård
2015-10-29 12:52   ` Måns Rullgård
2015-10-29 13:05   ` Alexey Brodkin [this message]
2015-10-29 13:05     ` Alexey Brodkin
2015-10-29 13:37     ` Måns Rullgård
2015-10-29 13:37       ` Måns Rullgård
2015-10-29 13:31 ` Russell King - ARM Linux
2015-10-29 13:31   ` Russell King - ARM Linux
2015-10-29 14:32   ` Alexey Brodkin
2015-10-29 14:32     ` Alexey Brodkin
2015-10-29 17:09 ` Randy Dunlap
2015-10-29 17:09   ` Randy Dunlap

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=1446123944.3203.8.camel@synopsys.com \
    --to=alexey.brodkin@synopsys.com \
    --cc=linux-snps-arc@lists.infradead.org \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line before the message body.
This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.