* [PATCH] __div64_32: implement division by multiplication for 32-bit arches
@ 2015-10-28 22:47 Alexey Brodkin
2015-10-28 23:32 ` Nicolas Pitre
` (4 more replies)
0 siblings, 5 replies; 24+ messages in thread
From: Alexey Brodkin @ 2015-10-28 22:47 UTC (permalink / raw)
To: linux-kernel
Cc: Alexey Brodkin, linux-snps-arc, Vineet Gupta, Ingo Molnar,
Stephen Hemminger, David S. Miller, Nicolas Pitre, Russell King
The existing default implementation of __div64_32() for 32-bit arches unfolds
into a huge routine with lots of arithmetic (+, -, *), all of it in loops.
That leads to obvious performance degradation if do_div() is used frequently.
A good example is heavy TCP/IP traffic.
That's what I'm getting with perf out of iperf3:
-------------->8--------------
30.05% iperf3 [kernel.kallsyms] [k] copy_from_iter
11.77% iperf3 [kernel.kallsyms] [k] __div64_32
5.44% iperf3 [kernel.kallsyms] [k] memset
5.32% iperf3 [kernel.kallsyms] [k] stmmac_xmit
2.70% iperf3 [kernel.kallsyms] [k] skb_segment
2.56% iperf3 [kernel.kallsyms] [k] tcp_ack
-------------->8--------------
do_div() here is mostly used in skb_mstamp_get() to convert nanoseconds
received from local_clock() to the microseconds used in timestamps.
BTW the conversion itself is as simple as "/= 1000".
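For reference, here is a userspace sketch of that hot-path conversion
(do_div_sketch() is my own stand-in for the kernel's do_div() macro, which
divides in place and returns the remainder; not the kernel code itself):

```c
#include <stdint.h>

/* Userspace stand-in for the kernel's do_div() macro: divide the
 * 64-bit value in place by a 32-bit divisor, return the remainder. */
static uint32_t do_div_sketch(uint64_t *n, uint32_t base)
{
	uint32_t rem = (uint32_t)(*n % base);

	*n /= base;
	return rem;
}

/* skb_mstamp_get()-style conversion: nanoseconds -> microseconds */
static uint64_t ns_to_us(uint64_t ns)
{
	do_div_sketch(&ns, 1000);	/* the "/= 1000" mentioned above */
	return ns;
}
```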
Fortunately we already have a much better __div64_32() for 32-bit ARM.
There, in case of division by a constant, the preprocessor calculates a
so-called "magic number" which is later used in multiplications instead
of divisions. It's really nice and close to optimal, but obviously it
works only for ARM because ARM assembly is involved.
Now why don't we extend the same approach to all other 32-bit arches,
with the multiplication part implemented in pure C? With a good compiler
the resulting assembly will be quite close to hand-written assembly.
And this change implements exactly that.
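To illustrate the idea (my own minimal sketch, not the patch itself): for a
32-bit dividend and the constant divisor 1000 from above, one multiply and
shift with a precomputed reciprocal replaces the division. The constant
0x10624DD3 is ceil(2^38 / 1000); the patch applies the same trick to full
64-bit dividends, which is what needs the partial products in the diff below:

```c
#include <stdint.h>

/* Division by a constant via a precomputed reciprocal:
 * n / 1000 == (n * m) >> 38 for all 32-bit n,
 * where m = ceil(2^38 / 1000) = 0x10624DD3. */
static uint32_t div1000_by_mul(uint32_t n)
{
	return (uint32_t)(((uint64_t)n * 0x10624DD3u) >> 38);
}
```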
But there's at least one problem which I don't know how to solve.
The preprocessor magic only happens if __div64_32() is inlined (that's
obvious - the preprocessor has to know whether the divisor is constant
or not). But __div64_32() is already marked as a weak function (which in
turn is required to allow some architectures to provide their own optimal
implementations). I.e. adding "inline" to __div64_32() is not an option.
So I do want to hear opinions on how to proceed with this patch.
Indeed, there's the simplest solution - use this implementation only in
my architecture of preference (read: ARC) - but IMHO this change may
benefit other architectures as well.
Signed-off-by: Alexey Brodkin <abrodkin@synopsys.com>
Cc: linux-snps-arc@lists.infradead.org
Cc: Vineet Gupta <vgupta@synopsys.com>
Cc: Ingo Molnar <mingo@elte.hu>
Cc: Stephen Hemminger <shemminger@linux-foundation.org>
Cc: David S. Miller <davem@davemloft.net>
Cc: Nicolas Pitre <nico@cam.org>
Cc: Russell King <rmk+kernel@arm.linux.org.uk>
---
lib/div64.c | 153 ++++++++++++++++++++++++++++++++++++++++++++++++++----------
1 file changed, 128 insertions(+), 25 deletions(-)
diff --git a/lib/div64.c b/lib/div64.c
index 62a698a..3055328 100644
--- a/lib/div64.c
+++ b/lib/div64.c
@@ -23,37 +23,140 @@
/* Not needed on 64bit architectures */
#if BITS_PER_LONG == 32
+/* our own fls implementation to make sure constant propagation is fine */
+inline int __div64_fls(int bits)
+{
+ unsigned int __left = bits, __nr = 0;
+
+ if (__left & 0xffff0000)
+ __nr += 16, __left >>= 16;
+
+ if (__left & 0x0000ff00)
+ __nr += 8, __left >>= 8;
+
+ if (__left & 0x000000f0)
+ __nr += 4, __left >>= 4;
+
+ if (__left & 0x0000000c)
+ __nr += 2, __left >>= 2;
+
+ if (__left & 0x00000002)
+ __nr += 1;
+
+ return __nr;
+}
+
+/*
+ * If the divisor happens to be constant, we determine the appropriate
+ * inverse at compile time to turn the division into a few inline
+ * multiplications instead which is much faster.
+ */
uint32_t __attribute__((weak)) __div64_32(uint64_t *n, uint32_t base)
{
- uint64_t rem = *n;
- uint64_t b = base;
- uint64_t res, d = 1;
- uint32_t high = rem >> 32;
-
- /* Reduce the thing a bit first */
- res = 0;
- if (high >= base) {
- high /= base;
- res = (uint64_t) high << 32;
- rem -= (uint64_t) (high*base) << 32;
- }
+ unsigned int __r, __b = base;
- while ((int64_t)b > 0 && b < rem) {
- b = b+b;
- d = d+d;
- }
+ if (!__builtin_constant_p(__b) || __b == 0) {
+ /* non-constant divisor (or zero): slow path */
+ uint64_t rem = *n;
+ uint64_t b = base;
+ uint64_t res, d = 1;
+ uint32_t high = rem >> 32;
+
+ /* Reduce the thing a bit first */
+ res = 0;
+ if (high >= base) {
+ high /= base;
+ res = (uint64_t) high << 32;
+ rem -= (uint64_t) (high*base) << 32;
+ }
+
+ while ((int64_t)b > 0 && b < rem) {
+ b = b+b;
+ d = d+d;
+ }
+
+ do {
+ if (rem >= b) {
+ rem -= b;
+ res += d;
+ }
+ b >>= 1;
+ d >>= 1;
+ } while (d);
- do {
- if (rem >= b) {
- rem -= b;
- res += d;
+ *n = res;
+ __r = rem;
+ } else if ((__b & (__b - 1)) == 0) {
+ /*
+ * Trivial: __b is a constant power of 2.
+ * gcc does the right thing with this code.
+ * Even though the generic path below handles
+ * this case as well, we keep it visually
+ * separate. Only one of these branches will
+ * survive the pre-processor stage anyway,
+ * so let's leave it here.
+ */
+ __r = *n;
+ __r &= (__b - 1);
+ *n /= __b;
+ } else {
+ /* Start of preprocessor calculations */
+
+ /*
+ * Multiply by inverse of __b: *n/b = *n*(p/b)/p
+ * We rely on the fact that most of this code gets
+ * optimized away at compile time due to constant
+ * propagation and only a couple inline assembly
+ * instructions should remain. Better avoid any
+ * code construct that might prevent that.
+ */
+ unsigned long long __res, __x, __t, __m, __n = *n;
+ unsigned int __p;
+ /* preserve low part of *n for remainder computation */
+ __r = __n;
+ /* determine number of bits to represent __b */
+ __p = 1 << __div64_fls(__b);
+ /* compute __m = ((__p << 64) + __b - 1) / __b */
+ __m = (~0ULL / __b) * __p;
+ __m += (((~0ULL % __b + 1) * __p) + __b - 1) / __b;
+ /* compute __res = __m*(~0ULL/__b*__b-1)/(__p << 64) */
+ __x = ~0ULL / __b * __b - 1;
+ __res = (__m & 0xffffffff) * (__x & 0xffffffff);
+ __res >>= 32;
+ __res += (__m & 0xffffffff) * (__x >> 32);
+ __t = __res;
+ __res += (__x & 0xffffffff) * (__m >> 32);
+ __t = (__res < __t) ? (1ULL << 32) : 0;
+ __res = (__res >> 32) + __t;
+ __res += (__m >> 32) * (__x >> 32);
+ __res /= __p;
+ /* End of preprocessor calculations */
+
+ /* Start of run-time calculations */
+ __res = (unsigned int)__m * (unsigned int)__n;
+ __res >>= 32;
+ __res += (unsigned int)__m * (__n >> 32);
+ __t = __res;
+ __res += (unsigned int)__n * (__m >> 32);
+ __t = (__res < __t) ? (1ULL << 32) : 0;
+ __res = (__res >> 32) + __t;
+ __res += (__m >> 32) * (__n >> 32);
+ __res /= __p;
+
+ /*
+ * The remainder can be computed with 32-bit regs
+ * only, and gcc is good at that.
+ */
+ {
+ unsigned int __res0 = __res;
+ unsigned int __b0 = __b;
+
+ __r -= __res0 * __b0;
}
- b >>= 1;
- d >>= 1;
- } while (d);
+ /* End of run-time calculations */
- *n = res;
- return rem;
+ *n = __res;
+ }
+ return __r;
}
EXPORT_SYMBOL(__div64_32);
--
2.4.3
^ permalink raw reply related [flat|nested] 24+ messages in thread
* Re: [PATCH] __div64_32: implement division by multiplication for 32-bit arches
2015-10-28 22:47 [PATCH] __div64_32: implement division by multiplication for 32-bit arches Alexey Brodkin
@ 2015-10-28 23:32 ` Nicolas Pitre
2015-10-29 7:34 ` Alexey Brodkin
2015-10-30 1:26 ` Nicolas Pitre
2015-10-29 0:36 ` kbuild test robot
` (3 subsequent siblings)
4 siblings, 2 replies; 24+ messages in thread
From: Nicolas Pitre @ 2015-10-28 23:32 UTC (permalink / raw)
To: Alexey Brodkin
Cc: linux-kernel, linux-snps-arc, Vineet Gupta, Ingo Molnar,
Stephen Hemminger, David S. Miller, Russell King
On Thu, 29 Oct 2015, Alexey Brodkin wrote:
> Fortunately we already have much better __div64_32() for 32-bit ARM.
> There in case of division by constant preprocessor calculates so-called
> "magic number" which is later used in multiplications instead of divisions.
It's not magic, it is science. :-)
> It's really nice and very optimal but obviously works only for ARM
> because ARM assembly is involved.
>
> Now why don't we extend the same approach to all other 32-bit arches
> with multiplication part implemented in pure C. With good compiler
> resulting assembly will be quite close to manually written assembly.
You appear to have left out optimizations where there is no overflow to
carry. That, too, can be determined at compile time.
> But there's at least 1 problem which I don't know how to solve.
> Preprocessor magic only happens if __div64_32() is inlined (that's
> obvious - preprocessor has to know if divider is constant or not).
>
> But __div64_32() is already marked as weak function (which in its turn
> is required to allow some architectures to provide its own optimal
> implementations). I.e. addition of "inline" for __div64_32() is not an
> option.
You can't inline __div64_32(). It should remain as is and used only for
the slow path.
For the constant based optimization to work, you need to modify do_div()
in include/asm-generic/div64.h directly.
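A rough userspace sketch of that shape (names here are hypothetical; the
real macro lives in include/asm-generic/div64.h and uses a GNU statement
expression the same way, and slow_div64_32() stands in for the weak
out-of-line __div64_32()):

```c
#include <stdint.h>

/* Out-of-line slow path; stands in for the weak __div64_32(). */
static uint32_t slow_div64_32(uint64_t *n, uint32_t base)
{
	uint32_t rem = (uint32_t)(*n % base);

	*n /= base;
	return rem;
}

/* The dispatch has to sit in the macro expanded at every call site:
 * only there can __builtin_constant_p() see the divisor. This sketch
 * handles only the constant power-of-2 case; a full version would add
 * the reciprocal-multiply case for other constant divisors. */
#define do_div_dispatch(n, base) ({				\
	uint32_t __base = (base), __rem;			\
	if (__builtin_constant_p(__base) && __base != 0 &&	\
	    (__base & (__base - 1)) == 0) {			\
		/* constant power of 2: mask + shift */		\
		__rem = (n) & (__base - 1);			\
		(n) >>= __builtin_ctz(__base);			\
	} else {						\
		/* everything else: out-of-line slow path */	\
		__rem = slow_div64_32(&(n), __base);		\
	}							\
	__rem;							\
})
```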
> So I do want to hear opinions on how to proceed with that patch.
> Indeed there's the simplest solution - use this implementation only in
> my architecture of preference (read ARC) but IMHO this change may
> benefit other architectures as well.
>
> Signed-off-by: Alexey Brodkin <abrodkin@synopsys.com>
> Cc: linux-snps-arc@lists.infradead.org
> Cc: Vineet Gupta <vgupta@synopsys.com>
> Cc: Ingo Molnar <mingo@elte.hu>
> Cc: Stephen Hemminger <shemminger@linux-foundation.org>
> Cc: David S. Miller <davem@davemloft.net>
> Cc: Nicolas Pitre <nico@cam.org>
This email address has been unused for the last 7 years. Please update
your reference.
> Cc: Russell King <rmk+kernel@arm.linux.org.uk>
> ---
> lib/div64.c | 153 ++++++++++++++++++++++++++++++++++++++++++++++++++----------
> 1 file changed, 128 insertions(+), 25 deletions(-)
>
> diff --git a/lib/div64.c b/lib/div64.c
> index 62a698a..3055328 100644
> --- a/lib/div64.c
> +++ b/lib/div64.c
> @@ -23,37 +23,140 @@
> /* Not needed on 64bit architectures */
> #if BITS_PER_LONG == 32
>
> +/* our own fls implementation to make sure constant propagation is fine */
> +inline int __div64_fls(int bits)
> +{
> + unsigned int __left = bits, __nr = 0;
> +
> + if (__left & 0xffff0000)
> + __nr += 16, __left >>= 16;
> +
> + if (__left & 0x0000ff00)
> + __nr += 8, __left >>= 8;
> +
> + if (__left & 0x000000f0)
> + __nr += 4, __left >>= 4;
> +
> + if (__left & 0x0000000c)
> + __nr += 2, __left >>= 2;
> +
> + if (__left & 0x00000002)
> + __nr += 1;
> +
> + return __nr;
> +}
The regular fls implementation should already give you a constant result
if provided with a constant input. To be sure you could use:
__p = 1 << __fls(__b);
BUILD_BUG_ON(!__builtin_constant_p(__p));
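Indeed, a __builtin_clz-based helper (which is how many arches implement
__fls()) folds to a constant just as well; a hypothetical userspace sketch:

```c
/* 0-based index of the highest set bit, same semantics as the
 * kernel's __fls() for nonzero 32-bit values; gcc folds this to
 * a constant whenever x is one. */
static inline int __fls_sketch(unsigned int x)
{
	return 31 - __builtin_clz(x);
}
```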
> +/*
> + * If the divisor happens to be constant, we determine the appropriate
> + * inverse at compile time to turn the division into a few inline
> + * multiplications instead which is much faster.
> + */
> uint32_t __attribute__((weak)) __div64_32(uint64_t *n, uint32_t base)
> {
> - uint64_t rem = *n;
> - uint64_t b = base;
> - uint64_t res, d = 1;
> - uint32_t high = rem >> 32;
> -
> - /* Reduce the thing a bit first */
> - res = 0;
> - if (high >= base) {
> - high /= base;
> - res = (uint64_t) high << 32;
> - rem -= (uint64_t) (high*base) << 32;
> - }
> + unsigned int __r, __b = base;
>
> - while ((int64_t)b > 0 && b < rem) {
> - b = b+b;
> - d = d+d;
> - }
> + if (!__builtin_constant_p(__b) || __b == 0) {
> + /* non-constant divisor (or zero): slow path */
> + uint64_t rem = *n;
> + uint64_t b = base;
> + uint64_t res, d = 1;
> + uint32_t high = rem >> 32;
> +
> + /* Reduce the thing a bit first */
> + res = 0;
> + if (high >= base) {
> + high /= base;
> + res = (uint64_t) high << 32;
> + rem -= (uint64_t) (high*base) << 32;
> + }
> +
> + while ((int64_t)b > 0 && b < rem) {
> + b = b+b;
> + d = d+d;
> + }
> +
> + do {
> + if (rem >= b) {
> + rem -= b;
> + res += d;
> + }
> + b >>= 1;
> + d >>= 1;
> + } while (d);
>
> - do {
> - if (rem >= b) {
> - rem -= b;
> - res += d;
> + *n = res;
> + __r = rem;
> + } else if ((__b & (__b - 1)) == 0) {
> + /*
> + * Trivial: __b is a constant power of 2.
> + * gcc does the right thing with this code.
> + * Even though the generic path below handles
> + * this case as well, we keep it visually
> + * separate. Only one of these branches will
> + * survive the pre-processor stage anyway,
> + * so let's leave it here.
> + */
> + __r = *n;
> + __r &= (__b - 1);
> + *n /= __b;
> + } else {
> + /* Start of preprocessor calculations */
> +
> + /*
> + * Multiply by inverse of __b: *n/b = *n*(p/b)/p
> + * We rely on the fact that most of this code gets
> + * optimized away at compile time due to constant
> + * propagation and only a couple inline assembly
> + * instructions should remain. Better avoid any
> + * code construct that might prevent that.
> + */
> + unsigned long long __res, __x, __t, __m, __n = *n;
> + unsigned int __p;
> + /* preserve low part of *n for remainder computation */
> + __r = __n;
> + /* determine number of bits to represent __b */
> + __p = 1 << __div64_fls(__b);
> + /* compute __m = ((__p << 64) + __b - 1) / __b */
> + __m = (~0ULL / __b) * __p;
> + __m += (((~0ULL % __b + 1) * __p) + __b - 1) / __b;
> + /* compute __res = __m*(~0ULL/__b*__b-1)/(__p << 64) */
> + __x = ~0ULL / __b * __b - 1;
> + __res = (__m & 0xffffffff) * (__x & 0xffffffff);
> + __res >>= 32;
> + __res += (__m & 0xffffffff) * (__x >> 32);
> + __t = __res;
> + __res += (__x & 0xffffffff) * (__m >> 32);
> + __t = (__res < __t) ? (1ULL << 32) : 0;
> + __res = (__res >> 32) + __t;
> + __res += (__m >> 32) * (__x >> 32);
> + __res /= __p;
> + /* End of preprocessor calculations */
> +
> + /* Start of run-time calculations */
> + __res = (unsigned int)__m * (unsigned int)__n;
> + __res >>= 32;
> + __res += (unsigned int)__m * (__n >> 32);
> + __t = __res;
> + __res += (unsigned int)__n * (__m >> 32);
> + __t = (__res < __t) ? (1ULL << 32) : 0;
> + __res = (__res >> 32) + __t;
> + __res += (__m >> 32) * (__n >> 32);
> + __res /= __p;
> +
> + /*
> + * The remainder can be computed with 32-bit regs
> + * only, and gcc is good at that.
> + */
> + {
> + unsigned int __res0 = __res;
> + unsigned int __b0 = __b;
> +
> + __r -= __res0 * __b0;
> }
> - b >>= 1;
> - d >>= 1;
> - } while (d);
> + /* End of run-time calculations */
>
> - *n = res;
> - return rem;
> + *n = __res;
> + }
> + return __r;
> }
>
> EXPORT_SYMBOL(__div64_32);
> --
> 2.4.3
>
> --
> To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
> the body of a message to majordomo@vger.kernel.org
> More majordomo info at http://vger.kernel.org/majordomo-info.html
> Please read the FAQ at http://www.tux.org/lkml/
>
>
^ permalink raw reply [flat|nested] 24+ messages in thread
* Re: [PATCH] __div64_32: implement division by multiplication for 32-bit arches
2015-10-28 22:47 [PATCH] __div64_32: implement division by multiplication for 32-bit arches Alexey Brodkin
2015-10-28 23:32 ` Nicolas Pitre
@ 2015-10-29 0:36 ` kbuild test robot
2015-10-29 12:52 ` Måns Rullgård
` (2 subsequent siblings)
4 siblings, 0 replies; 24+ messages in thread
From: kbuild test robot @ 2015-10-29 0:36 UTC (permalink / raw)
To: Alexey Brodkin
Cc: kbuild-all, linux-kernel, Alexey Brodkin, linux-snps-arc,
Vineet Gupta, Ingo Molnar, Stephen Hemminger, David S. Miller,
Nicolas Pitre, Russell King
Hi Alexey,
[auto build test ERROR on v4.3-rc7 -- if it's inappropriate base, please suggest rules for selecting the more suitable base]
url: https://github.com/0day-ci/linux/commits/Alexey-Brodkin/__div64_32-implement-division-by-multiplication-for-32-bit-arches/20151029-065010
config: arm-mini2440_defconfig (attached as .config)
reproduce:
wget https://git.kernel.org/cgit/linux/kernel/git/wfg/lkp-tests.git/plain/sbin/make.cross -O ~/bin/make.cross
chmod +x ~/bin/make.cross
# save the attached .config to linux build tree
make.cross ARCH=arm
All error/warnings (new ones prefixed by >>):
In file included from include/linux/kernel.h:136:0,
from lib/div64.c:20:
>> arch/arm/include/asm/div64.h:215:2: error: expected identifier or '(' before '{' token
({ \
^
>> lib/div64.c:27:12: note: in expansion of macro '__div64_fls'
inline int __div64_fls(int bits)
^
vim +/__div64_fls +27 lib/div64.c
14 * Code generated for this function might be very inefficient
15 * for some CPUs. __div64_32() can be overridden by linking arch-specific
16 * assembly versions such as arch/ppc/lib/div64.S and arch/sh/lib/div64.S.
17 */
18
19 #include <linux/export.h>
> 20 #include <linux/kernel.h>
21 #include <linux/math64.h>
22
23 /* Not needed on 64bit architectures */
24 #if BITS_PER_LONG == 32
25
26 /* our own fls implementation to make sure constant propagation is fine */
> 27 inline int __div64_fls(int bits)
28 {
29 unsigned int __left = bits, __nr = 0;
30
---
0-DAY kernel test infrastructure Open Source Technology Center
https://lists.01.org/pipermail/kbuild-all Intel Corporation
[-- Attachment #2: .config.gz --]
[-- Type: application/octet-stream, Size: 19049 bytes --]
^ permalink raw reply [flat|nested] 24+ messages in thread
* Re: [PATCH] __div64_32: implement division by multiplication for 32-bit arches
2015-10-28 23:32 ` Nicolas Pitre
@ 2015-10-29 7:34 ` Alexey Brodkin
2015-10-30 1:26 ` Nicolas Pitre
1 sibling, 0 replies; 24+ messages in thread
From: Alexey Brodkin @ 2015-10-29 7:34 UTC (permalink / raw)
To: nicolas.pitre@linaro.org
Cc: rmk+kernel@arm.linux.org.uk, linux-kernel@vger.kernel.org,
Vineet.Gupta1@synopsys.com, shemminger@linux-foundation.org,
mingo@elte.hu, linux-snps-arc@lists.infradead.org,
davem@davemloft.net
Hi Nicolas,
On Wed, 2015-10-28 at 19:32 -0400, Nicolas Pitre wrote:
> On Thu, 29 Oct 2015, Alexey Brodkin wrote:
>
> > Fortunately we already have much better __div64_32() for 32-bit ARM.
> > There in case of division by constant preprocessor calculates so-called
> > "magic number" which is later used in multiplications instead of divisions.
>
> It's not magic, it is science. :-)
Indeed, but I was under the impression that's what people call that value
in this particular case, so to me it looks appropriate here.
> > It's really nice and very optimal but obviously works only for ARM
> > because ARM assembly is involved.
> >
> > Now why don't we extend the same approach to all other 32-bit arches
> > with multiplication part implemented in pure C. With good compiler
> > resulting assembly will be quite close to manually written assembly.
>
> You appear to have left out optimizations where there is no overflow to
> carry. That, too, can be determined at compile time.
That might be the case - let me look into it a bit more.
But that was not the biggest problem. I actually wanted to send this
as an RFC, but due to a last-minute change I ran "git format-patch -1"
and forgot to change the subject prefix from PATCH to RFC.
> > But there's at least 1 problem which I don't know how to solve.
> > Preprocessor magic only happens if __div64_32() is inlined (that's
> > obvious - preprocessor has to know if divider is constant or not).
> >
> > But __div64_32() is already marked as weak function (which in its turn
> > is required to allow some architectures to provide its own optimal
> > implementations). I.e. addition of "inline" for __div64_32() is not an
> > option.
>
> You can't inline __div64_32(). It should remain as is and used only for
> the slow path.
>
> For the constant based optimization to work, you need to modify do_div()
> in include/asm-generic/div64.h directly.
I thought about that, but if I replace the existing implementation of
do_div() with the one proposed here, some arches like SH and MIPS won't
be able to use their own __div64_32() in do_div() any longer.
So how do we deal with that then?
> > So I do want to hear opinions on how to proceed with that patch.
> > Indeed there's the simplest solution - use this implementation only in
> > my architecture of preference (read ARC) but IMHO this change may
> > benefit other architectures as well.
> >
> > Signed-off-by: Alexey Brodkin <abrodkin@synopsys.com>
> > Cc: linux-snps-arc@lists.infradead.org
> > Cc: Vineet Gupta <vgupta@synopsys.com>
> > Cc: Ingo Molnar <mingo@elte.hu>
> > Cc: Stephen Hemminger <shemminger@linux-foundation.org>
> > Cc: David S. Miller <davem@davemloft.net>
> > Cc: Nicolas Pitre <nico@cam.org>
>
> This email address has been unused for the last 7 years. Please update
> your reference.
My bad - I blindly took that email address from your prehistoric patch
"[ARM] 3611/4: optimize do_div() when divisor is constant".
-Alexey
^ permalink raw reply [flat|nested] 24+ messages in thread
* Re: [PATCH] __div64_32: implement division by multiplication for 32-bit arches
2015-10-28 22:47 [PATCH] __div64_32: implement division by multiplication for 32-bit arches Alexey Brodkin
2015-10-28 23:32 ` Nicolas Pitre
2015-10-29 0:36 ` kbuild test robot
@ 2015-10-29 12:52 ` Måns Rullgård
2015-10-29 13:05 ` Alexey Brodkin
2015-10-29 13:31 ` Russell King - ARM Linux
2015-10-29 17:09 ` Randy Dunlap
4 siblings, 1 reply; 24+ messages in thread
From: Måns Rullgård @ 2015-10-29 12:52 UTC (permalink / raw)
To: Alexey Brodkin
Cc: linux-kernel, linux-snps-arc, Vineet Gupta, Ingo Molnar,
Stephen Hemminger, David S. Miller, Nicolas Pitre, Russell King
Alexey Brodkin <Alexey.Brodkin@synopsys.com> writes:
> Existing default implementation of __div64_32() for 32-bit arches unfolds
> into huge routine with tons of arithmetics like +, -, * and all of them
> in loops. That leads to obvious performance degradation if do_div() is
> frequently used.
>
> Good example is extensive TCP/IP traffic.
> That's what I'm getting with perf out of iperf3:
> -------------->8--------------
> 30.05% iperf3 [kernel.kallsyms] [k] copy_from_iter
> 11.77% iperf3 [kernel.kallsyms] [k] __div64_32
> 5.44% iperf3 [kernel.kallsyms] [k] memset
> 5.32% iperf3 [kernel.kallsyms] [k] stmmac_xmit
> 2.70% iperf3 [kernel.kallsyms] [k] skb_segment
> 2.56% iperf3 [kernel.kallsyms] [k] tcp_ack
> -------------->8--------------
>
> do_div() here is mostly used in skb_mstamp_get() to convert nanoseconds
> received from local_clock() to microseconds used in timestamp.
> BTW conversion itself is as simple as "/=1000".
>
> Fortunately we already have much better __div64_32() for 32-bit ARM.
> There in case of division by constant preprocessor calculates so-called
> "magic number" which is later used in multiplications instead of divisions.
> It's really nice and very optimal but obviously works only for ARM
> because ARM assembly is involved.
>
> Now why don't we extend the same approach to all other 32-bit arches
> with multiplication part implemented in pure C. With good compiler
> resulting assembly will be quite close to manually written assembly.
>
> And that change implements that.
>
> But there's at least 1 problem which I don't know how to solve.
> Preprocessor magic only happens if __div64_32() is inlined (that's
> obvious - preprocessor has to know if divider is constant or not).
>
> But __div64_32() is already marked as weak function (which in its turn
> is required to allow some architectures to provide its own optimal
> implementations). I.e. addition of "inline" for __div64_32() is not an
> option.
>
> So I do want to hear opinions on how to proceed with that patch.
> Indeed there's the simplest solution - use this implementation only in
> my architecture of preference (read ARC) but IMHO this change may
> benefit other architectures as well.
I tried something similar for MIPS a while ago after noticing a similar
perf report. Adapting Nico's ARM code gave some nice speedups, but only
when I used MIPS assembly for the long multiplies. Apparently gcc is
still too stupid to do the sane thing.
--
Måns Rullgård
mans@mansr.com
^ permalink raw reply [flat|nested] 24+ messages in thread
* Re: [PATCH] __div64_32: implement division by multiplication for 32-bit arches
2015-10-29 12:52 ` Måns Rullgård
@ 2015-10-29 13:05 ` Alexey Brodkin
2015-10-29 13:37 ` Måns Rullgård
0 siblings, 1 reply; 24+ messages in thread
From: Alexey Brodkin @ 2015-10-29 13:05 UTC (permalink / raw)
To: mans@mansr.com
Cc: shemminger@linux-foundation.org, linux-kernel@vger.kernel.org,
Vineet.Gupta1@synopsys.com, linux-snps-arc@lists.infradead.org,
rmk+kernel@arm.linux.org.uk, davem@davemloft.net, mingo@elte.hu,
nico@cam.org
Hi Måns,
On Thu, 2015-10-29 at 12:52 +0000, Måns Rullgård wrote:
> Alexey Brodkin <Alexey.Brodkin@synopsys.com> writes:
>
> > Existing default implementation of __div64_32() for 32-bit arches unfolds
> > into huge routine with tons of arithmetics like +, -, * and all of them
> > in loops. That leads to obvious performance degradation if do_div() is
> > frequently used.
> >
> > Good example is extensive TCP/IP traffic.
> > That's what I'm getting with perf out of iperf3:
> > -------------->8--------------
> > 30.05% iperf3 [kernel.kallsyms] [k] copy_from_iter
> > 11.77% iperf3 [kernel.kallsyms] [k] __div64_32
> > 5.44% iperf3 [kernel.kallsyms] [k] memset
> > 5.32% iperf3 [kernel.kallsyms] [k] stmmac_xmit
> > 2.70% iperf3 [kernel.kallsyms] [k] skb_segment
> > 2.56% iperf3 [kernel.kallsyms] [k] tcp_ack
> > -------------->8--------------
> >
> > do_div() here is mostly used in skb_mstamp_get() to convert nanoseconds
> > received from local_clock() to microseconds used in timestamp.
> > BTW conversion itself is as simple as "/=1000".
> >
> > Fortunately we already have much better __div64_32() for 32-bit ARM.
> > There in case of division by constant preprocessor calculates so-called
> > "magic number" which is later used in multiplications instead of divisions.
> > It's really nice and very optimal but obviously works only for ARM
> > because ARM assembly is involved.
> >
> > Now why don't we extend the same approach to all other 32-bit arches
> > with multiplication part implemented in pure C. With good compiler
> > resulting assembly will be quite close to manually written assembly.
> >
> > And that change implements that.
> >
> > But there's at least 1 problem which I don't know how to solve.
> > Preprocessor magic only happens if __div64_32() is inlined (that's
> > obvious - preprocessor has to know if divider is constant or not).
> >
> > But __div64_32() is already marked as weak function (which in its turn
> > is required to allow some architectures to provide its own optimal
> > implementations). I.e. addition of "inline" for __div64_32() is not an
> > option.
> >
> > So I do want to hear opinions on how to proceed with that patch.
> > Indeed there's the simplest solution - use this implementation only in
> > my architecture of preference (read ARC) but IMHO this change may
> > benefit other architectures as well.
>
> I tried something similar for MIPS a while ago after noticing a similar
> perf report. Adapting Nico's ARM code gave some nice speedups, but only
> when I used MIPS assembly for the long multiplies. Apparently gcc is
> still too stupid to do the sane thing.
Could you please elaborate a little bit on what the problem was with gcc
compared to hand-written asm?
The point is that if the preprocessor does proper constant propagation,
then the compiler only needs to implement the calculations marked
"run-time calculations", and those in turn are pretty straightforward
32-bit + and *.
And at least on ARC I saw that with this change perf no longer captures
__div64_32() during iperf runs, and the iperf results themselves improved
by about 10%. So I'd say the advantage is quite noticeable.
-Alexey
^ permalink raw reply [flat|nested] 24+ messages in thread
* Re: [PATCH] __div64_32: implement division by multiplication for 32-bit arches
2015-10-28 22:47 [PATCH] __div64_32: implement division by multiplication for 32-bit arches Alexey Brodkin
` (2 preceding siblings ...)
2015-10-29 12:52 ` Måns Rullgård
@ 2015-10-29 13:31 ` Russell King - ARM Linux
2015-10-29 14:32 ` Alexey Brodkin
2015-10-29 17:09 ` Randy Dunlap
4 siblings, 1 reply; 24+ messages in thread
From: Russell King - ARM Linux @ 2015-10-29 13:31 UTC (permalink / raw)
To: Alexey Brodkin
Cc: linux-kernel, linux-snps-arc, Vineet Gupta, Ingo Molnar,
Stephen Hemminger, David S. Miller, Nicolas Pitre
On Thu, Oct 29, 2015 at 01:47:35AM +0300, Alexey Brodkin wrote:
> diff --git a/lib/div64.c b/lib/div64.c
> index 62a698a..3055328 100644
> --- a/lib/div64.c
> +++ b/lib/div64.c
> +/*
> + * If the divisor happens to be constant, we determine the appropriate
> + * inverse at compile time to turn the division into a few inline
> + * multiplications instead which is much faster.
> + */
> uint32_t __attribute__((weak)) __div64_32(uint64_t *n, uint32_t base)
> {
> - uint64_t rem = *n;
> - uint64_t b = base;
> - uint64_t res, d = 1;
> - uint32_t high = rem >> 32;
> -
> - /* Reduce the thing a bit first */
> - res = 0;
> - if (high >= base) {
> - high /= base;
> - res = (uint64_t) high << 32;
> - rem -= (uint64_t) (high*base) << 32;
> - }
> + unsigned int __r, __b = base;
>
> - while ((int64_t)b > 0 && b < rem) {
> - b = b+b;
> - d = d+d;
> - }
> + if (!__builtin_constant_p(__b) || __b == 0) {
Can you explain how __builtin_constant_p(__b) can be anything but false
here? I can't see that this will ever be true.
This is a function in its own .c file - the compiler will have no
knowledge about the callers of this function scattered throughout the
kernel, and it has to assume that the 'base' argument to this function
is variable. So, __builtin_constant_p(__b) will always be false, which
means this if () statement will always be true and the else clause will
never be used.
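A hypothetical two-line demonstration of the point: once the test sits
behind a call boundary, the compiler can only see a variable (noinline here
models the weak, out-of-line __div64_32()):

```c
#include <stdbool.h>

/* Across the call boundary the parameter is just a variable, so the
 * builtin answers false regardless of what the caller passed. */
static __attribute__((noinline)) bool seen_as_constant(unsigned int b)
{
	return __builtin_constant_p(b);
}
```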
--
FTTC broadband for 0.8mile line: currently at 9.6Mbps down 400kbps up
according to speedtest.net.
^ permalink raw reply [flat|nested] 24+ messages in thread
* Re: [PATCH] __div64_32: implement division by multiplication for 32-bit arches
2015-10-29 13:05 ` Alexey Brodkin
@ 2015-10-29 13:37 ` Måns Rullgård
0 siblings, 0 replies; 24+ messages in thread
From: Måns Rullgård @ 2015-10-29 13:37 UTC (permalink / raw)
To: Alexey Brodkin
Cc: shemminger@linux-foundation.org, linux-kernel@vger.kernel.org,
Vineet.Gupta1@synopsys.com, linux-snps-arc@lists.infradead.org,
rmk+kernel@arm.linux.org.uk, davem@davemloft.net, mingo@elte.hu,
Nicolas Pitre
Alexey Brodkin <Alexey.Brodkin@synopsys.com> writes:
> Hi mans,
>
> On Thu, 2015-10-29 at 12:52 +0000, Måns Rullgård wrote:
>> Alexey Brodkin <Alexey.Brodkin@synopsys.com> writes:
>>
>> > Existing default implementation of __div64_32() for 32-bit arches unfolds
>> > into huge routine with tons of arithmetics like +, -, * and all of them
>> > in loops. That leads to obvious performance degradation if do_div() is
>> > frequently used.
>> >
>> > Good example is extensive TCP/IP traffic.
>> > That's what I'm getting with perf out of iperf3:
>> > -------------->8--------------
>> > 30.05% iperf3 [kernel.kallsyms] [k] copy_from_iter
>> > 11.77% iperf3 [kernel.kallsyms] [k] __div64_32
>> > 5.44% iperf3 [kernel.kallsyms] [k] memset
>> > 5.32% iperf3 [kernel.kallsyms] [k] stmmac_xmit
>> > 2.70% iperf3 [kernel.kallsyms] [k] skb_segment
>> > 2.56% iperf3 [kernel.kallsyms] [k] tcp_ack
>> > -------------->8--------------
>> >
>> > do_div() here is mostly used in skb_mstamp_get() to convert nanoseconds
>> > received from local_clock() to microseconds used in timestamp.
>> > BTW conversion itself is as simple as "/=1000".
>> >
>> > Fortunately we already have much better __div64_32() for 32-bit ARM.
>> > There in case of division by constant preprocessor calculates so-called
>> > "magic number" which is later used in multiplications instead of divisions.
>> > It's really nice and very optimal but obviously works only for ARM
>> > because ARM assembly is involved.
>> >
>> > Now why don't we extend the same approach to all other 32-bit arches
>> > with multiplication part implemented in pure C. With good compiler
>> > resulting assembly will be quite close to manually written assembly.
>> >
>> > And that change implements that.
>> >
>> > But there's at least 1 problem which I don't know how to solve.
>> > Preprocessor magic only happens if __div64_32() is inlined (that's
>> > obvious - preprocessor has to know if divider is constant or not).
>> >
>> > But __div64_32() is already marked as weak function (which in its turn
>> > is required to allow some architectures to provide its own optimal
>> > implementations). I.e. addition of "inline" for __div64_32() is not an
>> > option.
>> >
>> > So I do want to hear opinions on how to proceed with that patch.
>> > Indeed there's the simplest solution - use this implementation only in
>> > my architecture of preference (read ARC) but IMHO this change may
>> > benefit other architectures as well.
>>
>> I tried something similar for MIPS a while ago after noticing a similar
>> perf report. Adapting Nico's ARM code gave some nice speedups, but only
>> when I used MIPS assembly for the long multiplies. Apparently gcc is
>> still too stupid to do the sane thing.
>
> Could you please elaborate a little bit on what was a problem with gcc
> compared to hand-written asm?
In the final multiplications (the ones using ARM assembly), gcc has a
tendency to multiply things by zero and add the (zero) result to
something. This generally happens when multiplying a 64-bit value by a
32-bit one. The 32-bit value is simply converted to 64-bit by the usual
promotion rules, and gcc forgets that the upper half is known to be zero.
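A minimal sketch of the pattern in question (a hypothetical helper, not the actual kernel code): with the upper half of m known to be zero, (n * m) >> 32 needs only two 32x32->64 partial products, yet gcc, after promoting m to 64 bits, may still emit a third multiply by the zero high word.

```c
#include <assert.h>
#include <stdint.h>

/*
 * Hypothetical sketch: (n * m) >> 32 for 64-bit n and 32-bit m using
 * only 32x32->64 multiplies.  m's upper 32 bits are zero, so two
 * partial products suffice; the complaint above is that gcc forgets
 * this after the usual integer promotion.
 */
static uint64_t mul_64_32_hi(uint64_t n, uint32_t m)
{
	uint32_t n_lo = (uint32_t)n;
	uint32_t n_hi = (uint32_t)(n >> 32);
	uint64_t lo = (uint64_t)n_lo * m;	/* bits 0..63 of n * m */
	uint64_t hi = (uint64_t)n_hi * m;	/* bits 32..95 of n * m */

	return hi + (lo >> 32);	/* sum always fits in 64 bits */
}
```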
> The point is if preprocessor does proper constant propagation then compiler
> will need to implement only calculations marked "run-time calculations".
> And in its turn those are pretty straight-forward 32-bit + and *.
The constant calculation is fine. It's the final multiplication that's
the problem.
> And at least on ARC I saw with that change perf no longer captures
> __div64_32() during iperf and iperf results itself improved for about 10%.
> So I'd say advantage is quite noticeable.
There was an improvement without assembly as well, but with the MIPS
equivalent of the ARM assembly, it got much better.
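For reference, the constant-divisor trick being benchmarked can be sketched in plain C for b = 1000 (the "/= 1000" from skb_mstamp_get()). This is a simplified sketch relying on the __int128 GCC extension to keep it short; the actual kernel code builds the same product from 32x32->64 multiplies, and b = 1000 is one of the divisors that needs the truncation-error correction, hence the extra "+ m".

```c
#include <assert.h>
#include <stdint.h>

/*
 * Reciprocal division by the constant 1000, following the scheme
 * discussed in this thread: p = 512 = 1 << fls(1000),
 * m = (p << 64) / b, and n / b = (n * m + m) >> 73.
 * Simplified sketch using unsigned __int128 (a GCC extension),
 * not the kernel implementation.
 */
static uint64_t div_by_1000(uint64_t n)
{
	const unsigned __int128 m = ((unsigned __int128)512 << 64) / 1000;

	return (uint64_t)(((unsigned __int128)n * m + m) >> 73);
}
```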
--
Måns Rullgård
mans@mansr.com
^ permalink raw reply [flat|nested] 24+ messages in thread
* Re: [PATCH] __div64_32: implement division by multiplication for 32-bit arches
2015-10-29 13:31 ` Russell King - ARM Linux
@ 2015-10-29 14:32 ` Alexey Brodkin
0 siblings, 0 replies; 24+ messages in thread
From: Alexey Brodkin @ 2015-10-29 14:32 UTC (permalink / raw)
To: linux@arm.linux.org.uk
Cc: nicolas.pitre@linaro.org, linux-kernel@vger.kernel.org,
Vineet.Gupta1@synopsys.com, shemminger@linux-foundation.org,
mingo@elte.hu, linux-snps-arc@lists.infradead.org,
davem@davemloft.net
Hi Russell,
On Thu, 2015-10-29 at 13:31 +0000, Russell King - ARM Linux wrote:
> On Thu, Oct 29, 2015 at 01:47:35AM +0300, Alexey Brodkin wrote:
> > diff --git a/lib/div64.c b/lib/div64.c
> > index 62a698a..3055328 100644
> > --- a/lib/div64.c
> > +++ b/lib/div64.c
> > +/*
> > + * If the divisor happens to be constant, we determine the appropriate
> > + * inverse at compile time to turn the division into a few inline
> > + * multiplications instead which is much faster.
> > + */
> > uint32_t __attribute__((weak)) __div64_32(uint64_t *n, uint32_t base)
> > {
> > - uint64_t rem = *n;
> > - uint64_t b = base;
> > - uint64_t res, d = 1;
> > - uint32_t high = rem >> 32;
> > -
> > - /* Reduce the thing a bit first */
> > - res = 0;
> > - if (high >= base) {
> > - high /= base;
> > - res = (uint64_t) high << 32;
> > - rem -= (uint64_t) (high*base) << 32;
> > - }
> > + unsigned int __r, __b = base;
> >
> > - while ((int64_t)b > 0 && b < rem) {
> > - b = b+b;
> > - d = d+d;
> > - }
> > + if (!__builtin_constant_p(__b) || __b == 0) {
>
> Can you explain who __builtin_constant_p(__b) can be anything but false
> here? I can't see that this will ever be true.
>
> This is a function in its own .c file - the compiler will have no
> knowledge about the callers of this function scattered throughout the
> kernel, and it has to assume that the 'base' argument to this function
> is variable. So, __builtin_constant_p(__b) will always be false, which
> means this if () statement will always be true and the else clause will
> never be used.
Essentially constant propagation will only happen if __div64_32() is inlined.
For that we'd need to add an "inline" specifier to __div64_32(), but that in
turn would prevent architectures from providing their own, more optimal
__div64_32() implementations.
And that was my main question: how to implement this properly, i.e. get a better
generic do_div() (with __div64_32() as its heavy-lifting part) while still
keeping the ability for some architectures to use their own implementations.
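The inlining constraint can be illustrated with a toy macro (hypothetical name, not the kernel API): __builtin_constant_p() folds to 1 only where the expansion site actually sees a literal, which an out-of-line, weak __div64_32() never does.

```c
#include <stdint.h>

/*
 * Toy illustration (hypothetical macro, not the kernel API) of why
 * the constant fast path must live in a macro or inline function:
 * __builtin_constant_p() can only report 1 where the expansion site
 * sees the literal divisor.  Inside the out-of-line, weak
 * __div64_32() the divisor is always "unknown", so the builtin is
 * always 0 there, as Russell points out above.
 */
#define DIV_HAS_FAST_PATH(base) \
	(__builtin_constant_p(base) && (base) != 0)
```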
-Alexey
^ permalink raw reply [flat|nested] 24+ messages in thread
* Re: [PATCH] __div64_32: implement division by multiplication for 32-bit arches
2015-10-28 22:47 [PATCH] __div64_32: implement division by multiplication for 32-bit arches Alexey Brodkin
` (3 preceding siblings ...)
2015-10-29 13:31 ` Russell King - ARM Linux
@ 2015-10-29 17:09 ` Randy Dunlap
4 siblings, 0 replies; 24+ messages in thread
From: Randy Dunlap @ 2015-10-29 17:09 UTC (permalink / raw)
To: Alexey Brodkin, linux-kernel
Cc: linux-snps-arc, Vineet Gupta, Ingo Molnar, Stephen Hemminger,
David S. Miller, Nicolas Pitre, Russell King
typos (spellos):
On 10/28/15 15:47, Alexey Brodkin wrote:
> ---
> lib/div64.c | 153 ++++++++++++++++++++++++++++++++++++++++++++++++++----------
> 1 file changed, 128 insertions(+), 25 deletions(-)
>
> diff --git a/lib/div64.c b/lib/div64.c
> index 62a698a..3055328 100644
> --- a/lib/div64.c
> +++ b/lib/div64.c
> @@ -23,37 +23,140 @@
> /* Not needed on 64bit architectures */
> #if BITS_PER_LONG == 32
>
> +
> +/*
> + * If the divisor happens to be constant, we determine the appropriate
> + * inverse at compile time to turn the division into a few inline
> + * multiplications instead which is much faster.
> + */
> uint32_t __attribute__((weak)) __div64_32(uint64_t *n, uint32_t base)
> {
> - uint64_t rem = *n;
> - uint64_t b = base;
> - uint64_t res, d = 1;
> - uint32_t high = rem >> 32;
> -
> - /* Reduce the thing a bit first */
> - res = 0;
> - if (high >= base) {
> - high /= base;
> - res = (uint64_t) high << 32;
> - rem -= (uint64_t) (high*base) << 32;
> - }
> + unsigned int __r, __b = base;
>
> - while ((int64_t)b > 0 && b < rem) {
> - b = b+b;
> - d = d+d;
> - }
> + if (!__builtin_constant_p(__b) || __b == 0) {
> + /* non-constant divisor (or zero): slow path */
> + uint64_t rem = *n;
> + uint64_t b = base;
> + uint64_t res, d = 1;
> + uint32_t high = rem >> 32;
> +
> + /* Reduce the thing a bit first */
> + res = 0;
> + if (high >= base) {
> + high /= base;
> + res = (uint64_t) high << 32;
> + rem -= (uint64_t) (high*base) << 32;
> + }
> +
> + while ((int64_t)b > 0 && b < rem) {
> + b = b+b;
> + d = d+d;
> + }
> +
> + do {
> + if (rem >= b) {
> + rem -= b;
> + res += d;
> + }
> + b >>= 1;
> + d >>= 1;
> + } while (d);
>
> - do {
> - if (rem >= b) {
> - rem -= b;
> - res += d;
> + *n = res;
> + __r = rem;
> + } else if ((__b & (__b - 1)) == 0) {
> + /*
> + * Trivial: __b is constant and a power of 2
> + * gcc does the right thing with this code.
> + * Even though code is the same as above but
> + * we make it visually as a separate path.
> + * Still only one of these branches will survive
> + * pre-processor stage, so let's leave it here.
> + */
> + __r = *n;
> + __r &= (__b - 1);
> + *n /= __b;
> + } else {
> + /* Start of preprocessor calculations */
> +
> + /*
> + * Multiply by inverse of __b: *n/b = *n*(p/b)/p
> + * We rely on the fact that most of this code gets
> + * optimized away at compile time due to constant
> + * propagation and only a couple inline assembly
> + * instructions should remain. Better avoid any
> + * code construct that might prevent that.
> + */
> + unsigned long long __res, __x, __t, __m, __n = *n;
> + unsigned int __p;
> + /* preserve low part of *n for reminder computation */
remainder
> + __r = __n;
> + /* determine number of bits to represent __b */
> + __p = 1 << __div64_fls(__b);
> + /* compute __m = ((__p << 64) + __b - 1) / __b */
> + __m = (~0ULL / __b) * __p;
> + __m += (((~0ULL % __b + 1) * __p) + __b - 1) / __b;
> + /* compute __res = __m*(~0ULL/__b*__b-1)/(__p << 64) */
> + __x = ~0ULL / __b * __b - 1;
> + __res = (__m & 0xffffffff) * (__x & 0xffffffff);
> + __res >>= 32;
> + __res += (__m & 0xffffffff) * (__x >> 32);
> + __t = __res;
> + __res += (__x & 0xffffffff) * (__m >> 32);
> + __t = (__res < __t) ? (1ULL << 32) : 0;
> + __res = (__res >> 32) + __t;
> + __res += (__m >> 32) * (__x >> 32);
> + __res /= __p;
> + /* End of preprocessor calculations */
> +
> + /* Start of run-time calculations */
> + __res = (unsigned int)__m * (unsigned int)__n;
> + __res >>= 32;
> + __res += (unsigned int)__m * (__n >> 32);
> + __t = __res;
> + __res += (unsigned int)__n * (__m >> 32);
> + __t = (__res < __t) ? (1ULL << 32) : 0;
> + __res = (__res >> 32) + __t;
> + __res += (__m >> 32) * (__n >> 32);
> + __res /= __p;
> +
> + /*
> + * The reminder can be computed with 32-bit regs
remainder
> + * only, and gcc is good at that.
> + */
> + {
> + unsigned int __res0 = __res;
> + unsigned int __b0 = __b;
> +
> + __r -= __res0 * __b0;
> }
> - b >>= 1;
> - d >>= 1;
> - } while (d);
> + /* End of run-time calculations */
>
> - *n = res;
> - return rem;
> + *n = __res;
> + }
> + return __r;
> }
>
> EXPORT_SYMBOL(__div64_32);
--
~Randy
^ permalink raw reply [flat|nested] 24+ messages in thread
* Re: [PATCH] __div64_32: implement division by multiplication for 32-bit arches
2015-10-28 23:32 ` Nicolas Pitre
2015-10-29 7:34 ` Alexey Brodkin
@ 2015-10-30 1:26 ` Nicolas Pitre
2015-10-30 5:41 ` Vineet Gupta
` (2 more replies)
1 sibling, 3 replies; 24+ messages in thread
From: Nicolas Pitre @ 2015-10-30 1:26 UTC (permalink / raw)
To: Alexey Brodkin
Cc: linux-kernel, linux-snps-arc, Vineet Gupta, Ingo Molnar,
Stephen Hemminger, David S. Miller, Russell King
On Wed, 28 Oct 2015, Nicolas Pitre wrote:
> On Thu, 29 Oct 2015, Alexey Brodkin wrote:
>
> > Fortunately we already have much better __div64_32() for 32-bit ARM.
> > There in case of division by constant preprocessor calculates so-called
> > "magic number" which is later used in multiplications instead of divisions.
>
> It's not magic, it is science. :-)
>
> > It's really nice and very optimal but obviously works only for ARM
> > because ARM assembly is involved.
> >
> > Now why don't we extend the same approach to all other 32-bit arches
> > with multiplication part implemented in pure C. With good compiler
> > resulting assembly will be quite close to manually written assembly.
Well... not as close at least on ARM. Maybe 2x to 3x more costly than
the one with assembly. Still better than 100x or so without this
optimization.
> > But there's at least 1 problem which I don't know how to solve.
> > Preprocessor magic only happens if __div64_32() is inlined (that's
> > obvious - preprocessor has to know if divider is constant or not).
> >
> > But __div64_32() is already marked as weak function (which in its turn
> > is required to allow some architectures to provide its own optimal
> > implementations). I.e. addition of "inline" for __div64_32() is not an
> > option.
>
> You can't inline __div64_32(). It should remain as is and used only for
> the slow path.
>
> For the constant based optimization to work, you need to modify do_div()
> in include/asm-generic/div64.h directly.
OK... I was intrigued, so I adapted my ARM code to the generic case,
including the overflow avoidance optimizations. Please have a look and
tell me how this works for you.
If this patch is accepted upstream, then it could be possible to
abstract only the actual multiplication part with some architecture
specific assembly.
diff --git a/include/asm-generic/div64.h b/include/asm-generic/div64.h
index 8f4e319334..43c3b21dca 100644
--- a/include/asm-generic/div64.h
+++ b/include/asm-generic/div64.h
@@ -32,6 +32,149 @@
#elif BITS_PER_LONG == 32
+/* macroized fls implementation to ensure proper constant propagation */
+#define __div64_fls(bits) \
+({ \
+ unsigned int __left = (bits), __nr = 0; \
+ if (__left & 0xffff0000) __nr += 16, __left >>= 16; \
+ if (__left & 0x0000ff00) __nr += 8, __left >>= 8; \
+ if (__left & 0x000000f0) __nr += 4, __left >>= 4; \
+ if (__left & 0x0000000c) __nr += 2, __left >>= 2; \
+ if (__left & 0x00000002) __nr += 1; \
+ __nr; \
+})
+
+/*
+ * If the divisor happens to be constant, we determine the appropriate
+ * inverse at compile time to turn the division into a few inline
+ * multiplications which ought to be much faster. And yet only if compiling
+ * with a sufficiently recent gcc version to perform proper 64-bit constant
+ * propagation.
+ *
+ * (It is unfortunate that gcc doesn't perform all this internally.)
+ */
+#define __div64_const32(n, ___b) \
+({ \
+ /* \
+ * Multiplication by reciprocal of b: n / b = n * (p / b) / p \
+ * \
+ * We rely on the fact that most of this code gets optimized \
+ * away at compile time due to constant propagation and only \
+ * a few multiplication instructions should remain. \
+ * Hence this monstrous macro (static inline doesn't always \
+ * do the trick for some reason). \
+ */ \
+ uint64_t ___res, ___x, ___t, ___m, ___n = (n); \
+ uint32_t ___c, ___p, ___m_lo, ___m_hi, ___n_lo, ___n_hi; \
+ \
+ /* determine number of bits to represent b */ \
+ ___p = 1 << __div64_fls(___b); \
+ \
+ /* compute m = ((p << 64) + b - 1) / b */ \
+ ___m = (~0ULL / ___b) * ___p; \
+ ___m += (((~0ULL % ___b + 1) * ___p) + ___b - 1) / ___b; \
+ \
+ /* dividend that produces one less than the highest result */ \
+ ___x = ~0ULL / ___b * ___b - 1; \
+ \
+ /* test our m with res = m * x / (p << 64) */ \
+ ___res = ((___m & 0xffffffff) * (___x & 0xffffffff)) >> 32; \
+ ___t = ___res += (___m & 0xffffffff) * (___x >> 32); \
+ ___res += (___x & 0xffffffff) * (___m >> 32); \
+ ___t = (___res < ___t) ? (1ULL << 32) : 0; \
+ ___res = (___res >> 32) + ___t; \
+ ___res += (___m >> 32) * (___x >> 32); \
+ ___res /= ___p; \
+ \
+ /* Now sanitize and optimize what we've got. */ \
+ if (~0ULL % (___b / (___b & -___b)) == 0) { \
+ /* those cases can be simplified with: */ \
+ ___n /= (___b & -___b); \
+ ___m = ~0ULL / (___b / (___b & -___b)); \
+ ___p = 1; \
+ ___c = 1; \
+ } else if (___res != ___x / ___b) { \
+ /* \
+ * We can't get away without a correction to compensate \
+ * for bit truncation errors. To avoid it we'd need an \
+ * additional bit to represent m which would overflow \
+ * our 64-bit variable. \
+ * \
+ * Instead we do m = p / b and n / b = (n * m + m) / p. \
+ */ \
+ ___c = 1; \
+ /* Compute m = (p << 64) / b */ \
+ ___m = (~0ULL / ___b) * ___p; \
+ ___m += ((~0ULL % ___b + 1) * ___p) / ___b; \
+ } else { \
+ /* \
+ * Reduce m / p, and try to clear bit 31 of m when \
+ * possible, otherwise that'll need extra overflow \
+ * handling later. \
+ */ \
+ unsigned int ___bits = -(___m & -___m); \
+ ___bits |= ___m >> 32; \
+ ___bits = (~___bits) << 1; \
+ /* \
+ * If ___bits == 0 then setting bit 31 is unavoidable. \
+ * Simply apply the maximum possible reduction in that \
+ * case. Otherwise the MSB of ___bits indicates the \
+ * best reduction we should apply. \
+ */ \
+ if (!___bits) { \
+ ___p /= (___m & -___m); \
+ ___m /= (___m & -___m); \
+ } else { \
+ ___p >>= __div64_fls(___bits); \
+ ___m >>= __div64_fls(___bits); \
+ } \
+ /* No correction needed. */ \
+ ___c = 0; \
+ } \
+ \
+ /* \
+ * Now we have a combination of 2 conditions: \
+ * \
+ * 1) whether or not we need a correction (___c), and \
+ * \
+ * 2) whether or not there might be an overflow in the cross \
+ * product determined by (___m & ((1 << 63) | (1 << 31))). \
+ * \
+ * Select the best way to do the m * n / (p << 64) operation. \
+ * From here on there will be actual runtime code generated. \
+ */ \
+ \
+ ___m_lo = ___m; \
+ ___m_hi = ___m >> 32; \
+ ___n_lo = ___n; \
+ ___n_hi = ___n >> 32; \
+ \
+ if (!___c) { \
+ ___res = ((uint64_t)___m_lo * ___n_lo) >> 32; \
+ } else if (!(___m & ((1ULL << 63) | (1ULL << 31)))) { \
+ ___res = (___m + (uint64_t)___m_lo * ___n_lo) >> 32; \
+ } else { \
+ ___res = ___m + (uint64_t)___m_lo * ___n_lo; \
+ ___t = (___res < ___m) ? (1ULL << 32) : 0; \
+ ___res = (___res >> 32) + ___t; \
+ } \
+ \
+ if (!(___m & ((1ULL << 63) | (1ULL << 31)))) { \
+ ___res += (uint64_t)___m_lo * ___n_hi; \
+ ___res += (uint64_t)___m_hi * ___n_lo; \
+ ___res >>= 32; \
+ } else { \
+ ___t = ___res += (uint64_t)___m_lo * ___n_hi; \
+ ___res += (uint64_t)___m_hi * ___n_lo; \
+ ___t = (___res < ___t) ? (1ULL << 32) : 0; \
+ ___res = (___res >> 32) + ___t; \
+ } \
+ \
+ ___res += (uint64_t)___m_hi * ___n_hi; \
+ \
+ ___res / ___p; \
+})
+
extern uint32_t __div64_32(uint64_t *dividend, uint32_t divisor);
/* The unnecessary pointer compare is there
@@ -41,7 +184,20 @@ extern uint32_t __div64_32(uint64_t *dividend, uint32_t divisor);
uint32_t __base = (base); \
uint32_t __rem; \
(void)(((typeof((n)) *)0) == ((uint64_t *)0)); \
- if (likely(((n) >> 32) == 0)) { \
+ if (__builtin_constant_p(__base) && \
+ (__base & (__base - 1)) == 0) { \
+ /* constant power of 2: gcc is fine */ \
+ __rem = (n) & (__base - 1); \
+ (n) /= __base; \
+ } else if ((__GNUC__ >= 4) && \
+ __builtin_constant_p(__base) && \
+ __base != 0) { \
+ uint32_t __res_lo, __n_lo = (n); \
+ (n) = __div64_const32(n, __base); \
+ /* the remainder can be computed with 32-bit regs */ \
+ __res_lo = (n); \
+ __rem = __n_lo - __res_lo * __base; \
+ } else if (likely(((n) >> 32) == 0)) { \
__rem = (uint32_t)(n) % __base; \
(n) = (uint32_t)(n) / __base; \
} else \
^ permalink raw reply related [flat|nested] 24+ messages in thread
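As a sanity check, the __div64_fls() macro from the patch above can be exercised as a plain function (rewritten here purely for illustration; the kernel keeps it a macro so that, for a constant divisor, the result itself stays a compile-time constant).

```c
#include <assert.h>

/*
 * Plain-function version of the __div64_fls() macro from the patch
 * above: 0-based position of the most significant set bit.
 */
static unsigned int div64_fls(unsigned int bits)
{
	unsigned int left = bits, nr = 0;

	if (left & 0xffff0000) { nr += 16; left >>= 16; }
	if (left & 0x0000ff00) { nr += 8;  left >>= 8;  }
	if (left & 0x000000f0) { nr += 4;  left >>= 4;  }
	if (left & 0x0000000c) { nr += 2;  left >>= 2;  }
	if (left & 0x00000002) { nr += 1; }
	return nr;
}
```

For b = 1000 this yields 9, so p = 1 << 9 = 512, matching the constants used elsewhere in the thread.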
* Re: [PATCH] __div64_32: implement division by multiplication for 32-bit arches
2015-10-30 1:26 ` Nicolas Pitre
@ 2015-10-30 5:41 ` Vineet Gupta
2015-10-30 12:41 ` Måns Rullgård
2015-10-30 12:40 ` Måns Rullgård
2015-10-30 14:28 ` Alexey Brodkin
2 siblings, 1 reply; 24+ messages in thread
From: Vineet Gupta @ 2015-10-30 5:41 UTC (permalink / raw)
To: Nicolas Pitre, Alexey Brodkin
Cc: linux-kernel@vger.kernel.org, linux-snps-arc@lists.infradead.org,
Ingo Molnar, Stephen Hemminger, David S. Miller, Russell King
+CC Claudiu: ARC gcc expert (please scroll to bottom)
On Friday 30 October 2015 06:56 AM, Nicolas Pitre wrote:
> On Wed, 28 Oct 2015, Nicolas Pitre wrote:
>
>> On Thu, 29 Oct 2015, Alexey Brodkin wrote:
>>
>>> Fortunately we already have much better __div64_32() for 32-bit ARM.
>>> There in case of division by constant preprocessor calculates so-called
>>> "magic number" which is later used in multiplications instead of divisions.
>> It's not magic, it is science. :-)
>>
>>> It's really nice and very optimal but obviously works only for ARM
>>> because ARM assembly is involved.
>>>
>>> Now why don't we extend the same approach to all other 32-bit arches
>>> with multiplication part implemented in pure C. With good compiler
>>> resulting assembly will be quite close to manually written assembly.
> Well... not as close at least on ARM. Maybe 2x to 3x more costly than
> the one with assembly. Still better than 100x or so without this
> optimization.
>
>>> But there's at least 1 problem which I don't know how to solve.
>>> Preprocessor magic only happens if __div64_32() is inlined (that's
>>> obvious - preprocessor has to know if divider is constant or not).
>>>
>>> But __div64_32() is already marked as weak function (which in its turn
>>> is required to allow some architectures to provide its own optimal
>>> implementations). I.e. addition of "inline" for __div64_32() is not an
>>> option.
>> You can't inline __div64_32(). It should remain as is and used only for
>> the slow path.
>>
>> For the constant based optimization to work, you need to modify do_div()
>> in include/asm-generic/div64.h directly.
> OK... I was intrigued, so I adapted my ARM code to the generic case,
> including the overflow avoidance optimizations. Please have a look and
> tell me how this works for you.
>
> If this patch is accepted upstream, then it could be possible to
> abstract only the actual multiplication part with some architecture
> specific assembly.
>
> diff --git a/include/asm-generic/div64.h b/include/asm-generic/div64.h
> index 8f4e319334..43c3b21dca 100644
> --- a/include/asm-generic/div64.h
> +++ b/include/asm-generic/div64.h
> @@ -32,6 +32,149 @@
>
> #elif BITS_PER_LONG == 32
>
> +/* macroized fls implementation to ensure proper constant propagation */
> +#define __div64_fls(bits) \
> +({ \
> + unsigned int __left = (bits), __nr = 0; \
> + if (__left & 0xffff0000) __nr += 16, __left >>= 16; \
> + if (__left & 0x0000ff00) __nr += 8, __left >>= 8; \
> + if (__left & 0x000000f0) __nr += 4, __left >>= 4; \
> + if (__left & 0x0000000c) __nr += 2, __left >>= 2; \
> + if (__left & 0x00000002) __nr += 1; \
> + __nr; \
> +})
> +
> +/*
> + * If the divisor happens to be constant, we determine the appropriate
> + * inverse at compile time to turn the division into a few inline
> + * multiplications which ought to be much faster. And yet only if compiling
> + * with a sufficiently recent gcc version to perform proper 64-bit constant
> + * propagation.
> + *
> + * (It is unfortunate that gcc doesn't perform all this internally.)
> + */
> +#define __div64_const32(n, ___b) \
> +({ \
> + /* \
> + * Multiplication by reciprocal of b: n / b = n * (p / b) / p \
> + * \
> + * We rely on the fact that most of this code gets optimized \
> + * away at compile time due to constant propagation and only \
> + * a few multiplication instructions should remain. \
> + * Hence this monstrous macro (static inline doesn't always \
> + * do the trick for some reason). \
> + */ \
> + uint64_t ___res, ___x, ___t, ___m, ___n = (n); \
> + uint32_t ___c, ___p, ___m_lo, ___m_hi, ___n_lo, ___n_hi; \
> + \
> + /* determine number of bits to represent b */ \
> + ___p = 1 << __div64_fls(___b); \
> + \
> + /* compute m = ((p << 64) + b - 1) / b */ \
> + ___m = (~0ULL / ___b) * ___p; \
> + ___m += (((~0ULL % ___b + 1) * ___p) + ___b - 1) / ___b; \
> + \
> + /* dividend that produces one less than the highest result */ \
> + ___x = ~0ULL / ___b * ___b - 1; \
> + \
> + /* test our m with res = m * x / (p << 64) */ \
> + ___res = ((___m & 0xffffffff) * (___x & 0xffffffff)) >> 32; \
> + ___t = ___res += (___m & 0xffffffff) * (___x >> 32); \
> + ___res += (___x & 0xffffffff) * (___m >> 32); \
> + ___t = (___res < ___t) ? (1ULL << 32) : 0; \
> + ___res = (___res >> 32) + ___t; \
> + ___res += (___m >> 32) * (___x >> 32); \
> + ___res /= ___p; \
> + \
> + /* Now sanitize and optimize what we've got. */ \
> + if (~0ULL % (___b / (___b & -___b)) == 0) { \
> + /* those cases can be simplified with: */ \
> + ___n /= (___b & -___b); \
> + ___m = ~0ULL / (___b / (___b & -___b)); \
> + ___p = 1; \
> + ___c = 1; \
> + } else if (___res != ___x / ___b) { \
> + /* \
> + * We can't get away without a correction to compensate \
> + * for bit truncation errors. To avoid it we'd need an \
> + * additional bit to represent m which would overflow \
> + * our 64-bit variable. \
> + * \
> + * Instead we do m = p / b and n / b = (n * m + m) / p. \
> + */ \
> + ___c = 1; \
> + /* Compute m = (p << 64) / b */ \
> + ___m = (~0ULL / ___b) * ___p; \
> + ___m += ((~0ULL % ___b + 1) * ___p) / ___b; \
> + } else { \
> + /* \
> + * Reduce m / p, and try to clear bit 31 of m when \
> + * possible, otherwise that'll need extra overflow \
> + * handling later. \
> + */ \
> + unsigned int ___bits = -(___m & -___m); \
> + ___bits |= ___m >> 32; \
> + ___bits = (~___bits) << 1; \
> + /* \
> + * If ___bits == 0 then setting bit 31 is unavoidable. \
> + * Simply apply the maximum possible reduction in that \
> + * case. Otherwise the MSB of ___bits indicates the \
> + * best reduction we should apply. \
> + */ \
> + if (!___bits) { \
> + ___p /= (___m & -___m); \
> + ___m /= (___m & -___m); \
> + } else { \
> + ___p >>= __div64_fls(___bits); \
> + ___m >>= __div64_fls(___bits); \
> + } \
> + /* No correction needed. */ \
> + ___c = 0; \
> + } \
> + \
> + /* \
> + * Now we have a combination of 2 conditions: \
> + * \
> + * 1) whether or not we need a correction (___c), and \
> + * \
> + * 2) whether or not there might be an overflow in the cross \
> + * product determined by (___m & ((1 << 63) | (1 << 31))). \
> + * \
> + * Select the best way to do the m * n / (p << 64) operation. \
> + * From here on there will be actual runtime code generated. \
> + */ \
> + \
> + ___m_lo = ___m; \
> + ___m_hi = ___m >> 32; \
> + ___n_lo = ___n; \
> + ___n_hi = ___n >> 32; \
> + \
> + if (!___c) { \
> + ___res = ((uint64_t)___m_lo * ___n_lo) >> 32; \
> + } else if (!(___m & ((1ULL << 63) | (1ULL << 31)))) { \
> + ___res = (___m + (uint64_t)___m_lo * ___n_lo) >> 32; \
> + } else { \
> + ___res = ___m + (uint64_t)___m_lo * ___n_lo; \
> + ___t = (___res < ___m) ? (1ULL << 32) : 0; \
> + ___res = (___res >> 32) + ___t; \
> + } \
> + \
> + if (!(___m & ((1ULL << 63) | (1ULL << 31)))) { \
> + ___res += (uint64_t)___m_lo * ___n_hi; \
> + ___res += (uint64_t)___m_hi * ___n_lo; \
> + ___res >>= 32; \
> + } else { \
> + ___t = ___res += (uint64_t)___m_lo * ___n_hi; \
> + ___res += (uint64_t)___m_hi * ___n_lo; \
> + ___t = (___res < ___t) ? (1ULL << 32) : 0; \
> + ___res = (___res >> 32) + ___t; \
> + } \
> + \
> + ___res += (uint64_t)___m_hi * ___n_hi; \
> + \
> + ___res / ___p; \
> +})
> +
> extern uint32_t __div64_32(uint64_t *dividend, uint32_t divisor);
>
> /* The unnecessary pointer compare is there
> @@ -41,7 +184,20 @@ extern uint32_t __div64_32(uint64_t *dividend, uint32_t divisor);
> uint32_t __base = (base); \
> uint32_t __rem; \
> (void)(((typeof((n)) *)0) == ((uint64_t *)0)); \
> - if (likely(((n) >> 32) == 0)) { \
> + if (__builtin_constant_p(__base) && \
> + (__base & (__base - 1)) == 0) { \
> + /* constant power of 2: gcc is fine */ \
> + __rem = (n) & (__base - 1); \
> + (n) /= __base; \
> + } else if ((__GNUC__ >= 4) && \
> + __builtin_constant_p(__base) && \
> + __base != 0) { \
> + uint32_t __res_lo, __n_lo = (n); \
> + (n) = __div64_const32(n, __base); \
> + /* the remainder can be computed with 32-bit regs */ \
> + __res_lo = (n); \
> + __rem = __n_lo - __res_lo * __base; \
> + } else if (likely(((n) >> 32) == 0)) { \
> __rem = (uint32_t)(n) % __base; \
> (n) = (uint32_t)(n) / __base; \
> } else \
>
Claudiu, can some of this not be done in gcc itself?
-Vineet
^ permalink raw reply [flat|nested] 24+ messages in thread
* Re: [PATCH] __div64_32: implement division by multiplication for 32-bit arches
2015-10-30 1:26 ` Nicolas Pitre
2015-10-30 5:41 ` Vineet Gupta
@ 2015-10-30 12:40 ` Måns Rullgård
2015-10-30 15:17 ` Nicolas Pitre
2015-10-30 14:28 ` Alexey Brodkin
2 siblings, 1 reply; 24+ messages in thread
From: Måns Rullgård @ 2015-10-30 12:40 UTC (permalink / raw)
To: Nicolas Pitre
Cc: Alexey Brodkin, linux-kernel, linux-snps-arc, Vineet Gupta,
Ingo Molnar, Stephen Hemminger, David S. Miller, Russell King
Nicolas Pitre <nicolas.pitre@linaro.org> writes:
> On Wed, 28 Oct 2015, Nicolas Pitre wrote:
>
>> On Thu, 29 Oct 2015, Alexey Brodkin wrote:
>>
>> > Fortunately we already have much better __div64_32() for 32-bit ARM.
>> > There in case of division by constant preprocessor calculates so-called
>> > "magic number" which is later used in multiplications instead of divisions.
>>
>> It's not magic, it is science. :-)
>>
>> > It's really nice and very optimal but obviously works only for ARM
>> > because ARM assembly is involved.
>> >
>> > Now why don't we extend the same approach to all other 32-bit arches
>> > with multiplication part implemented in pure C. With good compiler
>> > resulting assembly will be quite close to manually written assembly.
>
> Well... not as close at least on ARM. Maybe 2x to 3x more costly than
> the one with assembly. Still better than 100x or so without this
> optimization.
That's more or less what I found on MIPS too.
>> > But there's at least 1 problem which I don't know how to solve.
>> > Preprocessor magic only happens if __div64_32() is inlined (that's
>> > obvious - preprocessor has to know if divider is constant or not).
>> >
>> > But __div64_32() is already marked as weak function (which in its turn
>> > is required to allow some architectures to provide its own optimal
>> > implementations). I.e. addition of "inline" for __div64_32() is not an
>> > option.
>>
>> You can't inline __div64_32(). It should remain as is and used only for
>> the slow path.
>>
>> For the constant based optimization to work, you need to modify do_div()
>> in include/asm-generic/div64.h directly.
>
> OK... I was intrigued, so I adapted my ARM code to the generic case,
> including the overflow avoidance optimizations. Please have a look and
> tell me how this works for you.
>
> If this patch is accepted upstream, then it could be possible to
> abstract only the actual multiplication part with some architecture
> specific assembly.
Good idea.
--
Måns Rullgård
mans@mansr.com
^ permalink raw reply [flat|nested] 24+ messages in thread
* Re: [PATCH] __div64_32: implement division by multiplication for 32-bit arches
2015-10-30 5:41 ` Vineet Gupta
@ 2015-10-30 12:41 ` Måns Rullgård
0 siblings, 0 replies; 24+ messages in thread
From: Måns Rullgård @ 2015-10-30 12:41 UTC (permalink / raw)
To: Vineet Gupta
Cc: Nicolas Pitre, Alexey Brodkin, linux-kernel@vger.kernel.org,
Russell King, Ingo Molnar, linux-snps-arc@lists.infradead.org,
Stephen Hemminger, David S. Miller
Vineet Gupta <Vineet.Gupta1@synopsys.com> writes:
> Claudiu, can some of this not be done in gcc itself !
All of it could be done by gcc. The trouble is that it isn't.
--
Måns Rullgård
mans@mansr.com
^ permalink raw reply [flat|nested] 24+ messages in thread
* Re: [PATCH] __div64_32: implement division by multiplication for 32-bit arches
2015-10-30 1:26 ` Nicolas Pitre
2015-10-30 5:41 ` Vineet Gupta
2015-10-30 12:40 ` Måns Rullgård
@ 2015-10-30 14:28 ` Alexey Brodkin
2 siblings, 0 replies; 24+ messages in thread
From: Alexey Brodkin @ 2015-10-30 14:28 UTC (permalink / raw)
To: nicolas.pitre@linaro.org
Cc: rmk+kernel@arm.linux.org.uk, linux-kernel@vger.kernel.org,
Vineet.Gupta1@synopsys.com, shemminger@linux-foundation.org,
mingo@elte.hu, linux-snps-arc@lists.infradead.org,
davem@davemloft.net
Hi Nicolas,
On Thu, 2015-10-29 at 21:26 -0400, Nicolas Pitre wrote:
> On Wed, 28 Oct 2015, Nicolas Pitre wrote:
>
> > On Thu, 29 Oct 2015, Alexey Brodkin wrote:
> >
> > > Fortunately we already have much better __div64_32() for 32-bit ARM.
> > > There in case of division by constant preprocessor calculates so-called
> > > "magic number" which is later used in multiplications instead of divisions.
> >
> > It's not magic, it is science. :-)
> >
> > > It's really nice and very optimal but obviously works only for ARM
> > > because ARM assembly is involved.
> > >
> > > Now why don't we extend the same approach to all other 32-bit arches
> > > with multiplication part implemented in pure C. With good compiler
> > > resulting assembly will be quite close to manually written assembly.
>
> Well... not as close at least on ARM. Maybe 2x to 3x more costly than
> the one with assembly. Still better than 100x or so without this
> optimization.
Indeed, even having that function 25 times faster instead of 100 times faster is
already quite an achievement. For example, that will already cure my iperf
performance degradation: I'll see do_div() taking < 1% instead of > 10% now.
My test source was:
--------------------->8------------------------
int myfunc(u64 data)
{
	/* do_div() divides in place: the quotient ends up in data,
	 * the remainder is returned */
	return do_div(data, 1000);
}
--------------------->8------------------------
Now take a look at the disassembly that I'm getting:
--------------------->8------------------------
0000062c <myfunc>:
62c: 19 28 86 0f 4f 8d 3b df mpydu r6,r0,0x8d4fdf3b
634: 00 26 86 8f 4f 8d 3b df add.f r6,r6,0x8d4fdf3b
63c: 02 27 87 0f ed 7c 69 91 sub r7,r7,0x7ced9169
644: c0 27 65 00 add.c r7,r7,1
648: 85 0e c4 71 12 83 97 6e brlo 0x83126e97,r7,6cc <myfunc+0xa0>
650: 75 0f 80 0f 12 83 97 6e breq r7,0x83126e97,6c4 <myfunc+0x98>
658: 0d 70 mov_s r8,0
65a: 2d 71 mov_s r9,1
65c: 19 29 8a 0f 4f 8d 3b df mpydu r10,r1,0x8d4fdf3b
664: 00 27 84 82 add.f r4,r7,r10
668: ac 70 mov_s r5,0
66a: 19 28 86 0f 12 83 97 6e mpydu r6,r0,0x83126e97
672: 01 25 c5 02 adc r5,r5,r11
676: 00 24 04 82 add.f r4,r4,r8
67a: 00 25 45 02 add r5,r5,r9
67e: c0 25 65 00 add.c r5,r5,1
682: 00 26 06 81 add.f r6,r6,r4
686: 01 27 47 01 adc r7,r7,r5
68a: 51 0f 44 01 brlo r7,r5,6d8 <myfunc+0xac>
68e: 49 0d c0 01 breq r5,r7,6d4 <myfunc+0xa8>
692: 8c 70 mov_s r4,0
694: ac 70 mov_s r5,0
696: e0 42 mov_s r2,r7
698: 19 29 86 0f 12 83 97 6e mpydu r6,r1,0x83126e97
6a0: 00 22 82 81 add.f r2,r2,r6
6a4: 6c 70 mov_s r3,0
6a6: 01 23 c3 01 adc r3,r3,r7
6aa: 00 22 02 81 add.f r2,r2,r4
6ae: a0 73 add_s r3,r3,r5
6b0: c0 23 65 00 add.c r3,r3,1
6b4: 29 ba lsr_s r2,r2,9
6b6: 17 bb asl_s r3,r3,23
6b8: 65 7a or_s r2,r2,r3
6ba: 9a 22 0f 0a mpy r2,r2,0x3e8
6be: e0 7f j_s.d [blink]
6c0: 42 78 sub_s r0,r0,r2
6c2: e0 78 nop_s
6c4: 95 0e 85 f1 4f 8d 3a df brhs.nt 0x8d4fdf3a,r6,658 <myfunc+0x2c>
6cc: 0d 70 mov_s r8,0
6ce: 91 07 ef ff b.d 65c <myfunc+0x30>
6d2: 2d 70 mov_s r9,0
6d4: bf 0e 05 81 brhs.nt r6,r4,692 <myfunc+0x66>
6d8: 8c 70 mov_s r4,0
6da: bf 07 ef ff b.d 696 <myfunc+0x6a>
6de: ac 71 mov_s r5,1
--------------------->8------------------------
What you see here is a pretty straightforward conversion to assembly of the "run-time calculations".
Things to note:
[1] Only 5 multiplications are used. That's because we have a 32x32 multiplication unit
that returns a 64-bit result in a register pair.
[2] Indeed, lots of moves and additions happen here.
So my conclusion would be:
[1] The proposed implementation makes perfect sense because it already speeds up do_div()
significantly.
[2] The ability to substitute the "run-time calculations" on a per-arch basis would be awesome
because with a few lines of assembly another 2-4x improvement could be
achieved.
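For reference, the trick behind those mpydu instructions can be modeled in a few
lines of C (a model only, not the kernel code): dividing by 1000 becomes a
widening multiply by a precomputed scaled reciprocal, a bias addition, and a shift.

```c
#include <stdint.h>

/* Model of constant-division-by-multiplication for x / 1000.
 * 0x83126E978D4FDF3BULL == (2^73 - 392) / 1000; its two 32-bit halves
 * (0x83126E97 and 0x8D4FDF3B) are exactly the immediates seen in the
 * disassembly above. GCC's unsigned __int128 stands in here for what
 * the hardware does with several 32x32->64 multiplies and adds. */
static uint64_t div1000(uint64_t x)
{
	const uint64_t magic = 0x83126E978D4FDF3BULL;

	/* The "+ magic" term is the bias for having rounded the scaled
	 * reciprocal down; with it the result is exact for all 64-bit x
	 * (the error stays strictly below one quotient step). */
	return (uint64_t)((((unsigned __int128)x * magic) + magic) >> 73);
}
```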
-Alexey
* Re: [PATCH] __div64_32: implement division by multiplication for 32-bit arches
2015-10-30 12:40 ` Måns Rullgård
@ 2015-10-30 15:17 ` Nicolas Pitre
2015-10-30 15:54 ` Alexey Brodkin
0 siblings, 1 reply; 24+ messages in thread
From: Nicolas Pitre @ 2015-10-30 15:17 UTC (permalink / raw)
To: Måns Rullgård
Cc: Alexey Brodkin, linux-kernel, linux-snps-arc, Vineet Gupta,
Ingo Molnar, Stephen Hemminger, David S. Miller, Russell King
On Fri, 30 Oct 2015, Måns Rullgård wrote:
> Nicolas Pitre <nicolas.pitre@linaro.org> writes:
>
> > OK... I was intrigued, so I adapted my ARM code to the generic case,
> > including the overflow avoidance optimizations. Please have a look and
> > tell me how this works for you.
> >
> > If this patch is accepted upstream, then it could be possible to
> > abstract only the actual multiplication part with some architecture
> > specific assembly.
>
> Good idea.
Could you please provide a reviewed-by or acked-by tag?
Nicolas
* Re: [PATCH] __div64_32: implement division by multiplication for 32-bit arches
2015-10-30 15:17 ` Nicolas Pitre
@ 2015-10-30 15:54 ` Alexey Brodkin
2015-10-30 16:55 ` Nicolas Pitre
0 siblings, 1 reply; 24+ messages in thread
From: Alexey Brodkin @ 2015-10-30 15:54 UTC (permalink / raw)
To: nicolas.pitre@linaro.org
Cc: shemminger@linux-foundation.org, linux-kernel@vger.kernel.org,
Vineet.Gupta1@synopsys.com, linux-snps-arc@lists.infradead.org,
mans@mansr.com, rmk+kernel@arm.linux.org.uk, davem@davemloft.net,
mingo@elte.hu
Hi Nicolas,
On Fri, 2015-10-30 at 11:17 -0400, Nicolas Pitre wrote:
> On Fri, 30 Oct 2015, Måns Rullgård wrote:
>
> > Nicolas Pitre <nicolas.pitre@linaro.org> writes:
> >
> > > OK... I was intrigued, so I adapted my ARM code to the generic case,
> > > including the overflow avoidance optimizations. Please have a look and
> > > tell me how this works for you.
> > >
> > > If this patch is accepted upstream, then it could be possible to
> > > abstract only the actual multiplication part with some architecture
> > > specific assembly.
> >
> > Good idea.
>
> Could you please provide a reviewed-by or acked-by tag?
Sure!
Acked-by: Alexey Brodkin <abrodkin@synopsys.com>
BTW I thought about that optimization a bit more and now I think
we may even skip the addition of arch-specific assembly insertions.
That's because, as discussed many times, that kind of division should be
used as sparingly as possible; in other words, there should be
just a very few usages of it, especially in very frequently used code paths.
And in that case there might not be much benefit in having do_div()
even faster and smaller than the one we're about to get with your change.
-Alexey
* Re: [PATCH] __div64_32: implement division by multiplication for 32-bit arches
2015-10-30 15:54 ` Alexey Brodkin
@ 2015-10-30 16:55 ` Nicolas Pitre
2015-10-30 17:45 ` Måns Rullgård
0 siblings, 1 reply; 24+ messages in thread
From: Nicolas Pitre @ 2015-10-30 16:55 UTC (permalink / raw)
To: Alexey Brodkin
Cc: shemminger@linux-foundation.org, linux-kernel@vger.kernel.org,
Vineet.Gupta1@synopsys.com, linux-snps-arc@lists.infradead.org,
mans@mansr.com, rmk+kernel@arm.linux.org.uk, davem@davemloft.net,
mingo@elte.hu
On Fri, 30 Oct 2015, Alexey Brodkin wrote:
> Hi Nicolas,
>
> On Fri, 2015-10-30 at 11:17 -0400, Nicolas Pitre wrote:
> > On Fri, 30 Oct 2015, Måns Rullgård wrote:
> >
> > > Nicolas Pitre <nicolas.pitre@linaro.org> writes:
> > >
> > > > OK... I was intrigued, so I adapted my ARM code to the generic case,
> > > > including the overflow avoidance optimizations. Please have a look and
> > > > tell me how this works for you.
> > > >
> > > > If this patch is accepted upstream, then it could be possible to
> > > > abstract only the actual multiplication part with some architecture
> > > > specific assembly.
> > >
> > > Good idea.
> >
> > Could you please provide a reviewed-by or acked-by tag?
>
> Sure!
>
> Acked-by: Alexey Brodkin <abrodkin@synopsys.com>
>
> BTW I thought about that optimization a bit more and now I think
> we may even skip addition of arch-specific assembly insertions.
I'm going to do it anyway given that I already have it for ARM. It'll
be opt-in, so if your arch doesn't provide it then the current C
implementation will be used by default.
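For illustration only, such an opt-in hook could be structured like the sketch
below (hypothetical helper name; the actual patches may arrange it differently):
the generic header supplies a portable C version of the 64x64->high-64 multiply
at the heart of the algorithm, and an architecture overrides it by defining the
symbol before the generic code is seen.

```c
#include <stdint.h>

/* An arch header could provide its own assembly-backed version and
 * announce it with "#define __arch_umul64_hi __arch_umul64_hi"
 * (hypothetical name). Otherwise this generic C fallback is used. */
#ifndef __arch_umul64_hi
/* High 64 bits of a 64x64->128 multiply, built from four 32x32->64
 * partial products -- the kind of operation mpydu does on ARC. */
static inline uint64_t __arch_umul64_hi(uint64_t a, uint64_t b)
{
	uint64_t a_lo = (uint32_t)a, a_hi = a >> 32;
	uint64_t b_lo = (uint32_t)b, b_hi = b >> 32;
	uint64_t p0 = a_lo * b_lo;
	uint64_t p1 = a_lo * b_hi;
	uint64_t p2 = a_hi * b_lo;
	uint64_t p3 = a_hi * b_hi;
	uint64_t mid = p1 + (p0 >> 32);		/* cannot overflow */
	uint64_t mid2 = p2 + (uint32_t)mid;	/* cannot overflow */

	return p3 + (mid >> 32) + (mid2 >> 32);
}
#endif
```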
Nicolas
* Re: [PATCH] __div64_32: implement division by multiplication for 32-bit arches
2015-10-30 16:55 ` Nicolas Pitre
@ 2015-10-30 17:45 ` Måns Rullgård
2015-11-04 23:46 ` Nicolas Pitre
0 siblings, 1 reply; 24+ messages in thread
From: Måns Rullgård @ 2015-10-30 17:45 UTC (permalink / raw)
To: Nicolas Pitre
Cc: Alexey Brodkin, shemminger@linux-foundation.org,
linux-kernel@vger.kernel.org, Vineet.Gupta1@synopsys.com,
linux-snps-arc@lists.infradead.org, rmk+kernel@arm.linux.org.uk,
davem@davemloft.net, mingo@elte.hu
Nicolas Pitre <nicolas.pitre@linaro.org> writes:
> On Fri, 30 Oct 2015, Alexey Brodkin wrote:
>
>> Hi Nicolas,
>>
>> On Fri, 2015-10-30 at 11:17 -0400, Nicolas Pitre wrote:
>> > On Fri, 30 Oct 2015, Måns Rullgård wrote:
>> >
>> > > Nicolas Pitre <nicolas.pitre@linaro.org> writes:
>> > >
>> > > > OK... I was intrigued, so I adapted my ARM code to the generic case,
>> > > > including the overflow avoidance optimizations. Please have a look and
>> > > > tell me how this works for you.
>> > > >
>> > > > If this patch is accepted upstream, then it could be possible to
>> > > > abstract only the actual multiplication part with some architecture
>> > > > specific assembly.
>> > >
>> > > Good idea.
>> >
>> > Could you please provide a reviewed-by or acked-by tag?
>>
>> Sure!
>>
>> Acked-by: Alexey Brodkin <abrodkin@synopsys.com>
>>
>> BTW I thought about that optimization a bit more and now I think
>> we may even skip addition of arch-specific assembly insertions.
>
> I'm going to do it anyway given that I already have it for ARM. It'll
> be opt-in, so if your arch doesn't provide it then the current C
> implementation will be used by default.
Great. I'll try it out on MIPS once you've posted the patch.
--
Måns Rullgård
mans@mansr.com
* Re: [PATCH] __div64_32: implement division by multiplication for 32-bit arches
2015-10-30 17:45 ` Måns Rullgård
@ 2015-11-04 23:46 ` Nicolas Pitre
2015-11-04 23:48 ` Nicolas Pitre
2015-11-04 23:49 ` Måns Rullgård
0 siblings, 2 replies; 24+ messages in thread
From: Nicolas Pitre @ 2015-11-04 23:46 UTC (permalink / raw)
To: Måns Rullgård
Cc: Alexey Brodkin, shemminger@linux-foundation.org,
linux-kernel@vger.kernel.org, Vineet.Gupta1@synopsys.com,
linux-snps-arc@lists.infradead.org, rmk+kernel@arm.linux.org.uk,
davem@davemloft.net, mingo@elte.hu
On Fri, 30 Oct 2015, Måns Rullgård wrote:
> Nicolas Pitre <nicolas.pitre@linaro.org> writes:
>
> > I'm going to do it anyway given that I already have it for ARM. It'll
> > be opt-in, so if your arch doesn't provide it then the current C
> > implementation will be used by default.
>
> Great. I'll try it out on MIPS once you've posted the patch.
You should have seen the patches by now.
I've put them along with a bunch of do_div() usage fixes here:
http://git.linaro.org/people/nicolas.pitre/linux.git
Nicolas
* Re: [PATCH] __div64_32: implement division by multiplication for 32-bit arches
2015-11-04 23:46 ` Nicolas Pitre
@ 2015-11-04 23:48 ` Nicolas Pitre
2015-11-05 3:13 ` Vineet Gupta
2015-11-04 23:49 ` Måns Rullgård
1 sibling, 1 reply; 24+ messages in thread
From: Nicolas Pitre @ 2015-11-04 23:48 UTC (permalink / raw)
To: Måns Rullgård
Cc: Alexey Brodkin, shemminger@linux-foundation.org,
linux-kernel@vger.kernel.org, Vineet.Gupta1@synopsys.com,
linux-snps-arc@lists.infradead.org, rmk+kernel@arm.linux.org.uk,
davem@davemloft.net, mingo@elte.hu
On Wed, 4 Nov 2015, Nicolas Pitre wrote:
> On Fri, 30 Oct 2015, Måns Rullgård wrote:
>
> > Nicolas Pitre <nicolas.pitre@linaro.org> writes:
> >
> > > I'm going to do it anyway given that I already have it for ARM. It'll
> > > be opt-in, so if your arch doesn't provide it then the current C
> > > implementation will be used by default.
> >
> > Great. I'll try it out on MIPS once you've posted the patch.
>
> You should have seen the patches by now.
>
> I've put them along with a bunch of do_div() usage fixes here:
>
> http://git.linaro.org/people/nicolas.pitre/linux.git
More precisely:
http://git.linaro.org/people/nicolas.pitre/linux.git div64
Nicolas
* Re: [PATCH] __div64_32: implement division by multiplication for 32-bit arches
2015-11-04 23:46 ` Nicolas Pitre
2015-11-04 23:48 ` Nicolas Pitre
@ 2015-11-04 23:49 ` Måns Rullgård
1 sibling, 0 replies; 24+ messages in thread
From: Måns Rullgård @ 2015-11-04 23:49 UTC (permalink / raw)
To: Nicolas Pitre
Cc: Alexey Brodkin, shemminger@linux-foundation.org,
linux-kernel@vger.kernel.org, Vineet.Gupta1@synopsys.com,
linux-snps-arc@lists.infradead.org, rmk+kernel@arm.linux.org.uk,
davem@davemloft.net, mingo@elte.hu
Nicolas Pitre <nicolas.pitre@linaro.org> writes:
> On Fri, 30 Oct 2015, Måns Rullgård wrote:
>
>> Nicolas Pitre <nicolas.pitre@linaro.org> writes:
>>
>> > I'm going to do it anyway given that I already have it for ARM. It'll
>> > be opt-in, so if your arch doesn't provide it then the current C
>> > implementation will be used by default.
>>
>> Great. I'll try it out on MIPS once you've posted the patch.
>
> You should have seen the patches by now.
Yes, I've seen them. I've been busy with other things but hope to get to
it before the end of the week. At first glance, it looks good to me.
--
Måns Rullgård
mans@mansr.com
* Re: [PATCH] __div64_32: implement division by multiplication for 32-bit arches
2015-11-04 23:48 ` Nicolas Pitre
@ 2015-11-05 3:13 ` Vineet Gupta
2015-11-05 5:06 ` Nicolas Pitre
0 siblings, 1 reply; 24+ messages in thread
From: Vineet Gupta @ 2015-11-05 3:13 UTC (permalink / raw)
To: Nicolas Pitre, Måns Rullgård
Cc: Alexey Brodkin, shemminger@linux-foundation.org,
linux-kernel@vger.kernel.org, linux-snps-arc@lists.infradead.org,
rmk+kernel@arm.linux.org.uk, davem@davemloft.net, mingo@elte.hu
On Thursday 05 November 2015 05:18 AM, Nicolas Pitre wrote:
> On Wed, 4 Nov 2015, Nicolas Pitre wrote:
>
>> On Fri, 30 Oct 2015, Måns Rullgård wrote:
>>
>>> Nicolas Pitre <nicolas.pitre@linaro.org> writes:
>>>
>>>> I'm going to do it anyway given that I already have it for ARM. It'll
>>>> be opt-in, so if your arch doesn't provide it then the current C
>>>> implementation will be used by default.
>>> Great. I'll try it out on MIPS once you've posted the patch.
>> You should have seen the patches by now.
>>
>> I've put them along with a bunch of do_div() usage fixes here:
>>
>> http://git.linaro.org/people/nicolas.pitre/linux.git
> More precisely:
>
> http://git.linaro.org/people/nicolas.pitre/linux.git div64
Hi Nico,
While we are on the topic, I was wondering about another optimization in
this area.
The slowpath __div64_32() generates both the quotient and the remainder. The more
common use case in the kernel only cares about the quotient:
git grep "\sdo_div(" | wc -l
841
git grep "=\sdo_div(" | wc -l
116
Is it possible to optimize the code if the remainder is *not* needed explicitly? I
understand that the hand-rolled divide will still need some sort of running tally of
the remainder, but can the code be any better in this case? That way we could
introduce another API, do_div_norem(), and start proliferating it for the cases
where the remainder is not used.
Thx,
-Vineet
* Re: [PATCH] __div64_32: implement division by multiplication for 32-bit arches
2015-11-05 3:13 ` Vineet Gupta
@ 2015-11-05 5:06 ` Nicolas Pitre
0 siblings, 0 replies; 24+ messages in thread
From: Nicolas Pitre @ 2015-11-05 5:06 UTC (permalink / raw)
To: Vineet Gupta
Cc: Måns Rullgård, Alexey Brodkin,
shemminger@linux-foundation.org, linux-kernel@vger.kernel.org,
linux-snps-arc@lists.infradead.org, rmk+kernel@arm.linux.org.uk,
davem@davemloft.net, mingo@elte.hu
On Thu, 5 Nov 2015, Vineet Gupta wrote:
> On Thursday 05 November 2015 05:18 AM, Nicolas Pitre wrote:
> > On Wed, 4 Nov 2015, Nicolas Pitre wrote:
> >
> >> On Fri, 30 Oct 2015, Måns Rullgård wrote:
> >>
> >>> Nicolas Pitre <nicolas.pitre@linaro.org> writes:
> >>>
> >>>> I'm going to do it anyway given that I already have it for ARM. It'll
> >>>> be opt-in, so if your arch doesn't provide it then the current C
> >>>> implementation will be used by default.
> >>> Great. I'll try it out on MIPS once you've posted the patch.
> >> You should have seen the patches by now.
> >>
> >> I've put them along with a bunch of do_div() usage fixes here:
> >>
> >> http://git.linaro.org/people/nicolas.pitre/linux.git
> > More precisely:
> >
> > http://git.linaro.org/people/nicolas.pitre/linux.git div64
>
> Hi Nico,
>
> While we are current on the topic I was wondering about another optimization in
> this area.
> The slowpath __div64_32() generates both quotient and remainder. The more general
> use case in kernel only cares about quotient.
>
> git grep "\sdo_div(" | wc -l
> 841
> git grep "=\sdo_div(" | wc -l
> 116
>
> Is it possible to optimize the code, if remainder was *not* needed explicitly. I
> understand that the hand divide will still need some sort of running tally of
> remainder but can the code be any better in this case. That way we can introduce
> another API do_div_norem() and start proliferating it for the cases where
> remainder is not used.
I don't think you'll be able to optimize the code much. If you look at
the division loop, you always have the current remainder to process as
you say, so when you can't subtract from the remainder anymore you
simply return that value. And on ARM we simply tell the calling code
which register contains the remainder if it wants it. Therefore
on ARM the code would be exactly the same in either case.
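A minimal shift-and-subtract long division (a sketch, not the actual
__div64_32) illustrates the point: the remainder is simply what is left in
the working value when the loop ends, so a remainder-free variant would have
nothing to skip.

```c
#include <stdint.h>

/* Classic bit-by-bit long division: 64-bit dividend, 32-bit divisor.
 * The quotient is built one bit at a time while 'rem' keeps the running
 * remainder; when the loop finishes, rem *is* the remainder, at no
 * extra cost beyond computing the quotient. */
static uint32_t div64_32_sketch(uint64_t *n, uint32_t base)
{
	uint64_t rem = 0, quot = 0;
	int i;

	for (i = 63; i >= 0; i--) {
		rem = (rem << 1) | ((*n >> i) & 1);	/* bring down next bit */
		quot <<= 1;
		if (rem >= base) {			/* trial subtraction */
			rem -= base;
			quot |= 1;
		}
	}
	*n = quot;		/* do_div() convention: quotient in place */
	return (uint32_t)rem;	/* ... and the remainder returned */
}
```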
Nicolas
end of thread, other threads: [~2015-11-05 5:06 UTC | newest]
Thread overview: 24+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2015-10-28 22:47 [PATCH] __div64_32: implement division by multiplication for 32-bit arches Alexey Brodkin
2015-10-28 23:32 ` Nicolas Pitre
2015-10-29 7:34 ` Alexey Brodkin
2015-10-30 1:26 ` Nicolas Pitre
2015-10-30 5:41 ` Vineet Gupta
2015-10-30 12:41 ` Måns Rullgård
2015-10-30 12:40 ` Måns Rullgård
2015-10-30 15:17 ` Nicolas Pitre
2015-10-30 15:54 ` Alexey Brodkin
2015-10-30 16:55 ` Nicolas Pitre
2015-10-30 17:45 ` Måns Rullgård
2015-11-04 23:46 ` Nicolas Pitre
2015-11-04 23:48 ` Nicolas Pitre
2015-11-05 3:13 ` Vineet Gupta
2015-11-05 5:06 ` Nicolas Pitre
2015-11-04 23:49 ` Måns Rullgård
2015-10-30 14:28 ` Alexey Brodkin
2015-10-29 0:36 ` kbuild test robot
2015-10-29 12:52 ` Måns Rullgård
2015-10-29 13:05 ` Alexey Brodkin
2015-10-29 13:37 ` Måns Rullgård
2015-10-29 13:31 ` Russell King - ARM Linux
2015-10-29 14:32 ` Alexey Brodkin
2015-10-29 17:09 ` Randy Dunlap