* [PATCH] __div64_32: implement division by multiplication for 32-bit arches
@ 2015-10-28 22:47 Alexey Brodkin
2015-10-28 23:32 ` Nicolas Pitre
` (4 more replies)
0 siblings, 5 replies; 24+ messages in thread
From: Alexey Brodkin @ 2015-10-28 22:47 UTC (permalink / raw)
To: linux-kernel
Cc: Alexey Brodkin, linux-snps-arc, Vineet Gupta, Ingo Molnar,
Stephen Hemminger, David S. Miller, Nicolas Pitre, Russell King
The existing default implementation of __div64_32() for 32-bit arches unfolds
into a huge routine with lots of arithmetic (+, -, *), all of it in loops.
That leads to obvious performance degradation if do_div() is used frequently.
A good example is heavy TCP/IP traffic.
That's what I'm getting with perf out of iperf3:
-------------->8--------------
30.05% iperf3 [kernel.kallsyms] [k] copy_from_iter
11.77% iperf3 [kernel.kallsyms] [k] __div64_32
5.44% iperf3 [kernel.kallsyms] [k] memset
5.32% iperf3 [kernel.kallsyms] [k] stmmac_xmit
2.70% iperf3 [kernel.kallsyms] [k] skb_segment
2.56% iperf3 [kernel.kallsyms] [k] tcp_ack
-------------->8--------------
do_div() here is mostly used in skb_mstamp_get() to convert nanoseconds
received from local_clock() to the microseconds used in timestamps.
BTW the conversion itself is as simple as "/= 1000".
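For reference, here is a userspace sketch of that hot-path conversion
(do_div_sketch() is my own stand-in for the kernel's do_div() macro, which
divides in place and returns the remainder; not the kernel code itself):

```c
#include <stdint.h>

/* Userspace stand-in for the kernel's do_div() macro: divide the
 * 64-bit value in place by a 32-bit divisor, return the remainder. */
static uint32_t do_div_sketch(uint64_t *n, uint32_t base)
{
	uint32_t rem = (uint32_t)(*n % base);

	*n /= base;
	return rem;
}

/* skb_mstamp_get()-style conversion: nanoseconds -> microseconds */
static uint64_t ns_to_us(uint64_t ns)
{
	do_div_sketch(&ns, 1000);	/* the "/= 1000" mentioned above */
	return ns;
}
```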
Fortunately we already have a much better __div64_32() for 32-bit ARM.
There, in case of division by a constant, the preprocessor calculates a
so-called "magic number" which is later used in multiplications instead
of divisions. It's really nice and close to optimal, but obviously it
works only for ARM because ARM assembly is involved.
Now why don't we extend the same approach to all other 32-bit arches,
with the multiplication part implemented in pure C? With a good compiler
the resulting assembly will be quite close to hand-written assembly.
And this change implements exactly that.
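To illustrate the idea (my own minimal sketch, not the patch itself): for a
32-bit dividend and the constant divisor 1000 from above, one multiply and
shift with a precomputed reciprocal replaces the division. The constant
0x10624DD3 is ceil(2^38 / 1000); the patch applies the same trick to full
64-bit dividends, which is what needs the partial products in the diff below:

```c
#include <stdint.h>

/* Division by a constant via a precomputed reciprocal:
 * n / 1000 == (n * m) >> 38 for all 32-bit n,
 * where m = ceil(2^38 / 1000) = 0x10624DD3. */
static uint32_t div1000_by_mul(uint32_t n)
{
	return (uint32_t)(((uint64_t)n * 0x10624DD3u) >> 38);
}
```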
But there's at least one problem which I don't know how to solve.
The preprocessor magic only happens if __div64_32() is inlined (that's
obvious - the preprocessor has to know whether the divisor is constant
or not). But __div64_32() is already marked as a weak function (which in
turn is required to allow some architectures to provide their own optimal
implementations). I.e. adding "inline" to __div64_32() is not an option.
So I do want to hear opinions on how to proceed with this patch.
Indeed, there's the simplest solution - use this implementation only in
my architecture of preference (read: ARC) - but IMHO this change may
benefit other architectures as well.
Signed-off-by: Alexey Brodkin <abrodkin@synopsys.com>
Cc: linux-snps-arc@lists.infradead.org
Cc: Vineet Gupta <vgupta@synopsys.com>
Cc: Ingo Molnar <mingo@elte.hu>
Cc: Stephen Hemminger <shemminger@linux-foundation.org>
Cc: David S. Miller <davem@davemloft.net>
Cc: Nicolas Pitre <nico@cam.org>
Cc: Russell King <rmk+kernel@arm.linux.org.uk>
---
lib/div64.c | 153 ++++++++++++++++++++++++++++++++++++++++++++++++++----------
1 file changed, 128 insertions(+), 25 deletions(-)
diff --git a/lib/div64.c b/lib/div64.c
index 62a698a..3055328 100644
--- a/lib/div64.c
+++ b/lib/div64.c
@@ -23,37 +23,140 @@
/* Not needed on 64bit architectures */
#if BITS_PER_LONG == 32
+/* our own fls implementation to make sure constant propagation is fine */
+inline int __div64_fls(int bits)
+{
+ unsigned int __left = bits, __nr = 0;
+
+ if (__left & 0xffff0000)
+ __nr += 16, __left >>= 16;
+
+ if (__left & 0x0000ff00)
+ __nr += 8, __left >>= 8;
+
+ if (__left & 0x000000f0)
+ __nr += 4, __left >>= 4;
+
+ if (__left & 0x0000000c)
+ __nr += 2, __left >>= 2;
+
+ if (__left & 0x00000002)
+ __nr += 1;
+
+ return __nr;
+}
+
+/*
+ * If the divisor happens to be constant, we determine the appropriate
+ * inverse at compile time to turn the division into a few inline
+ * multiplications instead which is much faster.
+ */
uint32_t __attribute__((weak)) __div64_32(uint64_t *n, uint32_t base)
{
- uint64_t rem = *n;
- uint64_t b = base;
- uint64_t res, d = 1;
- uint32_t high = rem >> 32;
-
- /* Reduce the thing a bit first */
- res = 0;
- if (high >= base) {
- high /= base;
- res = (uint64_t) high << 32;
- rem -= (uint64_t) (high*base) << 32;
- }
+ unsigned int __r, __b = base;
- while ((int64_t)b > 0 && b < rem) {
- b = b+b;
- d = d+d;
- }
+ if (!__builtin_constant_p(__b) || __b == 0) {
+ /* non-constant divisor (or zero): slow path */
+ uint64_t rem = *n;
+ uint64_t b = base;
+ uint64_t res, d = 1;
+ uint32_t high = rem >> 32;
+
+ /* Reduce the thing a bit first */
+ res = 0;
+ if (high >= base) {
+ high /= base;
+ res = (uint64_t) high << 32;
+ rem -= (uint64_t) (high*base) << 32;
+ }
+
+ while ((int64_t)b > 0 && b < rem) {
+ b = b+b;
+ d = d+d;
+ }
+
+ do {
+ if (rem >= b) {
+ rem -= b;
+ res += d;
+ }
+ b >>= 1;
+ d >>= 1;
+ } while (d);
- do {
- if (rem >= b) {
- rem -= b;
- res += d;
+ *n = res;
+ __r = rem;
+ } else if ((__b & (__b - 1)) == 0) {
+ /*
+ * Trivial: __b is a constant power of 2.
+ * gcc does the right thing with this code.
+ * Even though the generic path below handles
+ * this case as well, we keep it visually
+ * separate. Only one of these branches will
+ * survive the pre-processor stage anyway,
+ * so let's leave it here.
+ */
+ __r = *n;
+ __r &= (__b - 1);
+ *n /= __b;
+ } else {
+ /* Start of preprocessor calculations */
+
+ /*
+ * Multiply by inverse of __b: *n/b = *n*(p/b)/p
+ * We rely on the fact that most of this code gets
+ * optimized away at compile time due to constant
+ * propagation and only a couple inline assembly
+ * instructions should remain. Better avoid any
+ * code construct that might prevent that.
+ */
+ unsigned long long __res, __x, __t, __m, __n = *n;
+ unsigned int __p;
+ /* preserve low part of *n for remainder computation */
+ __r = __n;
+ /* determine number of bits to represent __b */
+ __p = 1 << __div64_fls(__b);
+ /* compute __m = ((__p << 64) + __b - 1) / __b */
+ __m = (~0ULL / __b) * __p;
+ __m += (((~0ULL % __b + 1) * __p) + __b - 1) / __b;
+ /* compute __res = __m*(~0ULL/__b*__b-1)/(__p << 64) */
+ __x = ~0ULL / __b * __b - 1;
+ __res = (__m & 0xffffffff) * (__x & 0xffffffff);
+ __res >>= 32;
+ __res += (__m & 0xffffffff) * (__x >> 32);
+ __t = __res;
+ __res += (__x & 0xffffffff) * (__m >> 32);
+ __t = (__res < __t) ? (1ULL << 32) : 0;
+ __res = (__res >> 32) + __t;
+ __res += (__m >> 32) * (__x >> 32);
+ __res /= __p;
+ /* End of preprocessor calculations */
+
+ /* Start of run-time calculations */
+ __res = (unsigned int)__m * (unsigned int)__n;
+ __res >>= 32;
+ __res += (unsigned int)__m * (__n >> 32);
+ __t = __res;
+ __res += (unsigned int)__n * (__m >> 32);
+ __t = (__res < __t) ? (1ULL << 32) : 0;
+ __res = (__res >> 32) + __t;
+ __res += (__m >> 32) * (__n >> 32);
+ __res /= __p;
+
+ /*
+ * The remainder can be computed with 32-bit regs
+ * only, and gcc is good at that.
+ */
+ {
+ unsigned int __res0 = __res;
+ unsigned int __b0 = __b;
+
+ __r -= __res0 * __b0;
}
- b >>= 1;
- d >>= 1;
- } while (d);
+ /* End of run-time calculations */
- *n = res;
- return rem;
+ *n = __res;
+ }
+ return __r;
}
EXPORT_SYMBOL(__div64_32);
--
2.4.3
^ permalink raw reply related [flat|nested] 24+ messages in thread
* Re: [PATCH] __div64_32: implement division by multiplication for 32-bit arches
2015-10-28 22:47 [PATCH] __div64_32: implement division by multiplication for 32-bit arches Alexey Brodkin
@ 2015-10-28 23:32 ` Nicolas Pitre
2015-10-29 7:34 ` Alexey Brodkin
2015-10-30 1:26 ` Nicolas Pitre
2015-10-29 0:36 ` kbuild test robot
` (3 subsequent siblings)
4 siblings, 2 replies; 24+ messages in thread
From: Nicolas Pitre @ 2015-10-28 23:32 UTC (permalink / raw)
To: Alexey Brodkin
Cc: linux-kernel, linux-snps-arc, Vineet Gupta, Ingo Molnar,
Stephen Hemminger, David S. Miller, Russell King
On Thu, 29 Oct 2015, Alexey Brodkin wrote:
> Fortunately we already have much better __div64_32() for 32-bit ARM.
> There in case of division by constant preprocessor calculates so-called
> "magic number" which is later used in multiplications instead of divisions.
It's not magic, it is science. :-)
> It's really nice and very optimal but obviously works only for ARM
> because ARM assembly is involved.
>
> Now why don't we extend the same approach to all other 32-bit arches
> with multiplication part implemented in pure C. With good compiler
> resulting assembly will be quite close to manually written assembly.
You appear to have left out optimizations where there is no overflow to
carry. That, too, can be determined at compile time.
> But there's at least 1 problem which I don't know how to solve.
> Preprocessor magic only happens if __div64_32() is inlined (that's
> obvious - preprocessor has to know if divider is constant or not).
>
> But __div64_32() is already marked as weak function (which in its turn
> is required to allow some architectures to provide its own optimal
> implementations). I.e. addition of "inline" for __div64_32() is not an
> option.
You can't inline __div64_32(). It should remain as is and used only for
the slow path.
For the constant based optimization to work, you need to modify do_div()
in include/asm-generic/div64.h directly.
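A rough userspace sketch of that shape (names here are hypothetical; the
real macro lives in include/asm-generic/div64.h and uses a GNU statement
expression the same way, and slow_div64_32() stands in for the weak
out-of-line __div64_32()):

```c
#include <stdint.h>

/* Out-of-line slow path; stands in for the weak __div64_32(). */
static uint32_t slow_div64_32(uint64_t *n, uint32_t base)
{
	uint32_t rem = (uint32_t)(*n % base);

	*n /= base;
	return rem;
}

/* The dispatch has to sit in the macro expanded at every call site:
 * only there can __builtin_constant_p() see the divisor. This sketch
 * handles only the constant power-of-2 case; a full version would add
 * the reciprocal-multiply case for other constant divisors. */
#define do_div_dispatch(n, base) ({				\
	uint32_t __base = (base), __rem;			\
	if (__builtin_constant_p(__base) && __base != 0 &&	\
	    (__base & (__base - 1)) == 0) {			\
		/* constant power of 2: mask + shift */		\
		__rem = (n) & (__base - 1);			\
		(n) >>= __builtin_ctz(__base);			\
	} else {						\
		/* everything else: out-of-line slow path */	\
		__rem = slow_div64_32(&(n), __base);		\
	}							\
	__rem;							\
})
```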
> So I do want to hear opinions on how to proceed with that patch.
> Indeed there's the simplest solution - use this implementation only in
> my architecture of preference (read ARC) but IMHO this change may
> benefit other architectures as well.
>
> Signed-off-by: Alexey Brodkin <abrodkin@synopsys.com>
> Cc: linux-snps-arc@lists.infradead.org
> Cc: Vineet Gupta <vgupta@synopsys.com>
> Cc: Ingo Molnar <mingo@elte.hu>
> Cc: Stephen Hemminger <shemminger@linux-foundation.org>
> Cc: David S. Miller <davem@davemloft.net>
> Cc: Nicolas Pitre <nico@cam.org>
This email address has been unused for the last 7 years. Please update
your reference.
> Cc: Russell King <rmk+kernel@arm.linux.org.uk>
> ---
> lib/div64.c | 153 ++++++++++++++++++++++++++++++++++++++++++++++++++----------
> 1 file changed, 128 insertions(+), 25 deletions(-)
>
> diff --git a/lib/div64.c b/lib/div64.c
> index 62a698a..3055328 100644
> --- a/lib/div64.c
> +++ b/lib/div64.c
> @@ -23,37 +23,140 @@
> /* Not needed on 64bit architectures */
> #if BITS_PER_LONG == 32
>
> +/* our own fls implementation to make sure constant propagation is fine */
> +inline int __div64_fls(int bits)
> +{
> + unsigned int __left = bits, __nr = 0;
> +
> + if (__left & 0xffff0000)
> + __nr += 16, __left >>= 16;
> +
> + if (__left & 0x0000ff00)
> + __nr += 8, __left >>= 8;
> +
> + if (__left & 0x000000f0)
> + __nr += 4, __left >>= 4;
> +
> + if (__left & 0x0000000c)
> + __nr += 2, __left >>= 2;
> +
> + if (__left & 0x00000002)
> + __nr += 1;
> +
> + return __nr;
> +}
The regular fls implementation should already give you a constant result
if provided with a constant input. To be sure you could use:
__p = 1 << __fls(__b);
BUILD_BUG_ON(!__builtin_constant_p(__p));
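Indeed, a __builtin_clz-based helper (which is how many arches implement
__fls()) folds to a constant just as well; a hypothetical userspace sketch:

```c
/* 0-based index of the highest set bit, same semantics as the
 * kernel's __fls() for nonzero 32-bit values; gcc folds this to
 * a constant whenever x is one. */
static inline int __fls_sketch(unsigned int x)
{
	return 31 - __builtin_clz(x);
}
```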
> +/*
> + * If the divisor happens to be constant, we determine the appropriate
> + * inverse at compile time to turn the division into a few inline
> + * multiplications instead which is much faster.
> + */
> uint32_t __attribute__((weak)) __div64_32(uint64_t *n, uint32_t base)
> {
> - uint64_t rem = *n;
> - uint64_t b = base;
> - uint64_t res, d = 1;
> - uint32_t high = rem >> 32;
> -
> - /* Reduce the thing a bit first */
> - res = 0;
> - if (high >= base) {
> - high /= base;
> - res = (uint64_t) high << 32;
> - rem -= (uint64_t) (high*base) << 32;
> - }
> + unsigned int __r, __b = base;
>
> - while ((int64_t)b > 0 && b < rem) {
> - b = b+b;
> - d = d+d;
> - }
> + if (!__builtin_constant_p(__b) || __b == 0) {
> + /* non-constant divisor (or zero): slow path */
> + uint64_t rem = *n;
> + uint64_t b = base;
> + uint64_t res, d = 1;
> + uint32_t high = rem >> 32;
> +
> + /* Reduce the thing a bit first */
> + res = 0;
> + if (high >= base) {
> + high /= base;
> + res = (uint64_t) high << 32;
> + rem -= (uint64_t) (high*base) << 32;
> + }
> +
> + while ((int64_t)b > 0 && b < rem) {
> + b = b+b;
> + d = d+d;
> + }
> +
> + do {
> + if (rem >= b) {
> + rem -= b;
> + res += d;
> + }
> + b >>= 1;
> + d >>= 1;
> + } while (d);
>
> - do {
> - if (rem >= b) {
> - rem -= b;
> - res += d;
> + *n = res;
> + __r = rem;
> + } else if ((__b & (__b - 1)) == 0) {
> + /*
> + * Trivial: __b is a constant power of 2.
> + * gcc does the right thing with this code.
> + * Even though the generic path below handles
> + * this case as well, we keep it visually
> + * separate. Only one of these branches will
> + * survive the pre-processor stage anyway,
> + * so let's leave it here.
> + */
> + __r = *n;
> + __r &= (__b - 1);
> + *n /= __b;
> + } else {
> + /* Start of preprocessor calculations */
> +
> + /*
> + * Multiply by inverse of __b: *n/b = *n*(p/b)/p
> + * We rely on the fact that most of this code gets
> + * optimized away at compile time due to constant
> + * propagation and only a couple inline assembly
> + * instructions should remain. Better avoid any
> + * code construct that might prevent that.
> + */
> + unsigned long long __res, __x, __t, __m, __n = *n;
> + unsigned int __p;
> + /* preserve low part of *n for remainder computation */
> + __r = __n;
> + /* determine number of bits to represent __b */
> + __p = 1 << __div64_fls(__b);
> + /* compute __m = ((__p << 64) + __b - 1) / __b */
> + __m = (~0ULL / __b) * __p;
> + __m += (((~0ULL % __b + 1) * __p) + __b - 1) / __b;
> + /* compute __res = __m*(~0ULL/__b*__b-1)/(__p << 64) */
> + __x = ~0ULL / __b * __b - 1;
> + __res = (__m & 0xffffffff) * (__x & 0xffffffff);
> + __res >>= 32;
> + __res += (__m & 0xffffffff) * (__x >> 32);
> + __t = __res;
> + __res += (__x & 0xffffffff) * (__m >> 32);
> + __t = (__res < __t) ? (1ULL << 32) : 0;
> + __res = (__res >> 32) + __t;
> + __res += (__m >> 32) * (__x >> 32);
> + __res /= __p;
> + /* End of preprocessor calculations */
> +
> + /* Start of run-time calculations */
> + __res = (unsigned int)__m * (unsigned int)__n;
> + __res >>= 32;
> + __res += (unsigned int)__m * (__n >> 32);
> + __t = __res;
> + __res += (unsigned int)__n * (__m >> 32);
> + __t = (__res < __t) ? (1ULL << 32) : 0;
> + __res = (__res >> 32) + __t;
> + __res += (__m >> 32) * (__n >> 32);
> + __res /= __p;
> +
> + /*
> + * The remainder can be computed with 32-bit regs
> + * only, and gcc is good at that.
> + */
> + {
> + unsigned int __res0 = __res;
> + unsigned int __b0 = __b;
> +
> + __r -= __res0 * __b0;
> }
> - b >>= 1;
> - d >>= 1;
> - } while (d);
> + /* End of run-time calculations */
>
> - *n = res;
> - return rem;
> + *n = __res;
> + }
> + return __r;
> }
>
> EXPORT_SYMBOL(__div64_32);
> --
> 2.4.3
>
> --
> To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
> the body of a message to majordomo@vger.kernel.org
> More majordomo info at http://vger.kernel.org/majordomo-info.html
> Please read the FAQ at http://www.tux.org/lkml/
>
>
^ permalink raw reply [flat|nested] 24+ messages in thread
* Re: [PATCH] __div64_32: implement division by multiplication for 32-bit arches
2015-10-28 22:47 [PATCH] __div64_32: implement division by multiplication for 32-bit arches Alexey Brodkin
2015-10-28 23:32 ` Nicolas Pitre
@ 2015-10-29 0:36 ` kbuild test robot
2015-10-29 12:52 ` Måns Rullgård
` (2 subsequent siblings)
4 siblings, 0 replies; 24+ messages in thread
From: kbuild test robot @ 2015-10-29 0:36 UTC (permalink / raw)
To: Alexey Brodkin
Cc: kbuild-all, linux-kernel, Alexey Brodkin, linux-snps-arc,
Vineet Gupta, Ingo Molnar, Stephen Hemminger, David S. Miller,
Nicolas Pitre, Russell King
Hi Alexey,
[auto build test ERROR on v4.3-rc7 -- if it's inappropriate base, please suggest rules for selecting the more suitable base]
url: https://github.com/0day-ci/linux/commits/Alexey-Brodkin/__div64_32-implement-division-by-multiplication-for-32-bit-arches/20151029-065010
config: arm-mini2440_defconfig (attached as .config)
reproduce:
wget https://git.kernel.org/cgit/linux/kernel/git/wfg/lkp-tests.git/plain/sbin/make.cross -O ~/bin/make.cross
chmod +x ~/bin/make.cross
# save the attached .config to linux build tree
make.cross ARCH=arm
All error/warnings (new ones prefixed by >>):
In file included from include/linux/kernel.h:136:0,
from lib/div64.c:20:
>> arch/arm/include/asm/div64.h:215:2: error: expected identifier or '(' before '{' token
({ \
^
>> lib/div64.c:27:12: note: in expansion of macro '__div64_fls'
inline int __div64_fls(int bits)
^
vim +/__div64_fls +27 lib/div64.c
14 * Code generated for this function might be very inefficient
15 * for some CPUs. __div64_32() can be overridden by linking arch-specific
16 * assembly versions such as arch/ppc/lib/div64.S and arch/sh/lib/div64.S.
17 */
18
19 #include <linux/export.h>
> 20 #include <linux/kernel.h>
21 #include <linux/math64.h>
22
23 /* Not needed on 64bit architectures */
24 #if BITS_PER_LONG == 32
25
26 /* our own fls implementation to make sure constant propagation is fine */
> 27 inline int __div64_fls(int bits)
28 {
29 unsigned int __left = bits, __nr = 0;
30
---
0-DAY kernel test infrastructure Open Source Technology Center
https://lists.01.org/pipermail/kbuild-all Intel Corporation
[-- Attachment #2: .config.gz --]
[-- Type: application/octet-stream, Size: 19049 bytes --]
^ permalink raw reply [flat|nested] 24+ messages in thread
* Re: [PATCH] __div64_32: implement division by multiplication for 32-bit arches
2015-10-28 23:32 ` Nicolas Pitre
@ 2015-10-29 7:34 ` Alexey Brodkin
2015-10-30 1:26 ` Nicolas Pitre
1 sibling, 0 replies; 24+ messages in thread
From: Alexey Brodkin @ 2015-10-29 7:34 UTC (permalink / raw)
To: nicolas.pitre@linaro.org
Cc: rmk+kernel@arm.linux.org.uk, linux-kernel@vger.kernel.org,
Vineet.Gupta1@synopsys.com, shemminger@linux-foundation.org,
mingo@elte.hu, linux-snps-arc@lists.infradead.org,
davem@davemloft.net
Hi Nicolas,
On Wed, 2015-10-28 at 19:32 -0400, Nicolas Pitre wrote:
> On Thu, 29 Oct 2015, Alexey Brodkin wrote:
>
> > Fortunately we already have much better __div64_32() for 32-bit ARM.
> > There in case of division by constant preprocessor calculates so-called
> > "magic number" which is later used in multiplications instead of divisions.
>
> It's not magic, it is science. :-)
Indeed, but I was under the impression that's what people call that value
in this particular case, so to me it looks appropriate here.
> > It's really nice and very optimal but obviously works only for ARM
> > because ARM assembly is involved.
> >
> > Now why don't we extend the same approach to all other 32-bit arches
> > with multiplication part implemented in pure C. With good compiler
> > resulting assembly will be quite close to manually written assembly.
>
> You appear to have left out optimizations where there is no overflow to
> carry. That, too, can be determined at compile time.
That might be the case - let me look into it a bit more.
But that was not the biggest problem. I actually wanted to send this
as an RFC, but due to a last-minute change I ran "git format-patch -1"
and forgot to change the subject prefix from PATCH to RFC.
> > But there's at least 1 problem which I don't know how to solve.
> > Preprocessor magic only happens if __div64_32() is inlined (that's
> > obvious - preprocessor has to know if divider is constant or not).
> >
> > But __div64_32() is already marked as weak function (which in its turn
> > is required to allow some architectures to provide its own optimal
> > implementations). I.e. addition of "inline" for __div64_32() is not an
> > option.
>
> You can't inline __div64_32(). It should remain as is and used only for
> the slow path.
>
> For the constant based optimization to work, you need to modify do_div()
> in include/asm-generic/div64.h directly.
I thought about that, but if I replace the existing implementation of
do_div() with the one proposed here, some arches like SH and MIPS won't
be able to use their own __div64_32() in do_div() any longer.
So how do we deal with that then?
> > So I do want to hear opinions on how to proceed with that patch.
> > Indeed there's the simplest solution - use this implementation only in
> > my architecture of preference (read ARC) but IMHO this change may
> > benefit other architectures as well.
> >
> > Signed-off-by: Alexey Brodkin <abrodkin@synopsys.com>
> > Cc: linux-snps-arc@lists.infradead.org
> > Cc: Vineet Gupta <vgupta@synopsys.com>
> > Cc: Ingo Molnar <mingo@elte.hu>
> > Cc: Stephen Hemminger <shemminger@linux-foundation.org>
> > Cc: David S. Miller <davem@davemloft.net>
> > Cc: Nicolas Pitre <nico@cam.org>
>
> This email address has been unused for the last 7 years. Please update
> your reference.
My bad - I blindly took that email address from your prehistoric patch
"[ARM] 3611/4: optimize do_div() when divisor is constant".
-Alexey
^ permalink raw reply [flat|nested] 24+ messages in thread
* Re: [PATCH] __div64_32: implement division by multiplication for 32-bit arches
2015-10-28 22:47 [PATCH] __div64_32: implement division by multiplication for 32-bit arches Alexey Brodkin
2015-10-28 23:32 ` Nicolas Pitre
2015-10-29 0:36 ` kbuild test robot
@ 2015-10-29 12:52 ` Måns Rullgård
2015-10-29 13:05 ` Alexey Brodkin
2015-10-29 13:31 ` Russell King - ARM Linux
2015-10-29 17:09 ` Randy Dunlap
4 siblings, 1 reply; 24+ messages in thread
From: Måns Rullgård @ 2015-10-29 12:52 UTC (permalink / raw)
To: Alexey Brodkin
Cc: linux-kernel, linux-snps-arc, Vineet Gupta, Ingo Molnar,
Stephen Hemminger, David S. Miller, Nicolas Pitre, Russell King
Alexey Brodkin <Alexey.Brodkin@synopsys.com> writes:
> Existing default implementation of __div64_32() for 32-bit arches unfolds
> into huge routine with tons of arithmetics like +, -, * and all of them
> in loops. That leads to obvious performance degradation if do_div() is
> frequently used.
>
> Good example is extensive TCP/IP traffic.
> That's what I'm getting with perf out of iperf3:
> -------------->8--------------
> 30.05% iperf3 [kernel.kallsyms] [k] copy_from_iter
> 11.77% iperf3 [kernel.kallsyms] [k] __div64_32
> 5.44% iperf3 [kernel.kallsyms] [k] memset
> 5.32% iperf3 [kernel.kallsyms] [k] stmmac_xmit
> 2.70% iperf3 [kernel.kallsyms] [k] skb_segment
> 2.56% iperf3 [kernel.kallsyms] [k] tcp_ack
> -------------->8--------------
>
> do_div() here is mostly used in skb_mstamp_get() to convert nanoseconds
> received from local_clock() to microseconds used in timestamp.
> BTW conversion itself is as simple as "/=1000".
>
> Fortunately we already have much better __div64_32() for 32-bit ARM.
> There in case of division by constant preprocessor calculates so-called
> "magic number" which is later used in multiplications instead of divisions.
> It's really nice and very optimal but obviously works only for ARM
> because ARM assembly is involved.
>
> Now why don't we extend the same approach to all other 32-bit arches
> with multiplication part implemented in pure C. With good compiler
> resulting assembly will be quite close to manually written assembly.
>
> And that change implements that.
>
> But there's at least 1 problem which I don't know how to solve.
> Preprocessor magic only happens if __div64_32() is inlined (that's
> obvious - preprocessor has to know if divider is constant or not).
>
> But __div64_32() is already marked as weak function (which in its turn
> is required to allow some architectures to provide its own optimal
> implementations). I.e. addition of "inline" for __div64_32() is not an
> option.
>
> So I do want to hear opinions on how to proceed with that patch.
> Indeed there's the simplest solution - use this implementation only in
> my architecture of preference (read ARC) but IMHO this change may
> benefit other architectures as well.
I tried something similar for MIPS a while ago after noticing a similar
perf report. Adapting Nico's ARM code gave some nice speedups, but only
when I used MIPS assembly for the long multiplies. Apparently gcc is
still too stupid to do the sane thing.
--
Måns Rullgård
mans@mansr.com
^ permalink raw reply [flat|nested] 24+ messages in thread
* Re: [PATCH] __div64_32: implement division by multiplication for 32-bit arches
2015-10-29 12:52 ` Måns Rullgård
@ 2015-10-29 13:05 ` Alexey Brodkin
2015-10-29 13:37 ` Måns Rullgård
0 siblings, 1 reply; 24+ messages in thread
From: Alexey Brodkin @ 2015-10-29 13:05 UTC (permalink / raw)
To: mans@mansr.com
Cc: shemminger@linux-foundation.org, linux-kernel@vger.kernel.org,
Vineet.Gupta1@synopsys.com, linux-snps-arc@lists.infradead.org,
rmk+kernel@arm.linux.org.uk, davem@davemloft.net, mingo@elte.hu,
nico@cam.org
Hi Måns,
On Thu, 2015-10-29 at 12:52 +0000, Måns Rullgård wrote:
> Alexey Brodkin <Alexey.Brodkin@synopsys.com> writes:
>
> > Existing default implementation of __div64_32() for 32-bit arches unfolds
> > into huge routine with tons of arithmetics like +, -, * and all of them
> > in loops. That leads to obvious performance degradation if do_div() is
> > frequently used.
> >
> > Good example is extensive TCP/IP traffic.
> > That's what I'm getting with perf out of iperf3:
> > -------------->8--------------
> > 30.05% iperf3 [kernel.kallsyms] [k] copy_from_iter
> > 11.77% iperf3 [kernel.kallsyms] [k] __div64_32
> > 5.44% iperf3 [kernel.kallsyms] [k] memset
> > 5.32% iperf3 [kernel.kallsyms] [k] stmmac_xmit
> > 2.70% iperf3 [kernel.kallsyms] [k] skb_segment
> > 2.56% iperf3 [kernel.kallsyms] [k] tcp_ack
> > -------------->8--------------
> >
> > do_div() here is mostly used in skb_mstamp_get() to convert nanoseconds
> > received from local_clock() to microseconds used in timestamp.
> > BTW conversion itself is as simple as "/=1000".
> >
> > Fortunately we already have much better __div64_32() for 32-bit ARM.
> > There in case of division by constant preprocessor calculates so-called
> > "magic number" which is later used in multiplications instead of divisions.
> > It's really nice and very optimal but obviously works only for ARM
> > because ARM assembly is involved.
> >
> > Now why don't we extend the same approach to all other 32-bit arches
> > with multiplication part implemented in pure C. With good compiler
> > resulting assembly will be quite close to manually written assembly.
> >
> > And that change implements that.
> >
> > But there's at least 1 problem which I don't know how to solve.
> > Preprocessor magic only happens if __div64_32() is inlined (that's
> > obvious - preprocessor has to know if divider is constant or not).
> >
> > But __div64_32() is already marked as weak function (which in its turn
> > is required to allow some architectures to provide its own optimal
> > implementations). I.e. addition of "inline" for __div64_32() is not an
> > option.
> >
> > So I do want to hear opinions on how to proceed with that patch.
> > Indeed there's the simplest solution - use this implementation only in
> > my architecture of preference (read ARC) but IMHO this change may
> > benefit other architectures as well.
>
> I tried something similar for MIPS a while ago after noticing a similar
> perf report. Adapting Nico's ARM code gave some nice speedups, but only
> when I used MIPS assembly for the long multiplies. Apparently gcc is
> still too stupid to do the sane thing.
Could you please elaborate a little bit on what the problem was with gcc
compared to hand-written asm?
The point is that if the preprocessor does proper constant propagation,
then the compiler only needs to implement the calculations marked
"run-time calculations", and those in turn are pretty straightforward
32-bit + and *.
And at least on ARC I saw that with this change perf no longer captures
__div64_32() during iperf runs, and the iperf results themselves improved
by about 10%. So I'd say the advantage is quite noticeable.
-Alexey
^ permalink raw reply [flat|nested] 24+ messages in thread
* Re: [PATCH] __div64_32: implement division by multiplication for 32-bit arches
2015-10-28 22:47 [PATCH] __div64_32: implement division by multiplication for 32-bit arches Alexey Brodkin
` (2 preceding siblings ...)
2015-10-29 12:52 ` Måns Rullgård
@ 2015-10-29 13:31 ` Russell King - ARM Linux
2015-10-29 14:32 ` Alexey Brodkin
2015-10-29 17:09 ` Randy Dunlap
4 siblings, 1 reply; 24+ messages in thread
From: Russell King - ARM Linux @ 2015-10-29 13:31 UTC (permalink / raw)
To: Alexey Brodkin
Cc: linux-kernel, linux-snps-arc, Vineet Gupta, Ingo Molnar,
Stephen Hemminger, David S. Miller, Nicolas Pitre
On Thu, Oct 29, 2015 at 01:47:35AM +0300, Alexey Brodkin wrote:
> diff --git a/lib/div64.c b/lib/div64.c
> index 62a698a..3055328 100644
> --- a/lib/div64.c
> +++ b/lib/div64.c
> +/*
> + * If the divisor happens to be constant, we determine the appropriate
> + * inverse at compile time to turn the division into a few inline
> + * multiplications instead which is much faster.
> + */
> uint32_t __attribute__((weak)) __div64_32(uint64_t *n, uint32_t base)
> {
> - uint64_t rem = *n;
> - uint64_t b = base;
> - uint64_t res, d = 1;
> - uint32_t high = rem >> 32;
> -
> - /* Reduce the thing a bit first */
> - res = 0;
> - if (high >= base) {
> - high /= base;
> - res = (uint64_t) high << 32;
> - rem -= (uint64_t) (high*base) << 32;
> - }
> + unsigned int __r, __b = base;
>
> - while ((int64_t)b > 0 && b < rem) {
> - b = b+b;
> - d = d+d;
> - }
> + if (!__builtin_constant_p(__b) || __b == 0) {
Can you explain how __builtin_constant_p(__b) can be anything but false
here? I can't see that this will ever be true.
This is a function in its own .c file - the compiler will have no
knowledge about the callers of this function scattered throughout the
kernel, and it has to assume that the 'base' argument to this function
is variable. So, __builtin_constant_p(__b) will always be false, which
means this if () statement will always be true and the else clause will
never be used.
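A hypothetical two-line demonstration of the point: once the test sits
behind a call boundary, the compiler can only see a variable (noinline here
models the weak, out-of-line __div64_32()):

```c
#include <stdbool.h>

/* Across the call boundary the parameter is just a variable, so the
 * builtin answers false regardless of what the caller passed. */
static __attribute__((noinline)) bool seen_as_constant(unsigned int b)
{
	return __builtin_constant_p(b);
}
```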
--
FTTC broadband for 0.8mile line: currently at 9.6Mbps down 400kbps up
according to speedtest.net.
^ permalink raw reply [flat|nested] 24+ messages in thread
* Re: [PATCH] __div64_32: implement division by multiplication for 32-bit arches
2015-10-29 13:05 ` Alexey Brodkin
@ 2015-10-29 13:37 ` Måns Rullgård
0 siblings, 0 replies; 24+ messages in thread
From: Måns Rullgård @ 2015-10-29 13:37 UTC (permalink / raw)
To: Alexey Brodkin
Cc: shemminger@linux-foundation.org, linux-kernel@vger.kernel.org,
Vineet.Gupta1@synopsys.com, linux-snps-arc@lists.infradead.org,
rmk+kernel@arm.linux.org.uk, davem@davemloft.net, mingo@elte.hu,
Nicolas Pitre
Alexey Brodkin <Alexey.Brodkin@synopsys.com> writes:
> Hi mans,
>
> On Thu, 2015-10-29 at 12:52 +0000, Måns Rullgård wrote:
>> Alexey Brodkin <Alexey.Brodkin@synopsys.com> writes:
>>
>> > Existing default implementation of __div64_32() for 32-bit arches unfolds
>> > into huge routine with tons of arithmetics like +, -, * and all of them
>> > in loops. That leads to obvious performance degradation if do_div() is
>> > frequently used.
>> >
>> > Good example is extensive TCP/IP traffic.
>> > That's what I'm getting with perf out of iperf3:
>> > -------------->8--------------
>> > 30.05% iperf3 [kernel.kallsyms] [k] copy_from_iter
>> > 11.77% iperf3 [kernel.kallsyms] [k] __div64_32
>> > 5.44% iperf3 [kernel.kallsyms] [k] memset
>> > 5.32% iperf3 [kernel.kallsyms] [k] stmmac_xmit
>> > 2.70% iperf3 [kernel.kallsyms] [k] skb_segment
>> > 2.56% iperf3 [kernel.kallsyms] [k] tcp_ack
>> > -------------->8--------------
>> >
>> > do_div() here is mostly used in skb_mstamp_get() to convert nanoseconds
>> > received from local_clock() to microseconds used in timestamp.
>> > BTW conversion itself is as simple as "/=1000".
>> >
>> > Fortunately we already have much better __div64_32() for 32-bit ARM.
>> > There in case of division by constant preprocessor calculates so-called
>> > "magic number" which is later used in multiplications instead of divisions.
>> > It's really nice and very optimal but obviously works only for ARM
>> > because ARM assembly is involved.
>> >
>> > Now why don't we extend the same approach to all other 32-bit arches
>> > with multiplication part implemented in pure C. With good compiler
>> > resulting assembly will be quite close to manually written assembly.
>> >
>> > And that change implements that.
>> >
>> > But there's at least 1 problem which I don't know how to solve.
>> > Preprocessor magic only happens if __div64_32() is inlined (that's
>> > obvious - preprocessor has to know if divider is constant or not).
>> >
>> > But __div64_32() is already marked as weak function (which in its turn
>> > is required to allow some architectures to provide its own optimal
>> > implementations). I.e. addition of "inline" for __div64_32() is not an
>> > option.
>> >
>> > So I do want to hear opinions on how to proceed with that patch.
>> > Indeed there's the simplest solution - use this implementation only in
>> > my architecture of preference (read ARC) but IMHO this change may
>> > benefit other architectures as well.
>>
>> I tried something similar for MIPS a while ago after noticing a similar
>> perf report. Adapting Nico's ARM code gave some nice speedups, but only
>> when I used MIPS assembly for the long multiplies. Apparently gcc is
>> still too stupid to do the sane thing.
>
> Could you please elaborate a little bit on what was a problem with gcc
> compared to hand-written asm?
In the final multiplications (the ones using ARM assembly), gcc has a
tendency to multiply things by zero and add the (zero) result to
something. This generally happens when multiplying a 64-bit value by a
32-bit one. The 32-bit value is simply converted to 64-bit by the usual
promotion rules, and gcc forgets that the upper half is known to be zero.
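A minimal sketch of the pattern in question (a hypothetical helper, not the actual kernel code): with the upper half of m known to be zero, (n * m) >> 32 needs only two 32x32->64 partial products, yet gcc, after promoting m to 64 bits, may still emit a third multiply by the zero high word.

```c
#include <assert.h>
#include <stdint.h>

/*
 * Hypothetical sketch: (n * m) >> 32 for 64-bit n and 32-bit m using
 * only 32x32->64 multiplies.  m's upper 32 bits are zero, so two
 * partial products suffice; the complaint above is that gcc forgets
 * this after the usual integer promotion.
 */
static uint64_t mul_64_32_hi(uint64_t n, uint32_t m)
{
	uint32_t n_lo = (uint32_t)n;
	uint32_t n_hi = (uint32_t)(n >> 32);
	uint64_t lo = (uint64_t)n_lo * m;	/* bits 0..63 of n * m */
	uint64_t hi = (uint64_t)n_hi * m;	/* bits 32..95 of n * m */

	return hi + (lo >> 32);	/* sum always fits in 64 bits */
}
```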
> The point is if preprocessor does proper constant propagation then compiler
> will need to implement only calculations marked "run-time calculations".
> And in its turn those are pretty straight-forward 32-bit + and *.
The constant calculation is fine. It's the final multiplication that's
the problem.
> And at least on ARC I saw with that change perf no longer captures
> __div64_32() during iperf and iperf results itself improved for about 10%.
> So I'd say advantage is quite noticeable.
There was an improvement without assembly as well, but with the MIPS
equivalent of the ARM assembly, it got much better.
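For reference, the constant-divisor trick being benchmarked can be sketched in plain C for b = 1000 (the "/= 1000" from skb_mstamp_get()). This is a simplified sketch relying on the __int128 GCC extension to keep it short; the actual kernel code builds the same product from 32x32->64 multiplies, and b = 1000 is one of the divisors that needs the truncation-error correction, hence the extra "+ m".

```c
#include <assert.h>
#include <stdint.h>

/*
 * Reciprocal division by the constant 1000, following the scheme
 * discussed in this thread: p = 512 = 1 << fls(1000),
 * m = (p << 64) / b, and n / b = (n * m + m) >> 73.
 * Simplified sketch using unsigned __int128 (a GCC extension),
 * not the kernel implementation.
 */
static uint64_t div_by_1000(uint64_t n)
{
	const unsigned __int128 m = ((unsigned __int128)512 << 64) / 1000;

	return (uint64_t)(((unsigned __int128)n * m + m) >> 73);
}
```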
--
Måns Rullgård
mans@mansr.com
^ permalink raw reply [flat|nested] 24+ messages in thread
* Re: [PATCH] __div64_32: implement division by multiplication for 32-bit arches
2015-10-29 13:31 ` Russell King - ARM Linux
@ 2015-10-29 14:32 ` Alexey Brodkin
0 siblings, 0 replies; 24+ messages in thread
From: Alexey Brodkin @ 2015-10-29 14:32 UTC (permalink / raw)
To: linux@arm.linux.org.uk
Cc: nicolas.pitre@linaro.org, linux-kernel@vger.kernel.org,
Vineet.Gupta1@synopsys.com, shemminger@linux-foundation.org,
mingo@elte.hu, linux-snps-arc@lists.infradead.org,
davem@davemloft.net
Hi Russell,
On Thu, 2015-10-29 at 13:31 +0000, Russell King - ARM Linux wrote:
> On Thu, Oct 29, 2015 at 01:47:35AM +0300, Alexey Brodkin wrote:
> > diff --git a/lib/div64.c b/lib/div64.c
> > index 62a698a..3055328 100644
> > --- a/lib/div64.c
> > +++ b/lib/div64.c
> > +/*
> > + * If the divisor happens to be constant, we determine the appropriate
> > + * inverse at compile time to turn the division into a few inline
> > + * multiplications instead which is much faster.
> > + */
> > uint32_t __attribute__((weak)) __div64_32(uint64_t *n, uint32_t base)
> > {
> > - uint64_t rem = *n;
> > - uint64_t b = base;
> > - uint64_t res, d = 1;
> > - uint32_t high = rem >> 32;
> > -
> > - /* Reduce the thing a bit first */
> > - res = 0;
> > - if (high >= base) {
> > - high /= base;
> > - res = (uint64_t) high << 32;
> > - rem -= (uint64_t) (high*base) << 32;
> > - }
> > + unsigned int __r, __b = base;
> >
> > - while ((int64_t)b > 0 && b < rem) {
> > - b = b+b;
> > - d = d+d;
> > - }
> > + if (!__builtin_constant_p(__b) || __b == 0) {
>
> Can you explain who __builtin_constant_p(__b) can be anything but false
> here? I can't see that this will ever be true.
>
> This is a function in its own .c file - the compiler will have no
> knowledge about the callers of this function scattered throughout the
> kernel, and it has to assume that the 'base' argument to this function
> is variable. So, __builtin_constant_p(__b) will always be false, which
> means this if () statement will always be true and the else clause will
> never be used.
Essentially constant propagation will only happen if __div64_32() is inlined.
For that we'd need to add an "inline" specifier to __div64_32(), but that in
turn would prevent architectures from providing their own, more optimal
__div64_32() implementations.
And that was my main question: how to implement this properly, i.e. get a better
generic do_div() (with __div64_32() as its heavy-lifting part) while still
keeping the ability for some architectures to use their own implementations.
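The inlining constraint can be illustrated with a toy macro (hypothetical name, not the kernel API): __builtin_constant_p() folds to 1 only where the expansion site actually sees a literal, which an out-of-line, weak __div64_32() never does.

```c
#include <stdint.h>

/*
 * Toy illustration (hypothetical macro, not the kernel API) of why
 * the constant fast path must live in a macro or inline function:
 * __builtin_constant_p() can only report 1 where the expansion site
 * sees the literal divisor.  Inside the out-of-line, weak
 * __div64_32() the divisor is always "unknown", so the builtin is
 * always 0 there, as Russell points out above.
 */
#define DIV_HAS_FAST_PATH(base) \
	(__builtin_constant_p(base) && (base) != 0)
```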
-Alexey
^ permalink raw reply [flat|nested] 24+ messages in thread
* Re: [PATCH] __div64_32: implement division by multiplication for 32-bit arches
2015-10-28 22:47 [PATCH] __div64_32: implement division by multiplication for 32-bit arches Alexey Brodkin
` (3 preceding siblings ...)
2015-10-29 13:31 ` Russell King - ARM Linux
@ 2015-10-29 17:09 ` Randy Dunlap
4 siblings, 0 replies; 24+ messages in thread
From: Randy Dunlap @ 2015-10-29 17:09 UTC (permalink / raw)
To: Alexey Brodkin, linux-kernel
Cc: linux-snps-arc, Vineet Gupta, Ingo Molnar, Stephen Hemminger,
David S. Miller, Nicolas Pitre, Russell King
typos (spellos):
On 10/28/15 15:47, Alexey Brodkin wrote:
> ---
> lib/div64.c | 153 ++++++++++++++++++++++++++++++++++++++++++++++++++----------
> 1 file changed, 128 insertions(+), 25 deletions(-)
>
> diff --git a/lib/div64.c b/lib/div64.c
> index 62a698a..3055328 100644
> --- a/lib/div64.c
> +++ b/lib/div64.c
> @@ -23,37 +23,140 @@
> /* Not needed on 64bit architectures */
> #if BITS_PER_LONG == 32
>
> +
> +/*
> + * If the divisor happens to be constant, we determine the appropriate
> + * inverse at compile time to turn the division into a few inline
> + * multiplications instead which is much faster.
> + */
> uint32_t __attribute__((weak)) __div64_32(uint64_t *n, uint32_t base)
> {
> - uint64_t rem = *n;
> - uint64_t b = base;
> - uint64_t res, d = 1;
> - uint32_t high = rem >> 32;
> -
> - /* Reduce the thing a bit first */
> - res = 0;
> - if (high >= base) {
> - high /= base;
> - res = (uint64_t) high << 32;
> - rem -= (uint64_t) (high*base) << 32;
> - }
> + unsigned int __r, __b = base;
>
> - while ((int64_t)b > 0 && b < rem) {
> - b = b+b;
> - d = d+d;
> - }
> + if (!__builtin_constant_p(__b) || __b == 0) {
> + /* non-constant divisor (or zero): slow path */
> + uint64_t rem = *n;
> + uint64_t b = base;
> + uint64_t res, d = 1;
> + uint32_t high = rem >> 32;
> +
> + /* Reduce the thing a bit first */
> + res = 0;
> + if (high >= base) {
> + high /= base;
> + res = (uint64_t) high << 32;
> + rem -= (uint64_t) (high*base) << 32;
> + }
> +
> + while ((int64_t)b > 0 && b < rem) {
> + b = b+b;
> + d = d+d;
> + }
> +
> + do {
> + if (rem >= b) {
> + rem -= b;
> + res += d;
> + }
> + b >>= 1;
> + d >>= 1;
> + } while (d);
>
> - do {
> - if (rem >= b) {
> - rem -= b;
> - res += d;
> + *n = res;
> + __r = rem;
> + } else if ((__b & (__b - 1)) == 0) {
> + /*
> + * Trivial: __b is constant and a power of 2
> + * gcc does the right thing with this code.
> + * Even though code is the same as above but
> + * we make it visually as a separate path.
> + * Still only one of these branches will survive
> + * pre-processor stage, so let's leave it here.
> + */
> + __r = *n;
> + __r &= (__b - 1);
> + *n /= __b;
> + } else {
> + /* Start of preprocessor calculations */
> +
> + /*
> + * Multiply by inverse of __b: *n/b = *n*(p/b)/p
> + * We rely on the fact that most of this code gets
> + * optimized away at compile time due to constant
> + * propagation and only a couple inline assembly
> + * instructions should remain. Better avoid any
> + * code construct that might prevent that.
> + */
> + unsigned long long __res, __x, __t, __m, __n = *n;
> + unsigned int __p;
> + /* preserve low part of *n for reminder computation */
remainder
> + __r = __n;
> + /* determine number of bits to represent __b */
> + __p = 1 << __div64_fls(__b);
> + /* compute __m = ((__p << 64) + __b - 1) / __b */
> + __m = (~0ULL / __b) * __p;
> + __m += (((~0ULL % __b + 1) * __p) + __b - 1) / __b;
> + /* compute __res = __m*(~0ULL/__b*__b-1)/(__p << 64) */
> + __x = ~0ULL / __b * __b - 1;
> + __res = (__m & 0xffffffff) * (__x & 0xffffffff);
> + __res >>= 32;
> + __res += (__m & 0xffffffff) * (__x >> 32);
> + __t = __res;
> + __res += (__x & 0xffffffff) * (__m >> 32);
> + __t = (__res < __t) ? (1ULL << 32) : 0;
> + __res = (__res >> 32) + __t;
> + __res += (__m >> 32) * (__x >> 32);
> + __res /= __p;
> + /* End of preprocessor calculations */
> +
> + /* Start of run-time calculations */
> + __res = (unsigned int)__m * (unsigned int)__n;
> + __res >>= 32;
> + __res += (unsigned int)__m * (__n >> 32);
> + __t = __res;
> + __res += (unsigned int)__n * (__m >> 32);
> + __t = (__res < __t) ? (1ULL << 32) : 0;
> + __res = (__res >> 32) + __t;
> + __res += (__m >> 32) * (__n >> 32);
> + __res /= __p;
> +
> + /*
> + * The reminder can be computed with 32-bit regs
remainder
> + * only, and gcc is good at that.
> + */
> + {
> + unsigned int __res0 = __res;
> + unsigned int __b0 = __b;
> +
> + __r -= __res0 * __b0;
> }
> - b >>= 1;
> - d >>= 1;
> - } while (d);
> + /* End of run-time calculations */
>
> - *n = res;
> - return rem;
> + *n = __res;
> + }
> + return __r;
> }
>
> EXPORT_SYMBOL(__div64_32);
--
~Randy
^ permalink raw reply [flat|nested] 24+ messages in thread
* Re: [PATCH] __div64_32: implement division by multiplication for 32-bit arches
2015-10-28 23:32 ` Nicolas Pitre
2015-10-29 7:34 ` Alexey Brodkin
@ 2015-10-30 1:26 ` Nicolas Pitre
2015-10-30 5:41 ` Vineet Gupta
` (2 more replies)
1 sibling, 3 replies; 24+ messages in thread
From: Nicolas Pitre @ 2015-10-30 1:26 UTC (permalink / raw)
To: Alexey Brodkin
Cc: linux-kernel, linux-snps-arc, Vineet Gupta, Ingo Molnar,
Stephen Hemminger, David S. Miller, Russell King
On Wed, 28 Oct 2015, Nicolas Pitre wrote:
> On Thu, 29 Oct 2015, Alexey Brodkin wrote:
>
> > Fortunately we already have much better __div64_32() for 32-bit ARM.
> > There in case of division by constant preprocessor calculates so-called
> > "magic number" which is later used in multiplications instead of divisions.
>
> It's not magic, it is science. :-)
>
> > It's really nice and very optimal but obviously works only for ARM
> > because ARM assembly is involved.
> >
> > Now why don't we extend the same approach to all other 32-bit arches
> > with multiplication part implemented in pure C. With good compiler
> > resulting assembly will be quite close to manually written assembly.
Well... not as close at least on ARM. Maybe 2x to 3x more costly than
the one with assembly. Still better than 100x or so without this
optimization.
> > But there's at least 1 problem which I don't know how to solve.
> > Preprocessor magic only happens if __div64_32() is inlined (that's
> > obvious - preprocessor has to know if divider is constant or not).
> >
> > But __div64_32() is already marked as weak function (which in its turn
> > is required to allow some architectures to provide its own optimal
> > implementations). I.e. addition of "inline" for __div64_32() is not an
> > option.
>
> You can't inline __div64_32(). It should remain as is and used only for
> the slow path.
>
> For the constant based optimization to work, you need to modify do_div()
> in include/asm-generic/div64.h directly.
OK... I was intrigued, so I adapted my ARM code to the generic case,
including the overflow avoidance optimizations. Please have a look and
tell me how this works for you.
If this patch is accepted upstream, then it could be possible to
abstract only the actual multiplication part with some architecture
specific assembly.
diff --git a/include/asm-generic/div64.h b/include/asm-generic/div64.h
index 8f4e319334..43c3b21dca 100644
--- a/include/asm-generic/div64.h
+++ b/include/asm-generic/div64.h
@@ -32,6 +32,149 @@
#elif BITS_PER_LONG == 32
+/* macroized fls implementation to ensure proper constant propagation */
+#define __div64_fls(bits) \
+({ \
+ unsigned int __left = (bits), __nr = 0; \
+ if (__left & 0xffff0000) __nr += 16, __left >>= 16; \
+ if (__left & 0x0000ff00) __nr += 8, __left >>= 8; \
+ if (__left & 0x000000f0) __nr += 4, __left >>= 4; \
+ if (__left & 0x0000000c) __nr += 2, __left >>= 2; \
+ if (__left & 0x00000002) __nr += 1; \
+ __nr; \
+})
+
+/*
+ * If the divisor happens to be constant, we determine the appropriate
+ * inverse at compile time to turn the division into a few inline
+ * multiplications which ought to be much faster. And yet only if compiling
+ * with a sufficiently recent gcc version to perform proper 64-bit constant
+ * propagation.
+ *
+ * (It is unfortunate that gcc doesn't perform all this internally.)
+ */
+#define __div64_const32(n, ___b) \
+({ \
+ /* \
+ * Multiplication by reciprocal of b: n / b = n * (p / b) / p \
+ * \
+ * We rely on the fact that most of this code gets optimized \
+ * away at compile time due to constant propagation and only \
+ * a few multiplication instructions should remain. \
+ * Hence this monstrous macro (static inline doesn't always \
+ * do the trick for some reason). \
+ */ \
+ uint64_t ___res, ___x, ___t, ___m, ___n = (n); \
+ uint32_t ___c, ___p, ___m_lo, ___m_hi, ___n_lo, ___n_hi; \
+ \
+ /* determine number of bits to represent b */ \
+ ___p = 1 << __div64_fls(___b); \
+ \
+ /* compute m = ((p << 64) + b - 1) / b */ \
+ ___m = (~0ULL / ___b) * ___p; \
+ ___m += (((~0ULL % ___b + 1) * ___p) + ___b - 1) / ___b; \
+ \
+ /* dividend that produces one less than the highest result */ \
+ ___x = ~0ULL / ___b * ___b - 1; \
+ \
+ /* test our m with res = m * x / (p << 64) */ \
+ ___res = ((___m & 0xffffffff) * (___x & 0xffffffff)) >> 32; \
+ ___t = ___res += (___m & 0xffffffff) * (___x >> 32); \
+ ___res += (___x & 0xffffffff) * (___m >> 32); \
+ ___t = (___res < ___t) ? (1ULL << 32) : 0; \
+ ___res = (___res >> 32) + ___t; \
+ ___res += (___m >> 32) * (___x >> 32); \
+ ___res /= ___p; \
+ \
+ /* Now sanitize and optimize what we've got. */ \
+ if (~0ULL % (___b / (___b & -___b)) == 0) { \
+ /* those cases can be simplified with: */ \
+ ___n /= (___b & -___b); \
+ ___m = ~0ULL / (___b / (___b & -___b)); \
+ ___p = 1; \
+ ___c = 1; \
+ } else if (___res != ___x / ___b) { \
+ /* \
+ * We can't get away without a correction to compensate \
+ * for bit truncation errors. To avoid it we'd need an \
+ * additional bit to represent m which would overflow \
+ * our 64-bit variable. \
+ * \
+ * Instead we do m = p / b and n / b = (n * m + m) / p. \
+ */ \
+ ___c = 1; \
+ /* Compute m = (p << 64) / b */ \
+ ___m = (~0ULL / ___b) * ___p; \
+ ___m += ((~0ULL % ___b + 1) * ___p) / ___b; \
+ } else { \
+ /* \
+ * Reduce m / p, and try to clear bit 31 of m when \
+ * possible, otherwise that'll need extra overflow \
+ * handling later. \
+ */ \
+ unsigned int ___bits = -(___m & -___m); \
+ ___bits |= ___m >> 32; \
+ ___bits = (~___bits) << 1; \
+ /* \
+ * If ___bits == 0 then setting bit 31 is unavoidable. \
+ * Simply apply the maximum possible reduction in that \
+ * case. Otherwise the MSB of ___bits indicates the \
+ * best reduction we should apply. \
+ */ \
+ if (!___bits) { \
+ ___p /= (___m & -___m); \
+ ___m /= (___m & -___m); \
+ } else { \
+ ___p >>= __div64_fls(___bits); \
+ ___m >>= __div64_fls(___bits); \
+ } \
+ /* No correction needed. */ \
+ ___c = 0; \
+ } \
+ \
+ /* \
+ * Now we have a combination of 2 conditions: \
+ * \
+ * 1) whether or not we need a correction (___c), and \
+ * \
+ * 2) whether or not there might be an overflow in the cross \
+ * product determined by (___m & ((1 << 63) | (1 << 31))). \
+ * \
+ * Select the best way to do the m * n / (p << 64) operation. \
+ * From here on there will be actual runtime code generated. \
+ */ \
+ \
+ ___m_lo = ___m; \
+ ___m_hi = ___m >> 32; \
+ ___n_lo = ___n; \
+ ___n_hi = ___n >> 32; \
+ \
+ if (!___c) { \
+ ___res = ((uint64_t)___m_lo * ___n_lo) >> 32; \
+ } else if (!(___m & ((1ULL << 63) | (1ULL << 31)))) { \
+ ___res = (___m + (uint64_t)___m_lo * ___n_lo) >> 32; \
+ } else { \
+ ___res = ___m + (uint64_t)___m_lo * ___n_lo; \
+ ___t = (___res < ___m) ? (1ULL << 32) : 0; \
+ ___res = (___res >> 32) + ___t; \
+ } \
+ \
+ if (!(___m & ((1ULL << 63) | (1ULL << 31)))) { \
+ ___res += (uint64_t)___m_lo * ___n_hi; \
+ ___res += (uint64_t)___m_hi * ___n_lo; \
+ ___res >>= 32; \
+ } else { \
+ ___t = ___res += (uint64_t)___m_lo * ___n_hi; \
+ ___res += (uint64_t)___m_hi * ___n_lo; \
+ ___t = (___res < ___t) ? (1ULL << 32) : 0; \
+ ___res = (___res >> 32) + ___t; \
+ } \
+ \
+ ___res += (uint64_t)___m_hi * ___n_hi; \
+ \
+ ___res / ___p; \
+})
+
extern uint32_t __div64_32(uint64_t *dividend, uint32_t divisor);
/* The unnecessary pointer compare is there
@@ -41,7 +184,20 @@ extern uint32_t __div64_32(uint64_t *dividend, uint32_t divisor);
uint32_t __base = (base); \
uint32_t __rem; \
(void)(((typeof((n)) *)0) == ((uint64_t *)0)); \
- if (likely(((n) >> 32) == 0)) { \
+ if (__builtin_constant_p(__base) && \
+ (__base & (__base - 1)) == 0) { \
+ /* constant power of 2: gcc is fine */ \
+ __rem = (n) & (__base - 1); \
+ (n) /= __base; \
+ } else if ((__GNUC__ >= 4) && \
+ __builtin_constant_p(__base) && \
+ __base != 0) { \
+ uint32_t __res_lo, __n_lo = (n); \
+ (n) = __div64_const32(n, __base); \
+ /* the remainder can be computed with 32-bit regs */ \
+ __res_lo = (n); \
+ __rem = __n_lo - __res_lo * __base; \
+ } else if (likely(((n) >> 32) == 0)) { \
__rem = (uint32_t)(n) % __base; \
(n) = (uint32_t)(n) / __base; \
} else \
^ permalink raw reply related [flat|nested] 24+ messages in thread
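As a sanity check, the __div64_fls() macro from the patch above can be exercised as a plain function (rewritten here purely for illustration; the kernel keeps it a macro so that, for a constant divisor, the result itself stays a compile-time constant).

```c
#include <assert.h>

/*
 * Plain-function version of the __div64_fls() macro from the patch
 * above: 0-based position of the most significant set bit.
 */
static unsigned int div64_fls(unsigned int bits)
{
	unsigned int left = bits, nr = 0;

	if (left & 0xffff0000) { nr += 16; left >>= 16; }
	if (left & 0x0000ff00) { nr += 8;  left >>= 8;  }
	if (left & 0x000000f0) { nr += 4;  left >>= 4;  }
	if (left & 0x0000000c) { nr += 2;  left >>= 2;  }
	if (left & 0x00000002) { nr += 1; }
	return nr;
}
```

For b = 1000 this yields 9, so p = 1 << 9 = 512, matching the constants used elsewhere in the thread.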
* Re: [PATCH] __div64_32: implement division by multiplication for 32-bit arches
2015-10-30 1:26 ` Nicolas Pitre
@ 2015-10-30 5:41 ` Vineet Gupta
2015-10-30 12:41 ` Måns Rullgård
2015-10-30 12:40 ` Måns Rullgård
2015-10-30 14:28 ` Alexey Brodkin
2 siblings, 1 reply; 24+ messages in thread
From: Vineet Gupta @ 2015-10-30 5:41 UTC (permalink / raw)
To: Nicolas Pitre, Alexey Brodkin
Cc: linux-kernel@vger.kernel.org, linux-snps-arc@lists.infradead.org,
Ingo Molnar, Stephen Hemminger, David S. Miller, Russell King
+CC Claudiu: ARC gcc expert (please scroll to bottom)
On Friday 30 October 2015 06:56 AM, Nicolas Pitre wrote:
> On Wed, 28 Oct 2015, Nicolas Pitre wrote:
>
>> On Thu, 29 Oct 2015, Alexey Brodkin wrote:
>>
>>> Fortunately we already have much better __div64_32() for 32-bit ARM.
>>> There in case of division by constant preprocessor calculates so-called
>>> "magic number" which is later used in multiplications instead of divisions.
>> It's not magic, it is science. :-)
>>
>>> It's really nice and very optimal but obviously works only for ARM
>>> because ARM assembly is involved.
>>>
>>> Now why don't we extend the same approach to all other 32-bit arches
>>> with multiplication part implemented in pure C. With good compiler
>>> resulting assembly will be quite close to manually written assembly.
> Well... not as close at least on ARM. Maybe 2x to 3x more costly than
> the one with assembly. Still better than 100x or so without this
> optimization.
>
>>> But there's at least 1 problem which I don't know how to solve.
>>> Preprocessor magic only happens if __div64_32() is inlined (that's
>>> obvious - preprocessor has to know if divider is constant or not).
>>>
>>> But __div64_32() is already marked as weak function (which in its turn
>>> is required to allow some architectures to provide its own optimal
>>> implementations). I.e. addition of "inline" for __div64_32() is not an
>>> option.
>> You can't inline __div64_32(). It should remain as is and used only for
>> the slow path.
>>
>> For the constant based optimization to work, you need to modify do_div()
>> in include/asm-generic/div64.h directly.
> OK... I was intrigued, so I adapted my ARM code to the generic case,
> including the overflow avoidance optimizations. Please have a look and
> tell me how this works for you.
>
> If this patch is accepted upstream, then it could be possible to
> abstract only the actual multiplication part with some architecture
> specific assembly.
>
> diff --git a/include/asm-generic/div64.h b/include/asm-generic/div64.h
> index 8f4e319334..43c3b21dca 100644
> --- a/include/asm-generic/div64.h
> +++ b/include/asm-generic/div64.h
> @@ -32,6 +32,149 @@
>
> #elif BITS_PER_LONG == 32
>
> +/* macroized fls implementation to ensure proper constant propagation */
> +#define __div64_fls(bits) \
> +({ \
> + unsigned int __left = (bits), __nr = 0; \
> + if (__left & 0xffff0000) __nr += 16, __left >>= 16; \
> + if (__left & 0x0000ff00) __nr += 8, __left >>= 8; \
> + if (__left & 0x000000f0) __nr += 4, __left >>= 4; \
> + if (__left & 0x0000000c) __nr += 2, __left >>= 2; \
> + if (__left & 0x00000002) __nr += 1; \
> + __nr; \
> +})
> +
> +/*
> + * If the divisor happens to be constant, we determine the appropriate
> + * inverse at compile time to turn the division into a few inline
> + * multiplications which ought to be much faster. And yet only if compiling
> + * with a sufficiently recent gcc version to perform proper 64-bit constant
> + * propagation.
> + *
> + * (It is unfortunate that gcc doesn't perform all this internally.)
> + */
> +#define __div64_const32(n, ___b) \
> +({ \
> + /* \
> + * Multiplication by reciprocal of b: n / b = n * (p / b) / p \
> + * \
> + * We rely on the fact that most of this code gets optimized \
> + * away at compile time due to constant propagation and only \
> + * a few multiplication instructions should remain. \
> + * Hence this monstrous macro (static inline doesn't always \
> + * do the trick for some reason). \
> + */ \
> + uint64_t ___res, ___x, ___t, ___m, ___n = (n); \
> + uint32_t ___c, ___p, ___m_lo, ___m_hi, ___n_lo, ___n_hi; \
> + \
> + /* determine number of bits to represent b */ \
> + ___p = 1 << __div64_fls(___b); \
> + \
> + /* compute m = ((p << 64) + b - 1) / b */ \
> + ___m = (~0ULL / ___b) * ___p; \
> + ___m += (((~0ULL % ___b + 1) * ___p) + ___b - 1) / ___b; \
> + \
> + /* dividend that produces one less than the highest result */ \
> + ___x = ~0ULL / ___b * ___b - 1; \
> + \
> + /* test our m with res = m * x / (p << 64) */ \
> + ___res = ((___m & 0xffffffff) * (___x & 0xffffffff)) >> 32; \
> + ___t = ___res += (___m & 0xffffffff) * (___x >> 32); \
> + ___res += (___x & 0xffffffff) * (___m >> 32); \
> + ___t = (___res < ___t) ? (1ULL << 32) : 0; \
> + ___res = (___res >> 32) + ___t; \
> + ___res += (___m >> 32) * (___x >> 32); \
> + ___res /= ___p; \
> + \
> + /* Now sanitize and optimize what we've got. */ \
> + if (~0ULL % (___b / (___b & -___b)) == 0) { \
> + /* those cases can be simplified with: */ \
> + ___n /= (___b & -___b); \
> + ___m = ~0ULL / (___b / (___b & -___b)); \
> + ___p = 1; \
> + ___c = 1; \
> + } else if (___res != ___x / ___b) { \
> + /* \
> + * We can't get away without a correction to compensate \
> + * for bit truncation errors. To avoid it we'd need an \
> + * additional bit to represent m which would overflow \
> + * our 64-bit variable. \
> + * \
> + * Instead we do m = p / b and n / b = (n * m + m) / p. \
> + */ \
> + ___c = 1; \
> + /* Compute m = (p << 64) / b */ \
> + ___m = (~0ULL / ___b) * ___p; \
> + ___m += ((~0ULL % ___b + 1) * ___p) / ___b; \
> + } else { \
> + /* \
> + * Reduce m / p, and try to clear bit 31 of m when \
> + * possible, otherwise that'll need extra overflow \
> + * handling later. \
> + */ \
> + unsigned int ___bits = -(___m & -___m); \
> + ___bits |= ___m >> 32; \
> + ___bits = (~___bits) << 1; \
> + /* \
> + * If ___bits == 0 then setting bit 31 is unavoidable. \
> + * Simply apply the maximum possible reduction in that \
> + * case. Otherwise the MSB of ___bits indicates the \
> + * best reduction we should apply. \
> + */ \
> + if (!___bits) { \
> + ___p /= (___m & -___m); \
> + ___m /= (___m & -___m); \
> + } else { \
> + ___p >>= __div64_fls(___bits); \
> + ___m >>= __div64_fls(___bits); \
> + } \
> + /* No correction needed. */ \
> + ___c = 0; \
> + } \
> + \
> + /* \
> + * Now we have a combination of 2 conditions: \
> + * \
> + * 1) whether or not we need a correction (___c), and \
> + * \
> + * 2) whether or not there might be an overflow in the cross \
> + * product determined by (___m & ((1 << 63) | (1 << 31))). \
> + * \
> + * Select the best way to do the m * n / (p << 64) operation. \
> + * From here on there will be actual runtime code generated. \
> + */ \
> + \
> + ___m_lo = ___m; \
> + ___m_hi = ___m >> 32; \
> + ___n_lo = ___n; \
> + ___n_hi = ___n >> 32; \
> + \
> + if (!___c) { \
> + ___res = ((uint64_t)___m_lo * ___n_lo) >> 32; \
> + } else if (!(___m & ((1ULL << 63) | (1ULL << 31)))) { \
> + ___res = (___m + (uint64_t)___m_lo * ___n_lo) >> 32; \
> + } else { \
> + ___res = ___m + (uint64_t)___m_lo * ___n_lo; \
> + ___t = (___res < ___m) ? (1ULL << 32) : 0; \
> + ___res = (___res >> 32) + ___t; \
> + } \
> + \
> + if (!(___m & ((1ULL << 63) | (1ULL << 31)))) { \
> + ___res += (uint64_t)___m_lo * ___n_hi; \
> + ___res += (uint64_t)___m_hi * ___n_lo; \
> + ___res >>= 32; \
> + } else { \
> + ___t = ___res += (uint64_t)___m_lo * ___n_hi; \
> + ___res += (uint64_t)___m_hi * ___n_lo; \
> + ___t = (___res < ___t) ? (1ULL << 32) : 0; \
> + ___res = (___res >> 32) + ___t; \
> + } \
> + \
> + ___res += (uint64_t)___m_hi * ___n_hi; \
> + \
> + ___res / ___p; \
> +})
> +
> extern uint32_t __div64_32(uint64_t *dividend, uint32_t divisor);
>
> /* The unnecessary pointer compare is there
> @@ -41,7 +184,20 @@ extern uint32_t __div64_32(uint64_t *dividend, uint32_t divisor);
> uint32_t __base = (base); \
> uint32_t __rem; \
> (void)(((typeof((n)) *)0) == ((uint64_t *)0)); \
> - if (likely(((n) >> 32) == 0)) { \
> + if (__builtin_constant_p(__base) && \
> + (__base & (__base - 1)) == 0) { \
> + /* constant power of 2: gcc is fine */ \
> + __rem = (n) & (__base - 1); \
> + (n) /= __base; \
> + } else if ((__GNUC__ >= 4) && \
> + __builtin_constant_p(__base) && \
> + __base != 0) { \
> + uint32_t __res_lo, __n_lo = (n); \
> + (n) = __div64_const32(n, __base); \
> + /* the remainder can be computed with 32-bit regs */ \
> + __res_lo = (n); \
> + __rem = __n_lo - __res_lo * __base; \
> + } else if (likely(((n) >> 32) == 0)) { \
> __rem = (uint32_t)(n) % __base; \
> (n) = (uint32_t)(n) / __base; \
> } else \
>
Claudiu, can some of this not be done in gcc itself?
-Vineet
^ permalink raw reply [flat|nested] 24+ messages in thread
* Re: [PATCH] __div64_32: implement division by multiplication for 32-bit arches
2015-10-30 1:26 ` Nicolas Pitre
2015-10-30 5:41 ` Vineet Gupta
@ 2015-10-30 12:40 ` Måns Rullgård
2015-10-30 15:17 ` Nicolas Pitre
2015-10-30 14:28 ` Alexey Brodkin
2 siblings, 1 reply; 24+ messages in thread
From: Måns Rullgård @ 2015-10-30 12:40 UTC (permalink / raw)
To: Nicolas Pitre
Cc: Alexey Brodkin, linux-kernel, linux-snps-arc, Vineet Gupta,
Ingo Molnar, Stephen Hemminger, David S. Miller, Russell King
Nicolas Pitre <nicolas.pitre@linaro.org> writes:
> On Wed, 28 Oct 2015, Nicolas Pitre wrote:
>
>> On Thu, 29 Oct 2015, Alexey Brodkin wrote:
>>
>> > Fortunately we already have much better __div64_32() for 32-bit ARM.
>> > There in case of division by constant preprocessor calculates so-called
>> > "magic number" which is later used in multiplications instead of divisions.
>>
>> It's not magic, it is science. :-)
>>
>> > It's really nice and very optimal but obviously works only for ARM
>> > because ARM assembly is involved.
>> >
>> > Now why don't we extend the same approach to all other 32-bit arches
>> > with multiplication part implemented in pure C. With good compiler
>> > resulting assembly will be quite close to manually written assembly.
>
> Well... not as close at least on ARM. Maybe 2x to 3x more costly than
> the one with assembly. Still better than 100x or so without this
> optimization.
That's more or less what I found on MIPS too.
>> > But there's at least 1 problem which I don't know how to solve.
>> > Preprocessor magic only happens if __div64_32() is inlined (that's
>> > obvious - preprocessor has to know if divider is constant or not).
>> >
>> > But __div64_32() is already marked as weak function (which in its turn
>> > is required to allow some architectures to provide its own optimal
>> > implementations). I.e. addition of "inline" for __div64_32() is not an
>> > option.
>>
>> You can't inline __div64_32(). It should remain as is and used only for
>> the slow path.
>>
>> For the constant based optimization to work, you need to modify do_div()
>> in include/asm-generic/div64.h directly.
>
> OK... I was intrigued, so I adapted my ARM code to the generic case,
> including the overflow avoidance optimizations. Please have a look and
> tell me how this works for you.
>
> If this patch is accepted upstream, then it could be possible to
> abstract only the actual multiplication part with some architecture
> specific assembly.
Good idea.
--
Måns Rullgård
mans@mansr.com
^ permalink raw reply [flat|nested] 24+ messages in thread
* Re: [PATCH] __div64_32: implement division by multiplication for 32-bit arches
2015-10-30 5:41 ` Vineet Gupta
@ 2015-10-30 12:41 ` Måns Rullgård
0 siblings, 0 replies; 24+ messages in thread
From: Måns Rullgård @ 2015-10-30 12:41 UTC (permalink / raw)
To: Vineet Gupta
Cc: Nicolas Pitre, Alexey Brodkin, linux-kernel@vger.kernel.org,
Russell King, Ingo Molnar, linux-snps-arc@lists.infradead.org,
Stephen Hemminger, David S. Miller
Vineet Gupta <Vineet.Gupta1@synopsys.com> writes:
> Claudiu, can some of this not be done in gcc itself !
All of it could be done by gcc. The trouble is that it isn't.
--
Måns Rullgård
mans@mansr.com
^ permalink raw reply [flat|nested] 24+ messages in thread
* Re: [PATCH] __div64_32: implement division by multiplication for 32-bit arches
2015-10-30 1:26 ` Nicolas Pitre
2015-10-30 5:41 ` Vineet Gupta
2015-10-30 12:40 ` Måns Rullgård
@ 2015-10-30 14:28 ` Alexey Brodkin
2 siblings, 0 replies; 24+ messages in thread
From: Alexey Brodkin @ 2015-10-30 14:28 UTC (permalink / raw)
To: nicolas.pitre@linaro.org
Cc: rmk+kernel@arm.linux.org.uk, linux-kernel@vger.kernel.org,
Vineet.Gupta1@synopsys.com, shemminger@linux-foundation.org,
mingo@elte.hu, linux-snps-arc@lists.infradead.org,
davem@davemloft.net
Hi Nicolas,
On Thu, 2015-10-29 at 21:26 -0400, Nicolas Pitre wrote:
> On Wed, 28 Oct 2015, Nicolas Pitre wrote:
>
> > On Thu, 29 Oct 2015, Alexey Brodkin wrote:
> >
> > > Fortunately we already have much better __div64_32() for 32-bit ARM.
> > > There in case of division by constant preprocessor calculates so-called
> > > "magic number" which is later used in multiplications instead of divisions.
> >
> > It's not magic, it is science. :-)
> >
> > > It's really nice and very optimal but obviously works only for ARM
> > > because ARM assembly is involved.
> > >
> > > Now why don't we extend the same approach to all other 32-bit arches
> > > with multiplication part implemented in pure C. With good compiler
> > > resulting assembly will be quite close to manually written assembly.
>
> Well... not as close at least on ARM. Maybe 2x to 3x more costly than
> the one with assembly. Still better than 100x or so without this
> optimization.
Indeed, even having that function 25 times faster instead of 100 times faster is
already quite an achievement. For example, that will already cure my iperf
performance degradation: I'll see do_div() taking < 1% instead of > 10% now.
My test source was:
--------------------->8------------------------
int myfunc(u64 data)
{
	/* do_div() divides in place: the quotient ends up in data,
	 * the remainder is returned */
	return do_div(data, 1000);
}
--------------------->8------------------------
Now take a look at the disassembly that I'm getting:
--------------------->8------------------------
0000062c <myfunc>:
62c: 19 28 86 0f 4f 8d 3b df mpydu r6,r0,0x8d4fdf3b
634: 00 26 86 8f 4f 8d 3b df add.f r6,r6,0x8d4fdf3b
63c: 02 27 87 0f ed 7c 69 91 sub r7,r7,0x7ced9169
644: c0 27 65 00 add.c r7,r7,1
648: 85 0e c4 71 12 83 97 6e brlo 0x83126e97,r7,6cc <myfunc+0xa0>
650: 75 0f 80 0f 12 83 97 6e breq r7,0x83126e97,6c4 <myfunc+0x98>
658: 0d 70 mov_s r8,0
65a: 2d 71 mov_s r9,1
65c: 19 29 8a 0f 4f 8d 3b df mpydu r10,r1,0x8d4fdf3b
664: 00 27 84 82 add.f r4,r7,r10
668: ac 70 mov_s r5,0
66a: 19 28 86 0f 12 83 97 6e mpydu r6,r0,0x83126e97
672: 01 25 c5 02 adc r5,r5,r11
676: 00 24 04 82 add.f r4,r4,r8
67a: 00 25 45 02 add r5,r5,r9
67e: c0 25 65 00 add.c r5,r5,1
682: 00 26 06 81 add.f r6,r6,r4
686: 01 27 47 01 adc r7,r7,r5
68a: 51 0f 44 01 brlo r7,r5,6d8 <myfunc+0xac>
68e: 49 0d c0 01 breq r5,r7,6d4 <myfunc+0xa8>
692: 8c 70 mov_s r4,0
694: ac 70 mov_s r5,0
696: e0 42 mov_s r2,r7
698: 19 29 86 0f 12 83 97 6e mpydu r6,r1,0x83126e97
6a0: 00 22 82 81 add.f r2,r2,r6
6a4: 6c 70 mov_s r3,0
6a6: 01 23 c3 01 adc r3,r3,r7
6aa: 00 22 02 81 add.f r2,r2,r4
6ae: a0 73 add_s r3,r3,r5
6b0: c0 23 65 00 add.c r3,r3,1
6b4: 29 ba lsr_s r2,r2,9
6b6: 17 bb asl_s r3,r3,23
6b8: 65 7a or_s r2,r2,r3
6ba: 9a 22 0f 0a mpy r2,r2,0x3e8
6be: e0 7f j_s.d [blink]
6c0: 42 78 sub_s r0,r0,r2
6c2: e0 78 nop_s
6c4: 95 0e 85 f1 4f 8d 3a df brhs.nt 0x8d4fdf3a,r6,658 <myfunc+0x2c>
6cc: 0d 70 mov_s r8,0
6ce: 91 07 ef ff b.d 65c <myfunc+0x30>
6d2: 2d 70 mov_s r9,0
6d4: bf 0e 05 81 brhs.nt r6,r4,692 <myfunc+0x66>
6d8: 8c 70 mov_s r4,0
6da: bf 07 ef ff b.d 696 <myfunc+0x6a>
6de: ac 71 mov_s r5,1
--------------------->8------------------------
What you see here is a pretty straightforward conversion to assembly of the "run-time calculations".
Things to note:
[1] Only 5 multiplications are used. That's because we have a 32x32 multiplication unit
that returns a 64-bit result in a register pair.
[2] Indeed, lots of moves and additions happen here.
So my conclusion would be:
[1] The proposed implementation makes perfect sense because it already speeds up do_div()
significantly.
[2] The ability to substitute the "run-time calculations" on a per-arch basis would be awesome
because with a few lines of assembly another 2-4x improvement could be
achieved.
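For reference, the trick behind those mpydu instructions can be modeled in a few
lines of C (a model only, not the kernel code): dividing by 1000 becomes a
widening multiply by a precomputed scaled reciprocal, a bias addition, and a shift.

```c
#include <stdint.h>

/* Model of constant-division-by-multiplication for x / 1000.
 * 0x83126E978D4FDF3BULL == (2^73 - 392) / 1000; its two 32-bit halves
 * (0x83126E97 and 0x8D4FDF3B) are exactly the immediates seen in the
 * disassembly above. GCC's unsigned __int128 stands in here for what
 * the hardware does with several 32x32->64 multiplies and adds. */
static uint64_t div1000(uint64_t x)
{
	const uint64_t magic = 0x83126E978D4FDF3BULL;

	/* The "+ magic" term is the bias for having rounded the scaled
	 * reciprocal down; with it the result is exact for all 64-bit x
	 * (the error stays strictly below one quotient step). */
	return (uint64_t)((((unsigned __int128)x * magic) + magic) >> 73);
}
```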
-Alexey
* Re: [PATCH] __div64_32: implement division by multiplication for 32-bit arches
2015-10-30 12:40 ` Måns Rullgård
@ 2015-10-30 15:17 ` Nicolas Pitre
2015-10-30 15:54 ` Alexey Brodkin
0 siblings, 1 reply; 24+ messages in thread
From: Nicolas Pitre @ 2015-10-30 15:17 UTC (permalink / raw)
To: Måns Rullgård
Cc: Alexey Brodkin, linux-kernel, linux-snps-arc, Vineet Gupta,
Ingo Molnar, Stephen Hemminger, David S. Miller, Russell King
On Fri, 30 Oct 2015, Måns Rullgård wrote:
> Nicolas Pitre <nicolas.pitre@linaro.org> writes:
>
> > OK... I was intrigued, so I adapted my ARM code to the generic case,
> > including the overflow avoidance optimizations. Please have a look and
> > tell me how this works for you.
> >
> > If this patch is accepted upstream, then it could be possible to
> > abstract only the actual multiplication part with some architecture
> > specific assembly.
>
> Good idea.
Could you please provide a reviewed-by or acked-by tag?
Nicolas
* Re: [PATCH] __div64_32: implement division by multiplication for 32-bit arches
2015-10-30 15:17 ` Nicolas Pitre
@ 2015-10-30 15:54 ` Alexey Brodkin
2015-10-30 16:55 ` Nicolas Pitre
0 siblings, 1 reply; 24+ messages in thread
From: Alexey Brodkin @ 2015-10-30 15:54 UTC (permalink / raw)
To: nicolas.pitre@linaro.org
Cc: shemminger@linux-foundation.org, linux-kernel@vger.kernel.org,
Vineet.Gupta1@synopsys.com, linux-snps-arc@lists.infradead.org,
mans@mansr.com, rmk+kernel@arm.linux.org.uk, davem@davemloft.net,
mingo@elte.hu
Hi Nicolas,
On Fri, 2015-10-30 at 11:17 -0400, Nicolas Pitre wrote:
> On Fri, 30 Oct 2015, Måns Rullgård wrote:
>
> > Nicolas Pitre <nicolas.pitre@linaro.org> writes:
> >
> > > OK... I was intrigued, so I adapted my ARM code to the generic case,
> > > including the overflow avoidance optimizations. Please have a look and
> > > tell me how this works for you.
> > >
> > > If this patch is accepted upstream, then it could be possible to
> > > abstract only the actual multiplication part with some architecture
> > > specific assembly.
> >
> > Good idea.
>
> Could you please provide a reviewed-by or acked-by tag?
Sure!
Acked-by: Alexey Brodkin <abrodkin@synopsys.com>
BTW I thought about that optimization a bit more and now I think
we may even skip the addition of arch-specific assembly insertions.
That's because, as discussed many times, that kind of division should be
used as sparingly as possible; in other words, there should be
just a very few usages of it, especially in very frequently used code paths.
And in that case there might not be much benefit in having do_div()
even faster and smaller than the one we're about to get with your change.
-Alexey
* Re: [PATCH] __div64_32: implement division by multiplication for 32-bit arches
2015-10-30 15:54 ` Alexey Brodkin
@ 2015-10-30 16:55 ` Nicolas Pitre
2015-10-30 17:45 ` Måns Rullgård
0 siblings, 1 reply; 24+ messages in thread
From: Nicolas Pitre @ 2015-10-30 16:55 UTC (permalink / raw)
To: Alexey Brodkin
Cc: shemminger@linux-foundation.org, linux-kernel@vger.kernel.org,
Vineet.Gupta1@synopsys.com, linux-snps-arc@lists.infradead.org,
mans@mansr.com, rmk+kernel@arm.linux.org.uk, davem@davemloft.net,
mingo@elte.hu
On Fri, 30 Oct 2015, Alexey Brodkin wrote:
> Hi Nicolas,
>
> On Fri, 2015-10-30 at 11:17 -0400, Nicolas Pitre wrote:
> > On Fri, 30 Oct 2015, Måns Rullgård wrote:
> >
> > > Nicolas Pitre <nicolas.pitre@linaro.org> writes:
> > >
> > > > OK... I was intrigued, so I adapted my ARM code to the generic case,
> > > > including the overflow avoidance optimizations. Please have a look and
> > > > tell me how this works for you.
> > > >
> > > > If this patch is accepted upstream, then it could be possible to
> > > > abstract only the actual multiplication part with some architecture
> > > > specific assembly.
> > >
> > > Good idea.
> >
> > Could you please provide a reviewed-by or acked-by tag?
>
> Sure!
>
> Acked-by: Alexey Brodkin <abrodkin@synopsys.com>
>
> BTW I thought about that optimization a bit more and now I think
> we may even skip addition of arch-specific assembly insertions.
I'm going to do it anyway given that I already have it for ARM. It'll
be opt-in, so if your arch doesn't provide it then the current C
implementation will be used by default.
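For illustration only, such an opt-in hook could be structured like the sketch
below (hypothetical helper name; the actual patches may arrange it differently):
the generic header supplies a portable C version of the 64x64->high-64 multiply
at the heart of the algorithm, and an architecture overrides it by defining the
symbol before the generic code is seen.

```c
#include <stdint.h>

/* An arch header could provide its own assembly-backed version and
 * announce it with "#define __arch_umul64_hi __arch_umul64_hi"
 * (hypothetical name). Otherwise this generic C fallback is used. */
#ifndef __arch_umul64_hi
/* High 64 bits of a 64x64->128 multiply, built from four 32x32->64
 * partial products -- the kind of operation mpydu does on ARC. */
static inline uint64_t __arch_umul64_hi(uint64_t a, uint64_t b)
{
	uint64_t a_lo = (uint32_t)a, a_hi = a >> 32;
	uint64_t b_lo = (uint32_t)b, b_hi = b >> 32;
	uint64_t p0 = a_lo * b_lo;
	uint64_t p1 = a_lo * b_hi;
	uint64_t p2 = a_hi * b_lo;
	uint64_t p3 = a_hi * b_hi;
	uint64_t mid = p1 + (p0 >> 32);		/* cannot overflow */
	uint64_t mid2 = p2 + (uint32_t)mid;	/* cannot overflow */

	return p3 + (mid >> 32) + (mid2 >> 32);
}
#endif
```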
Nicolas
* Re: [PATCH] __div64_32: implement division by multiplication for 32-bit arches
2015-10-30 16:55 ` Nicolas Pitre
@ 2015-10-30 17:45 ` Måns Rullgård
2015-11-04 23:46 ` Nicolas Pitre
0 siblings, 1 reply; 24+ messages in thread
From: Måns Rullgård @ 2015-10-30 17:45 UTC (permalink / raw)
To: Nicolas Pitre
Cc: Alexey Brodkin, shemminger@linux-foundation.org,
linux-kernel@vger.kernel.org, Vineet.Gupta1@synopsys.com,
linux-snps-arc@lists.infradead.org, rmk+kernel@arm.linux.org.uk,
davem@davemloft.net, mingo@elte.hu
Nicolas Pitre <nicolas.pitre@linaro.org> writes:
> On Fri, 30 Oct 2015, Alexey Brodkin wrote:
>
>> Hi Nicolas,
>>
>> On Fri, 2015-10-30 at 11:17 -0400, Nicolas Pitre wrote:
>> > On Fri, 30 Oct 2015, Måns Rullgård wrote:
>> >
>> > > Nicolas Pitre <nicolas.pitre@linaro.org> writes:
>> > >
>> > > > OK... I was intrigued, so I adapted my ARM code to the generic case,
>> > > > including the overflow avoidance optimizations. Please have a look and
>> > > > tell me how this works for you.
>> > > >
>> > > > If this patch is accepted upstream, then it could be possible to
>> > > > abstract only the actual multiplication part with some architecture
>> > > > specific assembly.
>> > >
>> > > Good idea.
>> >
>> > Could you please provide a reviewed-by or acked-by tag?
>>
>> Sure!
>>
>> Acked-by: Alexey Brodkin <abrodkin@synopsys.com>
>>
>> BTW I thought about that optimization a bit more and now I think
>> we may even skip addition of arch-specific assembly insertions.
>
> I'm going to do it anyway given that I already have it for ARM. It'll
> be opt-in, so if your arch doesn't provide it then the current C
> implementation will be used by default.
Great. I'll try it out on MIPS once you've posted the patch.
--
Måns Rullgård
mans@mansr.com
* Re: [PATCH] __div64_32: implement division by multiplication for 32-bit arches
2015-10-30 17:45 ` Måns Rullgård
@ 2015-11-04 23:46 ` Nicolas Pitre
2015-11-04 23:48 ` Nicolas Pitre
2015-11-04 23:49 ` Måns Rullgård
0 siblings, 2 replies; 24+ messages in thread
From: Nicolas Pitre @ 2015-11-04 23:46 UTC (permalink / raw)
To: Måns Rullgård
Cc: Alexey Brodkin, shemminger@linux-foundation.org,
linux-kernel@vger.kernel.org, Vineet.Gupta1@synopsys.com,
linux-snps-arc@lists.infradead.org, rmk+kernel@arm.linux.org.uk,
davem@davemloft.net, mingo@elte.hu
On Fri, 30 Oct 2015, Måns Rullgård wrote:
> Nicolas Pitre <nicolas.pitre@linaro.org> writes:
>
> > I'm going to do it anyway given that I already have it for ARM. It'll
> > be opt-in, so if your arch doesn't provide it then the current C
> > implementation will be used by default.
>
> Great. I'll try it out on MIPS once you've posted the patch.
You should have seen the patches by now.
I've put them along with a bunch of do_div() usage fixes here:
http://git.linaro.org/people/nicolas.pitre/linux.git
Nicolas
* Re: [PATCH] __div64_32: implement division by multiplication for 32-bit arches
2015-11-04 23:46 ` Nicolas Pitre
@ 2015-11-04 23:48 ` Nicolas Pitre
2015-11-05 3:13 ` Vineet Gupta
2015-11-04 23:49 ` Måns Rullgård
1 sibling, 1 reply; 24+ messages in thread
From: Nicolas Pitre @ 2015-11-04 23:48 UTC (permalink / raw)
To: Måns Rullgård
Cc: Alexey Brodkin, shemminger@linux-foundation.org,
linux-kernel@vger.kernel.org, Vineet.Gupta1@synopsys.com,
linux-snps-arc@lists.infradead.org, rmk+kernel@arm.linux.org.uk,
davem@davemloft.net, mingo@elte.hu
On Wed, 4 Nov 2015, Nicolas Pitre wrote:
> On Fri, 30 Oct 2015, Måns Rullgård wrote:
>
> > Nicolas Pitre <nicolas.pitre@linaro.org> writes:
> >
> > > I'm going to do it anyway given that I already have it for ARM. It'll
> > > be opt-in, so if your arch doesn't provide it then the current C
> > > implementation will be used by default.
> >
> > Great. I'll try it out on MIPS once you've posted the patch.
>
> You should have seen the patches by now.
>
> I've put them along with a bunch of do_div() usage fixes here:
>
> http://git.linaro.org/people/nicolas.pitre/linux.git
More precisely:
http://git.linaro.org/people/nicolas.pitre/linux.git div64
Nicolas
* Re: [PATCH] __div64_32: implement division by multiplication for 32-bit arches
2015-11-04 23:46 ` Nicolas Pitre
2015-11-04 23:48 ` Nicolas Pitre
@ 2015-11-04 23:49 ` Måns Rullgård
1 sibling, 0 replies; 24+ messages in thread
From: Måns Rullgård @ 2015-11-04 23:49 UTC (permalink / raw)
To: Nicolas Pitre
Cc: Alexey Brodkin, shemminger@linux-foundation.org,
linux-kernel@vger.kernel.org, Vineet.Gupta1@synopsys.com,
linux-snps-arc@lists.infradead.org, rmk+kernel@arm.linux.org.uk,
davem@davemloft.net, mingo@elte.hu
Nicolas Pitre <nicolas.pitre@linaro.org> writes:
> On Fri, 30 Oct 2015, Måns Rullgård wrote:
>
>> Nicolas Pitre <nicolas.pitre@linaro.org> writes:
>>
>> > I'm going to do it anyway given that I already have it for ARM. It'll
>> > be opt-in, so if your arch doesn't provide it then the current C
>> > implementation will be used by default.
>>
>> Great. I'll try it out on MIPS once you've posted the patch.
>
> You should have seen the patches by now.
Yes, I've seen them. I've been busy with other things but hope to get to
it before the end of the week. At first glance, it looks good to me.
--
Måns Rullgård
mans@mansr.com
* Re: [PATCH] __div64_32: implement division by multiplication for 32-bit arches
2015-11-04 23:48 ` Nicolas Pitre
@ 2015-11-05 3:13 ` Vineet Gupta
2015-11-05 5:06 ` Nicolas Pitre
0 siblings, 1 reply; 24+ messages in thread
From: Vineet Gupta @ 2015-11-05 3:13 UTC (permalink / raw)
To: Nicolas Pitre, Måns Rullgård
Cc: Alexey Brodkin, shemminger@linux-foundation.org,
linux-kernel@vger.kernel.org, linux-snps-arc@lists.infradead.org,
rmk+kernel@arm.linux.org.uk, davem@davemloft.net, mingo@elte.hu
On Thursday 05 November 2015 05:18 AM, Nicolas Pitre wrote:
> On Wed, 4 Nov 2015, Nicolas Pitre wrote:
>
>> On Fri, 30 Oct 2015, Måns Rullgård wrote:
>>
>>> Nicolas Pitre <nicolas.pitre@linaro.org> writes:
>>>
>>>> I'm going to do it anyway given that I already have it for ARM. It'll
>>>> be opt-in, so if your arch doesn't provide it then the current C
>>>> implementation will be used by default.
>>> Great. I'll try it out on MIPS once you've posted the patch.
>> You should have seen the patches by now.
>>
>> I've put them along with a bunch of do_div() usage fixes here:
>>
>> http://git.linaro.org/people/nicolas.pitre/linux.git
> More precisely:
>
> http://git.linaro.org/people/nicolas.pitre/linux.git div64
Hi Nico,
While we are on the topic, I was wondering about another optimization in
this area.
The slowpath __div64_32() generates both the quotient and the remainder. The more
common use case in the kernel only cares about the quotient:
git grep "\sdo_div(" | wc -l
841
git grep "=\sdo_div(" | wc -l
116
Is it possible to optimize the code if the remainder is *not* needed explicitly? I
understand that the hand-rolled divide will still need some sort of running tally of
the remainder, but can the code be any better in this case? That way we could
introduce another API, do_div_norem(), and start proliferating it for the cases
where the remainder is not used.
Thx,
-Vineet
* Re: [PATCH] __div64_32: implement division by multiplication for 32-bit arches
2015-11-05 3:13 ` Vineet Gupta
@ 2015-11-05 5:06 ` Nicolas Pitre
0 siblings, 0 replies; 24+ messages in thread
From: Nicolas Pitre @ 2015-11-05 5:06 UTC (permalink / raw)
To: Vineet Gupta
Cc: Måns Rullgård, Alexey Brodkin,
shemminger@linux-foundation.org, linux-kernel@vger.kernel.org,
linux-snps-arc@lists.infradead.org, rmk+kernel@arm.linux.org.uk,
davem@davemloft.net, mingo@elte.hu
On Thu, 5 Nov 2015, Vineet Gupta wrote:
> On Thursday 05 November 2015 05:18 AM, Nicolas Pitre wrote:
> > On Wed, 4 Nov 2015, Nicolas Pitre wrote:
> >
> >> On Fri, 30 Oct 2015, Måns Rullgård wrote:
> >>
> >>> Nicolas Pitre <nicolas.pitre@linaro.org> writes:
> >>>
> >>>> I'm going to do it anyway given that I already have it for ARM. It'll
> >>>> be opt-in, so if your arch doesn't provide it then the current C
> >>>> implementation will be used by default.
> >>> Great. I'll try it out on MIPS once you've posted the patch.
> >> You should have seen the patches by now.
> >>
> >> I've put them along with a bunch of do_div() usage fixes here:
> >>
> >> http://git.linaro.org/people/nicolas.pitre/linux.git
> > More precisely:
> >
> > http://git.linaro.org/people/nicolas.pitre/linux.git div64
>
> Hi Nico,
>
> While we are current on the topic I was wondering about another optimization in
> this area.
> The slowpath __div64_32() generates both quotient and remainder. The more general
> use case in kernel only cares about quotient.
>
> git grep "\sdo_div(" | wc -l
> 841
> git grep "=\sdo_div(" | wc -l
> 116
>
> Is it possible to optimize the code, if remainder was *not* needed explicitly. I
> understand that the hand divide will still need some sort of running tally of
> remainder but can the code be any better in this case. That way we can introduce
> another API do_div_norem() and start proliferating it for the cases where
> remainder is not used.
I don't think you'll be able to optimize the code much. If you look at
the division loop, you always have the current remainder to process as
you say, so when you can't subtract from the remainder anymore you
simply return that value. And on ARM we simply tell the calling code
which register contains the remainder if it wants it. Therefore
on ARM the code would be exactly the same in either case.
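A minimal shift-and-subtract long division (a sketch, not the actual
__div64_32) illustrates the point: the remainder is simply what is left in
the working value when the loop ends, so a remainder-free variant would have
nothing to skip.

```c
#include <stdint.h>

/* Classic bit-by-bit long division: 64-bit dividend, 32-bit divisor.
 * The quotient is built one bit at a time while 'rem' keeps the running
 * remainder; when the loop finishes, rem *is* the remainder, at no
 * extra cost beyond computing the quotient. */
static uint32_t div64_32_sketch(uint64_t *n, uint32_t base)
{
	uint64_t rem = 0, quot = 0;
	int i;

	for (i = 63; i >= 0; i--) {
		rem = (rem << 1) | ((*n >> i) & 1);	/* bring down next bit */
		quot <<= 1;
		if (rem >= base) {			/* trial subtraction */
			rem -= base;
			quot |= 1;
		}
	}
	*n = quot;		/* do_div() convention: quotient in place */
	return (uint32_t)rem;	/* ... and the remainder returned */
}
```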
Nicolas
end of thread, other threads: [~2015-11-05 5:06 UTC | newest]
Thread overview: 24+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2015-10-28 22:47 [PATCH] __div64_32: implement division by multiplication for 32-bit arches Alexey Brodkin
2015-10-28 23:32 ` Nicolas Pitre
2015-10-29 7:34 ` Alexey Brodkin
2015-10-30 1:26 ` Nicolas Pitre
2015-10-30 5:41 ` Vineet Gupta
2015-10-30 12:41 ` Måns Rullgård
2015-10-30 12:40 ` Måns Rullgård
2015-10-30 15:17 ` Nicolas Pitre
2015-10-30 15:54 ` Alexey Brodkin
2015-10-30 16:55 ` Nicolas Pitre
2015-10-30 17:45 ` Måns Rullgård
2015-11-04 23:46 ` Nicolas Pitre
2015-11-04 23:48 ` Nicolas Pitre
2015-11-05 3:13 ` Vineet Gupta
2015-11-05 5:06 ` Nicolas Pitre
2015-11-04 23:49 ` Måns Rullgård
2015-10-30 14:28 ` Alexey Brodkin
2015-10-29 0:36 ` kbuild test robot
2015-10-29 12:52 ` Måns Rullgård
2015-10-29 13:05 ` Alexey Brodkin
2015-10-29 13:37 ` Måns Rullgård
2015-10-29 13:31 ` Russell King - ARM Linux
2015-10-29 14:32 ` Alexey Brodkin
2015-10-29 17:09 ` Randy Dunlap