[PATCH 0/5] 64-by-32 ddivision optimization for constant divisors on 32-bit machines

public inbox for linux-arch@vger.kernel.org
 help / color / mirror / Atom feed

* [PATCH 0/5] 64-by-32 ddivision optimization for constant divisors on 32-bit machines
@ 2015-11-02 22:33 Nicolas Pitre
  2015-11-02 22:33 ` Nicolas Pitre
                   ` (5 more replies)
  0 siblings, 6 replies; 22+ messages in thread
From: Nicolas Pitre @ 2015-11-02 22:33 UTC (permalink / raw)
  To: Alexey Brodkin, Måns Rullgård
  Cc: Arnd Bergmann, rmk+kernel, linux-arch, linux-kernel

This is a generalization of the optimization I produced for ARM a decade
ago to turn constant divisors into a multiplication by the divisor
reciprocal. Turns out that after all those years gcc is still not
optimizing things on its own for that case.

This has important performance benefits as discussed in this thread:

https://lkml.org/lkml/2015/10/28/851

This series brings the formerly ARMonly optimization to all 32-bit
architectures using C code by default.  The possibility for the actual
multiplication to be implemented in assembly is provided in order to get
optimal code.  The ARM version can be used as an example implementation
for other interested architectures to implement.

Nicolas

^ permalink raw reply	[flat|nested] 22+ messages in thread

* [PATCH 0/5] 64-by-32 ddivision optimization for constant divisors on 32-bit machines
  2015-11-02 22:33 [PATCH 0/5] 64-by-32 ddivision optimization for constant divisors on 32-bit machines Nicolas Pitre
@ 2015-11-02 22:33 ` Nicolas Pitre
  2015-11-02 22:33 ` [PATCH 1/5] div64.h: optimize do_div() for power-of-two constant divisors Nicolas Pitre
                   ` (4 subsequent siblings)
  5 siblings, 0 replies; 22+ messages in thread
From: Nicolas Pitre @ 2015-11-02 22:33 UTC (permalink / raw)
  To: Alexey Brodkin, Måns Rullgård
  Cc: Arnd Bergmann, rmk+kernel, linux-arch, linux-kernel

This is a generalization of the optimization I produced for ARM a decade
ago to turn constant divisors into a multiplication by the divisor
reciprocal. Turns out that after all those years gcc is still not
optimizing things on its own for that case.

This has important performance benefits as discussed in this thread:

https://lkml.org/lkml/2015/10/28/851

This series brings the formerly ARMonly optimization to all 32-bit
architectures using C code by default.  The possibility for the actual
multiplication to be implemented in assembly is provided in order to get
optimal code.  The ARM version can be used as an example implementation
for other interested architectures to implement.

Nicolas

^ permalink raw reply	[flat|nested] 22+ messages in thread

* [PATCH 1/5] div64.h: optimize do_div() for power-of-two constant divisors
  2015-11-02 22:33 [PATCH 0/5] 64-by-32 ddivision optimization for constant divisors on 32-bit machines Nicolas Pitre
  2015-11-02 22:33 ` Nicolas Pitre
@ 2015-11-02 22:33 ` Nicolas Pitre
  2015-11-02 22:33   ` Nicolas Pitre
  2015-11-02 22:33 ` [PATCH 2/5] do_div(): generic optimization for constant divisor on 32-bit machines Nicolas Pitre
                   ` (3 subsequent siblings)
  5 siblings, 1 reply; 22+ messages in thread
From: Nicolas Pitre @ 2015-11-02 22:33 UTC (permalink / raw)
  To: Alexey Brodkin, Måns Rullgård
  Cc: Arnd Bergmann, rmk+kernel, linux-arch, linux-kernel

On 32-bit targets, gcc is able to do the right thing with a constant
divisor that happens to be a power of two i.e. it turns the division
into a right shift inline.

Signed-off-by: Nicolas Pitre <nico@linaro.org>
---
 include/asm-generic/div64.h | 7 ++++++-
 1 file changed, 6 insertions(+), 1 deletion(-)

diff --git a/include/asm-generic/div64.h b/include/asm-generic/div64.h
index 8f4e319334..47aa1e2134 100644
--- a/include/asm-generic/div64.h
+++ b/include/asm-generic/div64.h
@@ -41,7 +41,12 @@ extern uint32_t __div64_32(uint64_t *dividend, uint32_t divisor);
 	uint32_t __base = (base);			\
 	uint32_t __rem;					\
 	(void)(((typeof((n)) *)0) == ((uint64_t *)0));	\
-	if (likely(((n) >> 32) == 0)) {			\
+	if (__builtin_constant_p(__base) &&		\
+	    (__base & (__base - 1)) == 0) {		\
+		/* constant power of 2: gcc is fine */	\
+		__rem = (n) & (__base - 1);		\
+		(n) /= __base;				\
+	} else if (likely(((n) >> 32) == 0)) {		\
 		__rem = (uint32_t)(n) % __base;		\
 		(n) = (uint32_t)(n) / __base;		\
 	} else 						\
-- 
2.4.3

^ permalink raw reply related	[flat|nested] 22+ messages in thread

* [PATCH 1/5] div64.h: optimize do_div() for power-of-two constant divisors
  2015-11-02 22:33 ` [PATCH 1/5] div64.h: optimize do_div() for power-of-two constant divisors Nicolas Pitre
@ 2015-11-02 22:33   ` Nicolas Pitre
  0 siblings, 0 replies; 22+ messages in thread
From: Nicolas Pitre @ 2015-11-02 22:33 UTC (permalink / raw)
  To: Alexey Brodkin, Måns Rullgård
  Cc: Arnd Bergmann, rmk+kernel, linux-arch, linux-kernel

On 32-bit targets, gcc is able to do the right thing with a constant
divisor that happens to be a power of two i.e. it turns the division
into a right shift inline.

Signed-off-by: Nicolas Pitre <nico@linaro.org>
---
 include/asm-generic/div64.h | 7 ++++++-
 1 file changed, 6 insertions(+), 1 deletion(-)

diff --git a/include/asm-generic/div64.h b/include/asm-generic/div64.h
index 8f4e319334..47aa1e2134 100644
--- a/include/asm-generic/div64.h
+++ b/include/asm-generic/div64.h
@@ -41,7 +41,12 @@ extern uint32_t __div64_32(uint64_t *dividend, uint32_t divisor);
 	uint32_t __base = (base);			\
 	uint32_t __rem;					\
 	(void)(((typeof((n)) *)0) == ((uint64_t *)0));	\
-	if (likely(((n) >> 32) == 0)) {			\
+	if (__builtin_constant_p(__base) &&		\
+	    (__base & (__base - 1)) == 0) {		\
+		/* constant power of 2: gcc is fine */	\
+		__rem = (n) & (__base - 1);		\
+		(n) /= __base;				\
+	} else if (likely(((n) >> 32) == 0)) {		\
 		__rem = (uint32_t)(n) % __base;		\
 		(n) = (uint32_t)(n) / __base;		\
 	} else 						\
-- 
2.4.3


^ permalink raw reply related	[flat|nested] 22+ messages in thread

* [PATCH 2/5] do_div(): generic optimization for constant divisor on 32-bit machines
  2015-11-02 22:33 [PATCH 0/5] 64-by-32 ddivision optimization for constant divisors on 32-bit machines Nicolas Pitre
  2015-11-02 22:33 ` Nicolas Pitre
  2015-11-02 22:33 ` [PATCH 1/5] div64.h: optimize do_div() for power-of-two constant divisors Nicolas Pitre
@ 2015-11-02 22:33 ` Nicolas Pitre
  2015-11-03  5:32   ` kbuild test robot
  2015-11-02 22:33 ` [PATCH 3/5] __div64_const32(): abstract out the actual 128-bit cross product code Nicolas Pitre
                   ` (2 subsequent siblings)
  5 siblings, 1 reply; 22+ messages in thread
From: Nicolas Pitre @ 2015-11-02 22:33 UTC (permalink / raw)
  To: Alexey Brodkin, Måns Rullgård
  Cc: Arnd Bergmann, rmk+kernel, linux-arch, linux-kernel

64-by-32-bit divisions are prominent in the kernel, even on 32-bit
machines.  Luckily, many of them use a constant divisor that allows
for a much faster multiplication by the divisor's reciprocal.

The compiler already performs this optimization when compiling a 32-by-32
division with a constant divisor. Unfortunately, on 32-bit machines, gcc
does not optimize 64-by-32 divisions in that case, except for constant
divisors that happen to be a power of 2.

Let's avoid the slow path whenever the divisor is constant by manually
computing the reciprocal ourselves and performing the multiplication
inline.  This improves performance of 64-by-32 divisions by about two
orders of magnitude in most cases.

The algorithm used here comes from the existing ARM code.

The __div64_const32_is_OK macro can be predefined by architectures to
disable this optimization in some cases. For example, some ancient gcc
version on ARM would crash with an ICE when fed this code.

Signed-off-by: Nicolas Pitre <nico@linaro.org>
---
 include/asm-generic/div64.h | 159 ++++++++++++++++++++++++++++++++++++++++++++
 1 file changed, 159 insertions(+)

diff --git a/include/asm-generic/div64.h b/include/asm-generic/div64.h
index 47aa1e2134..c3612873e4 100644
--- a/include/asm-generic/div64.h
+++ b/include/asm-generic/div64.h
@@ -4,6 +4,9 @@
  * Copyright (C) 2003 Bernardo Innocenti <bernie@develer.com>
  * Based on former asm-ppc/div64.h and asm-m68knommu/div64.h
  *
+ * Optimization for constant divisors on 32-bit machines:
+ * Copyright (C) 2006-2015 Nicolas Pitre
+ *
  * The semantics of do_div() are:
  *
  * uint32_t do_div(uint64_t *n, uint32_t base)
@@ -32,6 +35,154 @@
 
 #elif BITS_PER_LONG == 32
 
+/* macroized fls implementation to ensure proper constant propagation */
+#define __div64_fls(bits)						\
+({								\
+	unsigned int __left = (bits), __nr = 0;			\
+	if (__left & 0xffff0000) __nr += 16, __left >>= 16;	\
+	if (__left & 0x0000ff00) __nr +=  8, __left >>=  8;	\
+	if (__left & 0x000000f0) __nr +=  4, __left >>=  4;	\
+	if (__left & 0x0000000c) __nr +=  2, __left >>=  2;	\
+	if (__left & 0x00000002) __nr +=  1;			\
+	__nr;							\
+})
+
+/*
+ * If the divisor happens to be constant, we determine the appropriate
+ * inverse at compile time to turn the division into a few inline
+ * multiplications which ought to be much faster. And yet only if compiling
+ * with a sufficiently recent gcc version to perform proper 64-bit constant
+ * propagation.
+ *
+ * (It is unfortunate that gcc doesn't perform all this internally.)
+ */
+
+#ifndef __div64_const32_is_OK
+#define __div64_const32_is_OK (__GNUC__ >= 4)
+#endif
+
+#define __div64_const32(n, ___b)					\
+({									\
+	/*								\
+	 * Multiplication by reciprocal of b: n / b = n * (p / b) / p	\
+	 *								\
+	 * We rely on the fact that most of this code gets optimized	\
+	 * away at compile time due to constant propagation and only	\
+	 * a few multiplication instructions should remain.		\
+	 * Hence this monstrous macro (static inline doesn't always	\
+	 * do the trick here).						\
+	 */								\
+	uint64_t ___res, ___x, ___t, ___m, ___n = (n);			\
+	uint32_t ___p, ___bias, ___m_lo, ___m_hi, ___n_lo, ___n_hi;	\
+									\
+	/* determine number of bits to represent b */			\
+	___p = 1 << __div64_fls(___b);					\
+									\
+	/* compute m = ((p << 64) + b - 1) / b */			\
+	___m = (~0ULL / ___b) * ___p;					\
+	___m += (((~0ULL % ___b + 1) * ___p) + ___b - 1) / ___b;	\
+									\
+	/* one less than the dividend with highest result */		\
+	___x = ~0ULL / ___b * ___b - 1;					\
+									\
+	/* test our ___m with res = m * x / (p << 64) */		\
+	___res = ((___m & 0xffffffff) * (___x & 0xffffffff)) >> 32;	\
+	___t = ___res += (___m & 0xffffffff) * (___x >> 32);		\
+	___res += (___x & 0xffffffff) * (___m >> 32);			\
+	___t = (___res < ___t) ? (1ULL << 32) : 0;			\
+	___res = (___res >> 32) + ___t;					\
+	___res += (___m >> 32) * (___x >> 32);				\
+	___res /= ___p;							\
+									\
+	/* Now sanitize and optimize what we've got. */			\
+	if (~0ULL % (___b / (___b & -___b)) == 0) {			\
+		/* special case, can be simplified to ... */		\
+		___n /= (___b & -___b);					\
+		___m = ~0ULL / (___b / (___b & -___b));			\
+		___p = 1;						\
+		___bias = 1;						\
+	} else if (___res != ___x / ___b) {				\
+		/*							\
+		 * We can't get away without a bias to compensate	\
+		 * for bit truncation errors.  To avoid it we'd need an	\
+		 * additional bit to represent m which would overflow	\
+		 * a 64-bit variable.					\
+		 *							\
+		 * Instead we do m = p / b and n / b = (n * m + m) / p.	\
+		 */							\
+		___bias = 1;						\
+		/* Compute m = (p << 64) / b */				\
+		___m = (~0ULL / ___b) * ___p;				\
+		___m += ((~0ULL % ___b + 1) * ___p) / ___b;		\
+	} else {							\
+		/*							\
+		 * Reduce m / p, and try to clear bit 31 of m when	\
+		 * possible, otherwise that'll need extra overflow	\
+		 * handling later.					\
+		 */							\
+		uint32_t ___bits = -(___m & -___m);			\
+		___bits |= ___m >> 32;					\
+		___bits = (~___bits) << 1;				\
+		/*							\
+		 * If ___bits == 0 then setting bit 31 is  unavoidable.	\
+		 * Simply apply the maximum possible reduction in that	\
+		 * case. Otherwise the MSB of ___bits indicates the	\
+		 * best reduction we should apply.			\
+		 */							\
+		if (!___bits) {						\
+			___p /= (___m & -___m);				\
+			___m /= (___m & -___m);				\
+		} else {						\
+			___p >>= __div64_fls(___bits);			\
+			___m >>= __div64_fls(___bits);			\
+		}							\
+		/* No bias needed. */					\
+		___bias = 0;						\
+	}								\
+									\
+	/*								\
+	 * Now we have a combination of 2 conditions:			\
+	 *								\
+	 * 1) whether or not we need to apply a bias, and		\
+	 *								\
+	 * 2) whether or not there might be an overflow in the cross	\
+	 *    product determined by (___m & ((1 << 63) | (1 << 31))).	\
+	 *								\
+	 * Select the best way to do (m_bias + m * n) / (p << 64).	\
+	 * From now on there will be actual runtime code generated.	\
+	 */								\
+									\
+	___m_lo = ___m;							\
+	___m_hi = ___m >> 32;						\
+	___n_lo = ___n;							\
+	___n_hi = ___n >> 32;						\
+									\
+	if (!___bias) {							\
+		___res = ((uint64_t)___m_lo * ___n_lo) >> 32;		\
+	} else if (!(___m & ((1ULL << 63) | (1ULL << 31)))) {		\
+		___res = (___m + (uint64_t)___m_lo * ___n_lo) >> 32;	\
+	} else {							\
+		___res = ___m + (uint64_t)___m_lo * ___n_lo;		\
+		___t = (___res < ___m) ? (1ULL << 32) : 0;		\
+		___res = (___res >> 32) + ___t;				\
+	}								\
+									\
+	if (!(___m & ((1ULL << 63) | (1ULL << 31)))) {			\
+		___res += (uint64_t)___m_lo * ___n_hi;			\
+		___res += (uint64_t)___m_hi * ___n_lo;			\
+		___res >>= 32;						\
+	} else {							\
+		___t = ___res += (uint64_t)___m_lo * ___n_hi;		\
+		___res += (uint64_t)___m_hi * ___n_lo;			\
+		___t = (___res < ___t) ? (1ULL << 32) : 0;		\
+		___res = (___res >> 32) + ___t;				\
+	}								\
+									\
+	___res += (uint64_t)___m_hi * ___n_hi;				\
+									\
+	___res /= ___p;							\
+})
+
 extern uint32_t __div64_32(uint64_t *dividend, uint32_t divisor);
 
 /* The unnecessary pointer compare is there
@@ -46,6 +197,14 @@ extern uint32_t __div64_32(uint64_t *dividend, uint32_t divisor);
 		/* constant power of 2: gcc is fine */	\
 		__rem = (n) & (__base - 1);		\
 		(n) /= __base;				\
+	} else if (__div64_const32_is_OK &&		\
+		   __builtin_constant_p(__base) &&	\
+		   __base != 0) {			\
+		uint32_t __res_lo, __n_lo = (n);	\
+		(n) = __div64_const32(n, __base);	\
+		/* the remainder can be computed with 32-bit regs */ \
+		__res_lo = (n);				\
+		__rem = __n_lo - __res_lo * __base;	\
 	} else if (likely(((n) >> 32) == 0)) {		\
 		__rem = (uint32_t)(n) % __base;		\
 		(n) = (uint32_t)(n) / __base;		\
-- 
2.4.3

^ permalink raw reply related	[flat|nested] 22+ messages in thread

* Re: [PATCH 2/5] do_div(): generic optimization for constant divisor on 32-bit machines
  2015-11-02 22:33 ` [PATCH 2/5] do_div(): generic optimization for constant divisor on 32-bit machines Nicolas Pitre
@ 2015-11-03  5:32   ` kbuild test robot
  2015-11-03  9:15     ` Arnd Bergmann
  0 siblings, 1 reply; 22+ messages in thread
From: kbuild test robot @ 2015-11-03  5:32 UTC (permalink / raw)
  To: Nicolas Pitre
  Cc: kbuild-all, Alexey Brodkin, Måns Rullgård,
	Arnd Bergmann, rmk+kernel, linux-arch, linux-kernel

[-- Attachment #1: Type: text/plain, Size: 2872 bytes --]

Hi Nicolas,

[auto build test WARNING on asm-generic/master -- if it's inappropriate base, please suggest rules for selecting the more suitable base]

url:    https://github.com/0day-ci/linux/commits/Nicolas-Pitre/div64-h-optimize-do_div-for-power-of-two-constant-divisors/20151103-065348
config: parisc-allyesconfig (attached as .config)
reproduce:
        wget https://git.kernel.org/cgit/linux/kernel/git/wfg/lkp-tests.git/plain/sbin/make.cross -O ~/bin/make.cross
        chmod +x ~/bin/make.cross
        # save the attached .config to linux build tree
        make.cross ARCH=parisc 

All warnings (new ones prefixed by >>):

   net/can/bcm.c: In function 'bcm_proc_show':
>> net/can/bcm.c:223:1: warning: the frame size of 1156 bytes is larger than 1024 bytes [-Wframe-larger-than=]
    }
    ^

vim +223 net/can/bcm.c

6755aeba Eric Dumazet    2009-11-06  207  				op->can_id,
6755aeba Eric Dumazet    2009-11-06  208  				bcm_proc_getifname(ifname, op->ifindex),
ffd980f9 Oliver Hartkopp 2007-11-16  209  				op->nframes);
ffd980f9 Oliver Hartkopp 2007-11-16  210  
73e87e02 Oliver Hartkopp 2008-04-15  211  		if (op->kt_ival1.tv64)
ea00b8e2 Alexey Dobriyan 2009-08-28  212  			seq_printf(m, "t1=%lld ",
73e87e02 Oliver Hartkopp 2008-04-15  213  					(long long) ktime_to_us(op->kt_ival1));
73e87e02 Oliver Hartkopp 2008-04-15  214  
73e87e02 Oliver Hartkopp 2008-04-15  215  		if (op->kt_ival2.tv64)
ea00b8e2 Alexey Dobriyan 2009-08-28  216  			seq_printf(m, "t2=%lld ",
73e87e02 Oliver Hartkopp 2008-04-15  217  					(long long) ktime_to_us(op->kt_ival2));
ffd980f9 Oliver Hartkopp 2007-11-16  218  
ea00b8e2 Alexey Dobriyan 2009-08-28  219  		seq_printf(m, "# sent %ld\n", op->frames_abs);
ffd980f9 Oliver Hartkopp 2007-11-16  220  	}
ea00b8e2 Alexey Dobriyan 2009-08-28  221  	seq_putc(m, '\n');
ea00b8e2 Alexey Dobriyan 2009-08-28  222  	return 0;
ffd980f9 Oliver Hartkopp 2007-11-16 @223  }
ffd980f9 Oliver Hartkopp 2007-11-16  224  
ea00b8e2 Alexey Dobriyan 2009-08-28  225  static int bcm_proc_open(struct inode *inode, struct file *file)
ea00b8e2 Alexey Dobriyan 2009-08-28  226  {
d9dda78b Al Viro         2013-03-31  227  	return single_open(file, bcm_proc_show, PDE_DATA(inode));
ffd980f9 Oliver Hartkopp 2007-11-16  228  }
ffd980f9 Oliver Hartkopp 2007-11-16  229  
ea00b8e2 Alexey Dobriyan 2009-08-28  230  static const struct file_operations bcm_proc_fops = {
ea00b8e2 Alexey Dobriyan 2009-08-28  231  	.owner		= THIS_MODULE,

:::::: The code at line 223 was first introduced by commit
:::::: ffd980f976e7fd666c2e61bf8ab35107efd11828 [CAN]: Add broadcast manager (bcm) protocol

:::::: TO: Oliver Hartkopp <oliver.hartkopp@volkswagen.de>
:::::: CC: David S. Miller <davem@davemloft.net>

---
0-DAY kernel test infrastructure                Open Source Technology Center
https://lists.01.org/pipermail/kbuild-all                   Intel Corporation

[-- Attachment #2: .config.gz --]
[-- Type: application/octet-stream, Size: 42341 bytes --]

^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: [PATCH 2/5] do_div(): generic optimization for constant divisor on 32-bit machines
  2015-11-03  5:32   ` kbuild test robot
@ 2015-11-03  9:15     ` Arnd Bergmann
  2015-11-04 21:04       ` Nicolas Pitre
  0 siblings, 1 reply; 22+ messages in thread
From: Arnd Bergmann @ 2015-11-03  9:15 UTC (permalink / raw)
  To: kbuild test robot
  Cc: Nicolas Pitre, kbuild-all, Alexey Brodkin,
	Måns Rullgård, rmk+kernel, linux-arch, linux-kernel

On Tuesday 03 November 2015 13:32:17 kbuild test robot wrote:
> 
>    net/can/bcm.c: In function 'bcm_proc_show':
> >> net/can/bcm.c:223:1: warning: the frame size of 1156 bytes is larger than 1024 bytes [-Wframe-larger-than=]
>     }

Interesting, that is a lot of stack for a function that only has a couple
of local variables:

#define IFNAMSIZ        16
        char ifname[IFNAMSIZ];
        struct sock *sk = (struct sock *)m->private;
        struct bcm_sock *bo = bcm_sk(sk);
        struct bcm_op *op;


This is a parisc-allyesconfig kernel, so I assume that CONFIG_PROFILE_ALL_BRANCHES
is on, which instruments every 'if' in the kernel. If that causes problems,
we could decide to disable the do_div optimization whenever CONFIG_PROFILE_ALL_BRANCHES
is enabled.

	Arnd

^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: [PATCH 2/5] do_div(): generic optimization for constant divisor on 32-bit machines
  2015-11-03  9:15     ` Arnd Bergmann
@ 2015-11-04 21:04       ` Nicolas Pitre
  2015-11-04 21:42         ` Måns Rullgård
  0 siblings, 1 reply; 22+ messages in thread
From: Nicolas Pitre @ 2015-11-04 21:04 UTC (permalink / raw)
  To: Arnd Bergmann
  Cc: kbuild test robot, kbuild-all, Alexey Brodkin,
	Måns Rullgård, rmk+kernel, linux-arch, linux-kernel

On Tue, 3 Nov 2015, Arnd Bergmann wrote:

> On Tuesday 03 November 2015 13:32:17 kbuild test robot wrote:
> > 
> >    net/can/bcm.c: In function 'bcm_proc_show':
> > >> net/can/bcm.c:223:1: warning: the frame size of 1156 bytes is larger than 1024 bytes [-Wframe-larger-than=]
> >     }
> 
> Interesting, that is a lot of stack for a function that only has a couple
> of local variables:
> 
> #define IFNAMSIZ        16
>         char ifname[IFNAMSIZ];
>         struct sock *sk = (struct sock *)m->private;
>         struct bcm_sock *bo = bcm_sk(sk);
>         struct bcm_op *op;
> 
> 
> This is a parisc-allyesconfig kernel, so I assume that CONFIG_PROFILE_ALL_BRANCHES
> is on, which instruments every 'if' in the kernel. If that causes problems,
> we could decide to disable the do_div optimization whenever CONFIG_PROFILE_ALL_BRANCHES
> is enabled.

I have an ARM allyesconfig build here where that function needs a frame 
of 88 bytes only. And that is with my do_div optimization applied.

With the do_div optimization turned off, the stack frame is still 88 
bytes.

Turning on CONFIG_PROFILE_ALL_BRANCHES makes the frame size to grow to 
96 bytes.

Keeping CONFIG_PROFILE_ALL_BRANCHES=y and activating the do_div 
optimization again, and the function frame size goes back to 88 bytes.

So I wonder what parisc gcc could be doing with this code.


Nicolas

^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: [PATCH 2/5] do_div(): generic optimization for constant divisor on 32-bit machines
  2015-11-04 21:04       ` Nicolas Pitre
@ 2015-11-04 21:42         ` Måns Rullgård
  2015-11-04 21:42           ` Måns Rullgård
  0 siblings, 1 reply; 22+ messages in thread
From: Måns Rullgård @ 2015-11-04 21:42 UTC (permalink / raw)
  To: Nicolas Pitre
  Cc: Arnd Bergmann, kbuild test robot, kbuild-all, Alexey Brodkin,
	rmk+kernel, linux-arch, linux-kernel

Nicolas Pitre <nicolas.pitre@linaro.org> writes:

> On Tue, 3 Nov 2015, Arnd Bergmann wrote:
>
>> On Tuesday 03 November 2015 13:32:17 kbuild test robot wrote:
>> > 
>> >    net/can/bcm.c: In function 'bcm_proc_show':
>> > >> net/can/bcm.c:223:1: warning: the frame size of 1156 bytes is larger than 1024 bytes [-Wframe-larger-than=]
>> >     }
>> 
>> Interesting, that is a lot of stack for a function that only has a couple
>> of local variables:
>> 
>> #define IFNAMSIZ        16
>>         char ifname[IFNAMSIZ];
>>         struct sock *sk = (struct sock *)m->private;
>>         struct bcm_sock *bo = bcm_sk(sk);
>>         struct bcm_op *op;
>> 
>> 
>> This is a parisc-allyesconfig kernel, so I assume that CONFIG_PROFILE_ALL_BRANCHES
>> is on, which instruments every 'if' in the kernel. If that causes problems,
>> we could decide to disable the do_div optimization whenever CONFIG_PROFILE_ALL_BRANCHES
>> is enabled.
>
> I have an ARM allyesconfig build here where that function needs a frame 
> of 88 bytes only. And that is with my do_div optimization applied.
>
> With the do_div optimization turned off, the stack frame is still 88 
> bytes.
>
> Turning on CONFIG_PROFILE_ALL_BRANCHES makes the frame size to grow to 
> 96 bytes.
>
> Keeping CONFIG_PROFILE_ALL_BRANCHES=y and activating the do_div 
> optimization again, and the function frame size goes back to 88 bytes.
>
> So I wonder what parisc gcc could be doing with this code.

I've seen parisc gcc do many strange things.

-- 
Måns Rullgård
mans@mansr.com

^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: [PATCH 2/5] do_div(): generic optimization for constant divisor on 32-bit machines
  2015-11-04 21:42         ` Måns Rullgård
@ 2015-11-04 21:42           ` Måns Rullgård
  0 siblings, 0 replies; 22+ messages in thread
From: Måns Rullgård @ 2015-11-04 21:42 UTC (permalink / raw)
  To: Nicolas Pitre
  Cc: Arnd Bergmann, kbuild test robot, kbuild-all, Alexey Brodkin,
	rmk+kernel, linux-arch, linux-kernel

Nicolas Pitre <nicolas.pitre@linaro.org> writes:

> On Tue, 3 Nov 2015, Arnd Bergmann wrote:
>
>> On Tuesday 03 November 2015 13:32:17 kbuild test robot wrote:
>> > 
>> >    net/can/bcm.c: In function 'bcm_proc_show':
>> > >> net/can/bcm.c:223:1: warning: the frame size of 1156 bytes is larger than 1024 bytes [-Wframe-larger-than=]
>> >     }
>> 
>> Interesting, that is a lot of stack for a function that only has a couple
>> of local variables:
>> 
>> #define IFNAMSIZ        16
>>         char ifname[IFNAMSIZ];
>>         struct sock *sk = (struct sock *)m->private;
>>         struct bcm_sock *bo = bcm_sk(sk);
>>         struct bcm_op *op;
>> 
>> 
>> This is a parisc-allyesconfig kernel, so I assume that CONFIG_PROFILE_ALL_BRANCHES
>> is on, which instruments every 'if' in the kernel. If that causes problems,
>> we could decide to disable the do_div optimization whenever CONFIG_PROFILE_ALL_BRANCHES
>> is enabled.
>
> I have an ARM allyesconfig build here where that function needs a frame 
> of 88 bytes only. And that is with my do_div optimization applied.
>
> With the do_div optimization turned off, the stack frame is still 88 
> bytes.
>
> Turning on CONFIG_PROFILE_ALL_BRANCHES makes the frame size to grow to 
> 96 bytes.
>
> Keeping CONFIG_PROFILE_ALL_BRANCHES=y and activating the do_div 
> optimization again, and the function frame size goes back to 88 bytes.
>
> So I wonder what parisc gcc could be doing with this code.

I've seen parisc gcc do many strange things.

-- 
Måns Rullgård
mans@mansr.com

^ permalink raw reply	[flat|nested] 22+ messages in thread

* [PATCH 3/5] __div64_const32(): abstract out the actual 128-bit cross product code
  2015-11-02 22:33 [PATCH 0/5] 64-by-32 ddivision optimization for constant divisors on 32-bit machines Nicolas Pitre
                   ` (2 preceding siblings ...)
  2015-11-02 22:33 ` [PATCH 2/5] do_div(): generic optimization for constant divisor on 32-bit machines Nicolas Pitre
@ 2015-11-02 22:33 ` Nicolas Pitre
  2015-11-02 22:33   ` Nicolas Pitre
  2015-11-02 22:33 ` [PATCH 4/5] __div64_32(): make it overridable at compile time Nicolas Pitre
  2015-11-02 22:33 ` [PATCH 5/5] ARM: asm/div64.h: adjust to generic codde Nicolas Pitre
  5 siblings, 1 reply; 22+ messages in thread
From: Nicolas Pitre @ 2015-11-02 22:33 UTC (permalink / raw)
  To: Alexey Brodkin, Måns Rullgård
  Cc: Arnd Bergmann, rmk+kernel, linux-arch, linux-kernel

The default C implementation for the 128-bit cross product is abstracted
into the __arch_xprod_64() macro that can be overridden to let
architectures provide their own assembly optimized implementation.

There are many advantages to an assembly version for this operation.
Carry bit handling becomes trivial, and 32-bit shifts may be achieved
simply by inverting register pairs on some architectures.  This has the
potential to be quite faster and use much fewer instructions.

Signed-off-by: Nicolas Pitre <nico@linaro.org>
---
 include/asm-generic/div64.h | 81 ++++++++++++++++++++++++++++-----------------
 1 file changed, 51 insertions(+), 30 deletions(-)

diff --git a/include/asm-generic/div64.h b/include/asm-generic/div64.h
index c3612873e4..34722c5a80 100644
--- a/include/asm-generic/div64.h
+++ b/include/asm-generic/div64.h
@@ -73,7 +73,7 @@
 	 * do the trick here).						\
 	 */								\
 	uint64_t ___res, ___x, ___t, ___m, ___n = (n);			\
-	uint32_t ___p, ___bias, ___m_lo, ___m_hi, ___n_lo, ___n_hi;	\
+	uint32_t ___p, ___bias;						\
 									\
 	/* determine number of bits to represent b */			\
 	___p = 1 << __div64_fls(___b);					\
@@ -148,41 +148,62 @@
 	 * 2) whether or not there might be an overflow in the cross	\
 	 *    product determined by (___m & ((1 << 63) | (1 << 31))).	\
 	 *								\
-	 * Select the best way to do (m_bias + m * n) / (p << 64).	\
+	 * Select the best way to do (m_bias + m * n) / (1 << 64).	\
 	 * From now on there will be actual runtime code generated.	\
 	 */								\
-									\
-	___m_lo = ___m;							\
-	___m_hi = ___m >> 32;						\
-	___n_lo = ___n;							\
-	___n_hi = ___n >> 32;						\
-									\
-	if (!___bias) {							\
-		___res = ((uint64_t)___m_lo * ___n_lo) >> 32;		\
-	} else if (!(___m & ((1ULL << 63) | (1ULL << 31)))) {		\
-		___res = (___m + (uint64_t)___m_lo * ___n_lo) >> 32;	\
-	} else {							\
-		___res = ___m + (uint64_t)___m_lo * ___n_lo;		\
-		___t = (___res < ___m) ? (1ULL << 32) : 0;		\
-		___res = (___res >> 32) + ___t;				\
-	}								\
-									\
-	if (!(___m & ((1ULL << 63) | (1ULL << 31)))) {			\
-		___res += (uint64_t)___m_lo * ___n_hi;			\
-		___res += (uint64_t)___m_hi * ___n_lo;			\
-		___res >>= 32;						\
-	} else {							\
-		___t = ___res += (uint64_t)___m_lo * ___n_hi;		\
-		___res += (uint64_t)___m_hi * ___n_lo;			\
-		___t = (___res < ___t) ? (1ULL << 32) : 0;		\
-		___res = (___res >> 32) + ___t;				\
-	}								\
-									\
-	___res += (uint64_t)___m_hi * ___n_hi;				\
+	___res = __arch_xprod_64(___m, ___n, ___bias);			\
 									\
 	___res /= ___p;							\
 })
 
+#ifndef __arch_xprod_64
+/*
+ * Default C implementation for __arch_xprod_64()
+ *
+ * Prototype: uint64_t __arch_xprod_64(const uint64_t m, uint64_t n, bool bias)
+ * Semantic:  retval = ((bias ? m : 0) + m * n) >> 64
+ *
+ * The product is a 128-bit value, scaled down to 64 bits.
+ * Assuming constant propagation to optimize away unused conditional code.
+ * Architectures may provide their own optimized assembly implementation.
+ */
+static inline uint64_t __arch_xprod_64(const uint64_t m, uint64_t n, bool bias)
+{
+	uint32_t m_lo = m;
+	uint32_t m_hi = m >> 32;
+	uint32_t n_lo = n;
+	uint32_t n_hi = n >> 32;
+	uint64_t res, tmp;
+
+	if (!bias) {
+		res = ((uint64_t)m_lo * n_lo) >> 32;
+	} else if (!(m & ((1ULL << 63) | (1ULL << 31)))) {
+		/* there can't be any overflow here */
+		res = (m + (uint64_t)m_lo * n_lo) >> 32;
+	} else {
+		res = m + (uint64_t)m_lo * n_lo;
+		tmp = (res < m) ? (1ULL << 32) : 0;
+		res = (res >> 32) + tmp;
+	}
+
+	if (!(m & ((1ULL << 63) | (1ULL << 31)))) {
+		/* there can't be any overflow here */
+		res += (uint64_t)m_lo * n_hi;
+		res += (uint64_t)m_hi * n_lo;
+		res >>= 32;
+	} else {
+		tmp = res += (uint64_t)m_lo * n_hi;
+		res += (uint64_t)m_hi * n_lo;
+		tmp = (res < tmp) ? (1ULL << 32) : 0;
+		res = (res >> 32) + tmp;
+	}
+
+	res += (uint64_t)m_hi * n_hi;
+
+	return res;
+}
+#endif
+
 extern uint32_t __div64_32(uint64_t *dividend, uint32_t divisor);
 
 /* The unnecessary pointer compare is there
-- 
2.4.3

^ permalink raw reply related	[flat|nested] 22+ messages in thread

* [PATCH 3/5] __div64_const32(): abstract out the actual 128-bit cross product code
  2015-11-02 22:33 ` [PATCH 3/5] __div64_const32(): abstract out the actual 128-bit cross product code Nicolas Pitre
@ 2015-11-02 22:33   ` Nicolas Pitre
  0 siblings, 0 replies; 22+ messages in thread
From: Nicolas Pitre @ 2015-11-02 22:33 UTC (permalink / raw)
  To: Alexey Brodkin, Måns Rullgård
  Cc: Arnd Bergmann, rmk+kernel, linux-arch, linux-kernel

The default C implementation for the 128-bit cross product is abstracted
into the __arch_xprod_64() macro that can be overridden to let
architectures provide their own assembly optimized implementation.

There are many advantages to an assembly version for this operation.
Carry bit handling becomes trivial, and 32-bit shifts may be achieved
simply by inverting register pairs on some architectures.  This has the
potential to be quite faster and use much fewer instructions.

Signed-off-by: Nicolas Pitre <nico@linaro.org>
---
 include/asm-generic/div64.h | 81 ++++++++++++++++++++++++++++-----------------
 1 file changed, 51 insertions(+), 30 deletions(-)

diff --git a/include/asm-generic/div64.h b/include/asm-generic/div64.h
index c3612873e4..34722c5a80 100644
--- a/include/asm-generic/div64.h
+++ b/include/asm-generic/div64.h
@@ -73,7 +73,7 @@
 	 * do the trick here).						\
 	 */								\
 	uint64_t ___res, ___x, ___t, ___m, ___n = (n);			\
-	uint32_t ___p, ___bias, ___m_lo, ___m_hi, ___n_lo, ___n_hi;	\
+	uint32_t ___p, ___bias;						\
 									\
 	/* determine number of bits to represent b */			\
 	___p = 1 << __div64_fls(___b);					\
@@ -148,41 +148,62 @@
 	 * 2) whether or not there might be an overflow in the cross	\
 	 *    product determined by (___m & ((1 << 63) | (1 << 31))).	\
 	 *								\
-	 * Select the best way to do (m_bias + m * n) / (p << 64).	\
+	 * Select the best way to do (m_bias + m * n) / (1 << 64).	\
 	 * From now on there will be actual runtime code generated.	\
 	 */								\
-									\
-	___m_lo = ___m;							\
-	___m_hi = ___m >> 32;						\
-	___n_lo = ___n;							\
-	___n_hi = ___n >> 32;						\
-									\
-	if (!___bias) {							\
-		___res = ((uint64_t)___m_lo * ___n_lo) >> 32;		\
-	} else if (!(___m & ((1ULL << 63) | (1ULL << 31)))) {		\
-		___res = (___m + (uint64_t)___m_lo * ___n_lo) >> 32;	\
-	} else {							\
-		___res = ___m + (uint64_t)___m_lo * ___n_lo;		\
-		___t = (___res < ___m) ? (1ULL << 32) : 0;		\
-		___res = (___res >> 32) + ___t;				\
-	}								\
-									\
-	if (!(___m & ((1ULL << 63) | (1ULL << 31)))) {			\
-		___res += (uint64_t)___m_lo * ___n_hi;			\
-		___res += (uint64_t)___m_hi * ___n_lo;			\
-		___res >>= 32;						\
-	} else {							\
-		___t = ___res += (uint64_t)___m_lo * ___n_hi;		\
-		___res += (uint64_t)___m_hi * ___n_lo;			\
-		___t = (___res < ___t) ? (1ULL << 32) : 0;		\
-		___res = (___res >> 32) + ___t;				\
-	}								\
-									\
-	___res += (uint64_t)___m_hi * ___n_hi;				\
+	___res = __arch_xprod_64(___m, ___n, ___bias);			\
 									\
 	___res /= ___p;							\
 })
 
+#ifndef __arch_xprod_64
+/*
+ * Default C implementation for __arch_xprod_64()
+ *
+ * Prototype: uint64_t __arch_xprod_64(const uint64_t m, uint64_t n, bool bias)
+ * Semantic:  retval = ((bias ? m : 0) + m * n) >> 64
+ *
+ * The product is a 128-bit value, scaled down to 64 bits.
+ * Assuming constant propagation to optimize away unused conditional code.
+ * Architectures may provide their own optimized assembly implementation.
+ */
+static inline uint64_t __arch_xprod_64(const uint64_t m, uint64_t n, bool bias)
+{
+	uint32_t m_lo = m;
+	uint32_t m_hi = m >> 32;
+	uint32_t n_lo = n;
+	uint32_t n_hi = n >> 32;
+	uint64_t res, tmp;
+
+	if (!bias) {
+		res = ((uint64_t)m_lo * n_lo) >> 32;
+	} else if (!(m & ((1ULL << 63) | (1ULL << 31)))) {
+		/* there can't be any overflow here */
+		res = (m + (uint64_t)m_lo * n_lo) >> 32;
+	} else {
+		res = m + (uint64_t)m_lo * n_lo;
+		tmp = (res < m) ? (1ULL << 32) : 0;
+		res = (res >> 32) + tmp;
+	}
+
+	if (!(m & ((1ULL << 63) | (1ULL << 31)))) {
+		/* there can't be any overflow here */
+		res += (uint64_t)m_lo * n_hi;
+		res += (uint64_t)m_hi * n_lo;
+		res >>= 32;
+	} else {
+		tmp = res += (uint64_t)m_lo * n_hi;
+		res += (uint64_t)m_hi * n_lo;
+		tmp = (res < tmp) ? (1ULL << 32) : 0;
+		res = (res >> 32) + tmp;
+	}
+
+	res += (uint64_t)m_hi * n_hi;
+
+	return res;
+}
+#endif
+
 extern uint32_t __div64_32(uint64_t *dividend, uint32_t divisor);
 
 /* The unnecessary pointer compare is there
-- 
2.4.3


^ permalink raw reply related	[flat|nested] 22+ messages in thread

* [PATCH 4/5] __div64_32(): make it overridable at compile time
  2015-11-02 22:33 [PATCH 0/5] 64-by-32 ddivision optimization for constant divisors on 32-bit machines Nicolas Pitre
                   ` (3 preceding siblings ...)
  2015-11-02 22:33 ` [PATCH 3/5] __div64_const32(): abstract out the actual 128-bit cross product code Nicolas Pitre
@ 2015-11-02 22:33 ` Nicolas Pitre
  2015-11-02 22:33   ` Nicolas Pitre
  2015-11-02 22:33 ` [PATCH 5/5] ARM: asm/div64.h: adjust to generic codde Nicolas Pitre
  5 siblings, 1 reply; 22+ messages in thread
From: Nicolas Pitre @ 2015-11-02 22:33 UTC (permalink / raw)
  To: Alexey Brodkin, Måns Rullgård
  Cc: Arnd Bergmann, rmk+kernel, linux-arch, linux-kernel

Some architectures may want to override the default implementation
at compile time to do things inline.  For example, ARM uses a
non-standard calling convention for better efficiency in this case.

Signed-off-by: Nicolas Pitre <nico@linaro.org>
---
 include/asm-generic/div64.h | 2 ++
 lib/div64.c                 | 6 ++++--
 2 files changed, 6 insertions(+), 2 deletions(-)

diff --git a/include/asm-generic/div64.h b/include/asm-generic/div64.h
index 34722c5a80..6daad94055 100644
--- a/include/asm-generic/div64.h
+++ b/include/asm-generic/div64.h
@@ -204,7 +204,9 @@ static inline uint64_t __arch_xprod_64(const uint64_t m, uint64_t n, bool bias)
 }
 #endif
 
+#ifndef __div64_32
 extern uint32_t __div64_32(uint64_t *dividend, uint32_t divisor);
+#endif
 
 /* The unnecessary pointer compare is there
  * to check for type safety (n must be 64bit)
diff --git a/lib/div64.c b/lib/div64.c
index 19ea7ed4b9..69df4f6d52 100644
--- a/lib/div64.c
+++ b/lib/div64.c
@@ -13,7 +13,8 @@
  *
  * Code generated for this function might be very inefficient
  * for some CPUs. __div64_32() can be overridden by linking arch-specific
- * assembly versions such as arch/ppc/lib/div64.S and arch/sh/lib/div64.S.
+ * assembly versions such as arch/ppc/lib/div64.S and arch/sh/lib/div64.S
+ * or by defining a preprocessor macro in arch/include/asm/div64.h.
  */
 
 #include <linux/export.h>
@@ -23,6 +24,7 @@
 /* Not needed on 64bit architectures */
 #if BITS_PER_LONG == 32
 
+#ifndef __div64_32
 uint32_t __attribute__((weak)) __div64_32(uint64_t *n, uint32_t base)
 {
 	uint64_t rem = *n;
@@ -55,8 +57,8 @@ uint32_t __attribute__((weak)) __div64_32(uint64_t *n, uint32_t base)
 	*n = res;
 	return rem;
 }
-
 EXPORT_SYMBOL(__div64_32);
+#endif
 
 #ifndef div_s64_rem
 s64 div_s64_rem(s64 dividend, s32 divisor, s32 *remainder)
-- 
2.4.3

^ permalink raw reply related	[flat|nested] 22+ messages in thread

* [PATCH 4/5] __div64_32(): make it overridable at compile time
  2015-11-02 22:33 ` [PATCH 4/5] __div64_32(): make it overridable at compile time Nicolas Pitre
@ 2015-11-02 22:33   ` Nicolas Pitre
  0 siblings, 0 replies; 22+ messages in thread
From: Nicolas Pitre @ 2015-11-02 22:33 UTC (permalink / raw)
  To: Alexey Brodkin, Måns Rullgård
  Cc: Arnd Bergmann, rmk+kernel, linux-arch, linux-kernel

Some architectures may want to override the default implementation
at compile time to do things inline.  For example, ARM uses a
non-standard calling convention for better efficiency in this case.

Signed-off-by: Nicolas Pitre <nico@linaro.org>
---
 include/asm-generic/div64.h | 2 ++
 lib/div64.c                 | 6 ++++--
 2 files changed, 6 insertions(+), 2 deletions(-)

diff --git a/include/asm-generic/div64.h b/include/asm-generic/div64.h
index 34722c5a80..6daad94055 100644
--- a/include/asm-generic/div64.h
+++ b/include/asm-generic/div64.h
@@ -204,7 +204,9 @@ static inline uint64_t __arch_xprod_64(const uint64_t m, uint64_t n, bool bias)
 }
 #endif
 
+#ifndef __div64_32
 extern uint32_t __div64_32(uint64_t *dividend, uint32_t divisor);
+#endif
 
 /* The unnecessary pointer compare is there
  * to check for type safety (n must be 64bit)
diff --git a/lib/div64.c b/lib/div64.c
index 19ea7ed4b9..69df4f6d52 100644
--- a/lib/div64.c
+++ b/lib/div64.c
@@ -13,7 +13,8 @@
  *
  * Code generated for this function might be very inefficient
  * for some CPUs. __div64_32() can be overridden by linking arch-specific
- * assembly versions such as arch/ppc/lib/div64.S and arch/sh/lib/div64.S.
+ * assembly versions such as arch/ppc/lib/div64.S and arch/sh/lib/div64.S
+ * or by defining a preprocessor macro in arch/include/asm/div64.h.
  */
 
 #include <linux/export.h>
@@ -23,6 +24,7 @@
 /* Not needed on 64bit architectures */
 #if BITS_PER_LONG == 32
 
+#ifndef __div64_32
 uint32_t __attribute__((weak)) __div64_32(uint64_t *n, uint32_t base)
 {
 	uint64_t rem = *n;
@@ -55,8 +57,8 @@ uint32_t __attribute__((weak)) __div64_32(uint64_t *n, uint32_t base)
 	*n = res;
 	return rem;
 }
-
 EXPORT_SYMBOL(__div64_32);
+#endif
 
 #ifndef div_s64_rem
 s64 div_s64_rem(s64 dividend, s32 divisor, s32 *remainder)
-- 
2.4.3


^ permalink raw reply related	[flat|nested] 22+ messages in thread

* [PATCH 5/5] ARM: asm/div64.h: adjust to generic codde
  2015-11-02 22:33 [PATCH 0/5] 64-by-32 ddivision optimization for constant divisors on 32-bit machines Nicolas Pitre
                   ` (4 preceding siblings ...)
  2015-11-02 22:33 ` [PATCH 4/5] __div64_32(): make it overridable at compile time Nicolas Pitre
@ 2015-11-02 22:33 ` Nicolas Pitre
  2015-11-02 22:33   ` Nicolas Pitre
                     ` (3 more replies)
  5 siblings, 4 replies; 22+ messages in thread
From: Nicolas Pitre @ 2015-11-02 22:33 UTC (permalink / raw)
  To: Alexey Brodkin, Måns Rullgård
  Cc: Arnd Bergmann, rmk+kernel, linux-arch, linux-kernel

Now that the constant divisor optimization is made generic, adapt the
ARM case to it.

Signed-off-by: Nicolas Pitre <nico@linaro.org>
---
 arch/arm/include/asm/div64.h | 283 ++++++++++++++-----------------------------
 1 file changed, 93 insertions(+), 190 deletions(-)

diff --git a/arch/arm/include/asm/div64.h b/arch/arm/include/asm/div64.h
index 662c7bd061..626bbb3671 100644
--- a/arch/arm/include/asm/div64.h
+++ b/arch/arm/include/asm/div64.h
@@ -5,9 +5,9 @@
 #include <asm/compiler.h>
 
 /*
- * The semantics of do_div() are:
+ * The semantics of __div64_32() are:
  *
- * uint32_t do_div(uint64_t *n, uint32_t base)
+ * uint32_t __div64_32(uint64_t *n, uint32_t base)
  * {
  * 	uint32_t remainder = *n % base;
  * 	*n = *n / base;
@@ -16,8 +16,9 @@
  *
  * In other words, a 64-bit dividend with a 32-bit divisor producing
  * a 64-bit result and a 32-bit remainder.  To accomplish this optimally
- * we call a special __do_div64 helper with completely non standard
- * calling convention for arguments and results (beware).
+ * we override the generic version in lib/div64.c to call our __do_div64
+ * assembly implementation with completely non standard calling convention
+ * for arguments and results (beware).
  */
 
 #ifdef __ARMEB__
@@ -28,199 +29,101 @@
 #define __xh "r1"
 #endif
 
-#define __do_div_asm(n, base)					\
-({								\
-	register unsigned int __base      asm("r4") = base;	\
-	register unsigned long long __n   asm("r0") = n;	\
-	register unsigned long long __res asm("r2");		\
-	register unsigned int __rem       asm(__xh);		\
-	asm(	__asmeq("%0", __xh)				\
-		__asmeq("%1", "r2")				\
-		__asmeq("%2", "r0")				\
-		__asmeq("%3", "r4")				\
-		"bl	__do_div64"				\
-		: "=r" (__rem), "=r" (__res)			\
-		: "r" (__n), "r" (__base)			\
-		: "ip", "lr", "cc");				\
-	n = __res;						\
-	__rem;							\
-})
-
-#if __GNUC__ < 4 || !defined(CONFIG_AEABI)
+static inline uint32_t __div64_32(uint64_t *n, uint32_t base)
+{
+	register unsigned int __base      asm("r4") = base;
+	register unsigned long long __n   asm("r0") = *n;
+	register unsigned long long __res asm("r2");
+	register unsigned int __rem       asm(__xh);
+	asm(	__asmeq("%0", __xh)
+		__asmeq("%1", "r2")
+		__asmeq("%2", "r0")
+		__asmeq("%3", "r4")
+		"bl	__do_div64"
+		: "=r" (__rem), "=r" (__res)
+		: "r" (__n), "r" (__base)
+		: "ip", "lr", "cc");
+	*n = __res;
+	return __rem;
+}
+#define __div64_32 __div64_32
+
+#if !defined(CONFIG_AEABI)
 
 /*
- * gcc versions earlier than 4.0 are simply too problematic for the
- * optimized implementation below. First there is gcc PR 15089 that
- * tend to trig on more complex constructs, spurious .global __udivsi3
- * are inserted even if none of those symbols are referenced in the
- * generated code, and those gcc versions are not able to do constant
- * propagation on long long values anyway.
+ * In OABI configurations, some uses of the do_div function
+ * cause gcc to run out of registers. To work around that,
+ * we can force the use of the out-of-line version for
+ * configurations that build a OABI kernel.
  */
-#define do_div(n, base) __do_div_asm(n, base)
-
-#elif __GNUC__ >= 4
+#define do_div(n, base) __div64_32(&(n), base)
 
-#include <asm/bug.h>
+#else
 
 /*
- * If the divisor happens to be constant, we determine the appropriate
- * inverse at compile time to turn the division into a few inline
- * multiplications instead which is much faster. And yet only if compiling
- * for ARMv4 or higher (we need umull/umlal) and if the gcc version is
- * sufficiently recent to perform proper long long constant propagation.
- * (It is unfortunate that gcc doesn't perform all this internally.)
+ * gcc versions earlier than 4.0 are simply too problematic for the
+ * __div64_const32() code in asm-generic/div64.h. First there is
+ * gcc PR 15089 that tend to trig on more complex constructs, spurious
+ * .global __udivsi3 are inserted even if none of those symbols are
+ * referenced in the generated code, and those gcc versions are not able
+ * to do constant propagation on long long values anyway.
  */
-#define do_div(n, base)							\
-({									\
-	unsigned int __r, __b = (base);					\
-	if (!__builtin_constant_p(__b) || __b == 0 ||			\
-	    (__LINUX_ARM_ARCH__ < 4 && (__b & (__b - 1)) != 0)) {	\
-		/* non-constant divisor (or zero): slow path */		\
-		__r = __do_div_asm(n, __b);				\
-	} else if ((__b & (__b - 1)) == 0) {				\
-		/* Trivial: __b is constant and a power of 2 */		\
-		/* gcc does the right thing with this code.  */		\
-		__r = n;						\
-		__r &= (__b - 1);					\
-		n /= __b;						\
-	} else {							\
-		/* Multiply by inverse of __b: n/b = n*(p/b)/p       */	\
-		/* We rely on the fact that most of this code gets   */	\
-		/* optimized away at compile time due to constant    */	\
-		/* propagation and only a couple inline assembly     */	\
-		/* instructions should remain. Better avoid any      */	\
-		/* code construct that might prevent that.           */	\
-		unsigned long long __res, __x, __t, __m, __n = n;	\
-		unsigned int __c, __p, __z = 0;				\
-		/* preserve low part of n for reminder computation */	\
-		__r = __n;						\
-		/* determine number of bits to represent __b */		\
-		__p = 1 << __div64_fls(__b);				\
-		/* compute __m = ((__p << 64) + __b - 1) / __b */	\
-		__m = (~0ULL / __b) * __p;				\
-		__m += (((~0ULL % __b + 1) * __p) + __b - 1) / __b;	\
-		/* compute __res = __m*(~0ULL/__b*__b-1)/(__p << 64) */	\
-		__x = ~0ULL / __b * __b - 1;				\
-		__res = (__m & 0xffffffff) * (__x & 0xffffffff);	\
-		__res >>= 32;						\
-		__res += (__m & 0xffffffff) * (__x >> 32);		\
-		__t = __res;						\
-		__res += (__x & 0xffffffff) * (__m >> 32);		\
-		__t = (__res < __t) ? (1ULL << 32) : 0;			\
-		__res = (__res >> 32) + __t;				\
-		__res += (__m >> 32) * (__x >> 32);			\
-		__res /= __p;						\
-		/* Now sanitize and optimize what we've got. */		\
-		if (~0ULL % (__b / (__b & -__b)) == 0) {		\
-			/* those cases can be simplified with: */	\
-			__n /= (__b & -__b);				\
-			__m = ~0ULL / (__b / (__b & -__b));		\
-			__p = 1;					\
-			__c = 1;					\
-		} else if (__res != __x / __b) {			\
-			/* We can't get away without a correction    */	\
-			/* to compensate for bit truncation errors.  */	\
-			/* To avoid it we'd need an additional bit   */	\
-			/* to represent __m which would overflow it. */	\
-			/* Instead we do m=p/b and n/b=(n*m+m)/p.    */	\
-			__c = 1;					\
-			/* Compute __m = (__p << 64) / __b */		\
-			__m = (~0ULL / __b) * __p;			\
-			__m += ((~0ULL % __b + 1) * __p) / __b;		\
-		} else {						\
-			/* Reduce __m/__p, and try to clear bit 31   */	\
-			/* of __m when possible otherwise that'll    */	\
-			/* need extra overflow handling later.       */	\
-			unsigned int __bits = -(__m & -__m);		\
-			__bits |= __m >> 32;				\
-			__bits = (~__bits) << 1;			\
-			/* If __bits == 0 then setting bit 31 is     */	\
-			/* unavoidable.  Simply apply the maximum    */	\
-			/* possible reduction in that case.          */	\
-			/* Otherwise the MSB of __bits indicates the */	\
-			/* best reduction we should apply.           */	\
-			if (!__bits) {					\
-				__p /= (__m & -__m);			\
-				__m /= (__m & -__m);			\
-			} else {					\
-				__p >>= __div64_fls(__bits);		\
-				__m >>= __div64_fls(__bits);		\
-			}						\
-			/* No correction needed. */			\
-			__c = 0;					\
-		}							\
-		/* Now we have a combination of 2 conditions:        */	\
-		/* 1) whether or not we need a correction (__c), and */	\
-		/* 2) whether or not there might be an overflow in   */	\
-		/*    the cross product (__m & ((1<<63) | (1<<31)))  */	\
-		/* Select the best insn combination to perform the   */	\
-		/* actual __m * __n / (__p << 64) operation.         */	\
-		if (!__c) {						\
-			asm (	"umull	%Q0, %R0, %Q1, %Q2\n\t"		\
-				"mov	%Q0, #0"			\
-				: "=&r" (__res)				\
-				: "r" (__m), "r" (__n)			\
-				: "cc" );				\
-		} else if (!(__m & ((1ULL << 63) | (1ULL << 31)))) {	\
-			__res = __m;					\
-			asm (	"umlal	%Q0, %R0, %Q1, %Q2\n\t"		\
-				"mov	%Q0, #0"			\
-				: "+&r" (__res)				\
-				: "r" (__m), "r" (__n)			\
-				: "cc" );				\
-		} else {						\
-			asm (	"umull	%Q0, %R0, %Q1, %Q2\n\t"		\
-				"cmn	%Q0, %Q1\n\t"			\
-				"adcs	%R0, %R0, %R1\n\t"		\
-				"adc	%Q0, %3, #0"			\
-				: "=&r" (__res)				\
-				: "r" (__m), "r" (__n), "r" (__z)	\
-				: "cc" );				\
-		}							\
-		if (!(__m & ((1ULL << 63) | (1ULL << 31)))) {		\
-			asm (	"umlal	%R0, %Q0, %R1, %Q2\n\t"		\
-				"umlal	%R0, %Q0, %Q1, %R2\n\t"		\
-				"mov	%R0, #0\n\t"			\
-				"umlal	%Q0, %R0, %R1, %R2"		\
-				: "+&r" (__res)				\
-				: "r" (__m), "r" (__n)			\
-				: "cc" );				\
-		} else {						\
-			asm (	"umlal	%R0, %Q0, %R2, %Q3\n\t"		\
-				"umlal	%R0, %1, %Q2, %R3\n\t"		\
-				"mov	%R0, #0\n\t"			\
-				"adds	%Q0, %1, %Q0\n\t"		\
-				"adc	%R0, %R0, #0\n\t"		\
-				"umlal	%Q0, %R0, %R2, %R3"		\
-				: "+&r" (__res), "+&r" (__z)		\
-				: "r" (__m), "r" (__n)			\
-				: "cc" );				\
-		}							\
-		__res /= __p;						\
-		/* The reminder can be computed with 32-bit regs     */	\
-		/* only, and gcc is good at that.                    */	\
-		{							\
-			unsigned int __res0 = __res;			\
-			unsigned int __b0 = __b;			\
-			__r -= __res0 * __b0;				\
-		}							\
-		/* BUG_ON(__r >= __b || __res * __b + __r != n); */	\
-		n = __res;						\
-	}								\
-	__r;								\
-})
-
-/* our own fls implementation to make sure constant propagation is fine */
-#define __div64_fls(bits)						\
-({									\
-	unsigned int __left = (bits), __nr = 0;				\
-	if (__left & 0xffff0000) __nr += 16, __left >>= 16;		\
-	if (__left & 0x0000ff00) __nr +=  8, __left >>=  8;		\
-	if (__left & 0x000000f0) __nr +=  4, __left >>=  4;		\
-	if (__left & 0x0000000c) __nr +=  2, __left >>=  2;		\
-	if (__left & 0x00000002) __nr +=  1;				\
-	__nr;								\
-})
+
+#define __div64_const32_is_OK (__GNUC__ >= 4)
+
+static inline uint64_t __arch_xprod_64(uint64_t m, uint64_t n, bool bias)
+{
+	unsigned long long res;
+	unsigned int tmp = 0;
+
+	if (!bias) {
+		asm (	"umull	%Q0, %R0, %Q1, %Q2\n\t"
+			"mov	%Q0, #0"
+			: "=&r" (res)
+			: "r" (m), "r" (n)
+			: "cc");
+	} else if (!(m & ((1ULL << 63) | (1ULL << 31)))) {
+		res = m;
+		asm (	"umlal	%Q0, %R0, %Q1, %Q2\n\t"
+			"mov	%Q0, #0"
+			: "+&r" (res)
+			: "r" (m), "r" (n)
+			: "cc");
+	} else {
+		asm (	"umull	%Q0, %R0, %Q2, %Q3\n\t"
+			"cmn	%Q0, %Q2\n\t"
+			"adcs	%R0, %R0, %R2\n\t"
+			"adc	%Q0, %1, #0"
+			: "=&r" (res), "+&r" (tmp)
+			: "r" (m), "r" (n)
+			: "cc");
+	}
+
+	if (!(m & ((1ULL << 63) | (1ULL << 31)))) {
+		asm (	"umlal	%R0, %Q0, %R1, %Q2\n\t"
+			"umlal	%R0, %Q0, %Q1, %R2\n\t"
+			"mov	%R0, #0\n\t"
+			"umlal	%Q0, %R0, %R1, %R2"
+			: "+&r" (res)
+			: "r" (m), "r" (n)
+			: "cc");
+	} else {
+		asm (	"umlal	%R0, %Q0, %R2, %Q3\n\t"
+			"umlal	%R0, %1, %Q2, %R3\n\t"
+			"mov	%R0, #0\n\t"
+			"adds	%Q0, %1, %Q0\n\t"
+			"adc	%R0, %R0, #0\n\t"
+			"umlal	%Q0, %R0, %R2, %R3"
+			: "+&r" (res), "+&r" (tmp)
+			: "r" (m), "r" (n)
+			: "cc");
+	}
+
+	return res;
+}
+#define __arch_xprod_64 __arch_xprod_64
+
+#include <asm-generic/div64.h>
 
 #endif
 
-- 
2.4.3

^ permalink raw reply related	[flat|nested] 22+ messages in thread

* [PATCH 5/5] ARM: asm/div64.h: adjust to generic codde
  2015-11-02 22:33 ` [PATCH 5/5] ARM: asm/div64.h: adjust to generic codde Nicolas Pitre
@ 2015-11-02 22:33   ` Nicolas Pitre
  2015-11-03  1:25   ` kbuild test robot
                     ` (2 subsequent siblings)
  3 siblings, 0 replies; 22+ messages in thread
From: Nicolas Pitre @ 2015-11-02 22:33 UTC (permalink / raw)
  To: Alexey Brodkin, Måns Rullgård
  Cc: Arnd Bergmann, rmk+kernel, linux-arch, linux-kernel

Now that the constant divisor optimization is made generic, adapt the
ARM case to it.

Signed-off-by: Nicolas Pitre <nico@linaro.org>
---
 arch/arm/include/asm/div64.h | 283 ++++++++++++++-----------------------------
 1 file changed, 93 insertions(+), 190 deletions(-)

diff --git a/arch/arm/include/asm/div64.h b/arch/arm/include/asm/div64.h
index 662c7bd061..626bbb3671 100644
--- a/arch/arm/include/asm/div64.h
+++ b/arch/arm/include/asm/div64.h
@@ -5,9 +5,9 @@
 #include <asm/compiler.h>
 
 /*
- * The semantics of do_div() are:
+ * The semantics of __div64_32() are:
  *
- * uint32_t do_div(uint64_t *n, uint32_t base)
+ * uint32_t __div64_32(uint64_t *n, uint32_t base)
  * {
  * 	uint32_t remainder = *n % base;
  * 	*n = *n / base;
@@ -16,8 +16,9 @@
  *
  * In other words, a 64-bit dividend with a 32-bit divisor producing
  * a 64-bit result and a 32-bit remainder.  To accomplish this optimally
- * we call a special __do_div64 helper with completely non standard
- * calling convention for arguments and results (beware).
+ * we override the generic version in lib/div64.c to call our __do_div64
+ * assembly implementation with completely non standard calling convention
+ * for arguments and results (beware).
  */
 
 #ifdef __ARMEB__
@@ -28,199 +29,101 @@
 #define __xh "r1"
 #endif
 
-#define __do_div_asm(n, base)					\
-({								\
-	register unsigned int __base      asm("r4") = base;	\
-	register unsigned long long __n   asm("r0") = n;	\
-	register unsigned long long __res asm("r2");		\
-	register unsigned int __rem       asm(__xh);		\
-	asm(	__asmeq("%0", __xh)				\
-		__asmeq("%1", "r2")				\
-		__asmeq("%2", "r0")				\
-		__asmeq("%3", "r4")				\
-		"bl	__do_div64"				\
-		: "=r" (__rem), "=r" (__res)			\
-		: "r" (__n), "r" (__base)			\
-		: "ip", "lr", "cc");				\
-	n = __res;						\
-	__rem;							\
-})
-
-#if __GNUC__ < 4 || !defined(CONFIG_AEABI)
+static inline uint32_t __div64_32(uint64_t *n, uint32_t base)
+{
+	register unsigned int __base      asm("r4") = base;
+	register unsigned long long __n   asm("r0") = *n;
+	register unsigned long long __res asm("r2");
+	register unsigned int __rem       asm(__xh);
+	asm(	__asmeq("%0", __xh)
+		__asmeq("%1", "r2")
+		__asmeq("%2", "r0")
+		__asmeq("%3", "r4")
+		"bl	__do_div64"
+		: "=r" (__rem), "=r" (__res)
+		: "r" (__n), "r" (__base)
+		: "ip", "lr", "cc");
+	*n = __res;
+	return __rem;
+}
+#define __div64_32 __div64_32
+
+#if !defined(CONFIG_AEABI)
 
 /*
- * gcc versions earlier than 4.0 are simply too problematic for the
- * optimized implementation below. First there is gcc PR 15089 that
- * tend to trig on more complex constructs, spurious .global __udivsi3
- * are inserted even if none of those symbols are referenced in the
- * generated code, and those gcc versions are not able to do constant
- * propagation on long long values anyway.
+ * In OABI configurations, some uses of the do_div function
+ * cause gcc to run out of registers. To work around that,
+ * we can force the use of the out-of-line version for
+ * configurations that build a OABI kernel.
  */
-#define do_div(n, base) __do_div_asm(n, base)
-
-#elif __GNUC__ >= 4
+#define do_div(n, base) __div64_32(&(n), base)
 
-#include <asm/bug.h>
+#else
 
 /*
- * If the divisor happens to be constant, we determine the appropriate
- * inverse at compile time to turn the division into a few inline
- * multiplications instead which is much faster. And yet only if compiling
- * for ARMv4 or higher (we need umull/umlal) and if the gcc version is
- * sufficiently recent to perform proper long long constant propagation.
- * (It is unfortunate that gcc doesn't perform all this internally.)
+ * gcc versions earlier than 4.0 are simply too problematic for the
+ * __div64_const32() code in asm-generic/div64.h. First there is
+ * gcc PR 15089 that tend to trig on more complex constructs, spurious
+ * .global __udivsi3 are inserted even if none of those symbols are
+ * referenced in the generated code, and those gcc versions are not able
+ * to do constant propagation on long long values anyway.
  */
-#define do_div(n, base)							\
-({									\
-	unsigned int __r, __b = (base);					\
-	if (!__builtin_constant_p(__b) || __b == 0 ||			\
-	    (__LINUX_ARM_ARCH__ < 4 && (__b & (__b - 1)) != 0)) {	\
-		/* non-constant divisor (or zero): slow path */		\
-		__r = __do_div_asm(n, __b);				\
-	} else if ((__b & (__b - 1)) == 0) {				\
-		/* Trivial: __b is constant and a power of 2 */		\
-		/* gcc does the right thing with this code.  */		\
-		__r = n;						\
-		__r &= (__b - 1);					\
-		n /= __b;						\
-	} else {							\
-		/* Multiply by inverse of __b: n/b = n*(p/b)/p       */	\
-		/* We rely on the fact that most of this code gets   */	\
-		/* optimized away at compile time due to constant    */	\
-		/* propagation and only a couple inline assembly     */	\
-		/* instructions should remain. Better avoid any      */	\
-		/* code construct that might prevent that.           */	\
-		unsigned long long __res, __x, __t, __m, __n = n;	\
-		unsigned int __c, __p, __z = 0;				\
-		/* preserve low part of n for reminder computation */	\
-		__r = __n;						\
-		/* determine number of bits to represent __b */		\
-		__p = 1 << __div64_fls(__b);				\
-		/* compute __m = ((__p << 64) + __b - 1) / __b */	\
-		__m = (~0ULL / __b) * __p;				\
-		__m += (((~0ULL % __b + 1) * __p) + __b - 1) / __b;	\
-		/* compute __res = __m*(~0ULL/__b*__b-1)/(__p << 64) */	\
-		__x = ~0ULL / __b * __b - 1;				\
-		__res = (__m & 0xffffffff) * (__x & 0xffffffff);	\
-		__res >>= 32;						\
-		__res += (__m & 0xffffffff) * (__x >> 32);		\
-		__t = __res;						\
-		__res += (__x & 0xffffffff) * (__m >> 32);		\
-		__t = (__res < __t) ? (1ULL << 32) : 0;			\
-		__res = (__res >> 32) + __t;				\
-		__res += (__m >> 32) * (__x >> 32);			\
-		__res /= __p;						\
-		/* Now sanitize and optimize what we've got. */		\
-		if (~0ULL % (__b / (__b & -__b)) == 0) {		\
-			/* those cases can be simplified with: */	\
-			__n /= (__b & -__b);				\
-			__m = ~0ULL / (__b / (__b & -__b));		\
-			__p = 1;					\
-			__c = 1;					\
-		} else if (__res != __x / __b) {			\
-			/* We can't get away without a correction    */	\
-			/* to compensate for bit truncation errors.  */	\
-			/* To avoid it we'd need an additional bit   */	\
-			/* to represent __m which would overflow it. */	\
-			/* Instead we do m=p/b and n/b=(n*m+m)/p.    */	\
-			__c = 1;					\
-			/* Compute __m = (__p << 64) / __b */		\
-			__m = (~0ULL / __b) * __p;			\
-			__m += ((~0ULL % __b + 1) * __p) / __b;		\
-		} else {						\
-			/* Reduce __m/__p, and try to clear bit 31   */	\
-			/* of __m when possible otherwise that'll    */	\
-			/* need extra overflow handling later.       */	\
-			unsigned int __bits = -(__m & -__m);		\
-			__bits |= __m >> 32;				\
-			__bits = (~__bits) << 1;			\
-			/* If __bits == 0 then setting bit 31 is     */	\
-			/* unavoidable.  Simply apply the maximum    */	\
-			/* possible reduction in that case.          */	\
-			/* Otherwise the MSB of __bits indicates the */	\
-			/* best reduction we should apply.           */	\
-			if (!__bits) {					\
-				__p /= (__m & -__m);			\
-				__m /= (__m & -__m);			\
-			} else {					\
-				__p >>= __div64_fls(__bits);		\
-				__m >>= __div64_fls(__bits);		\
-			}						\
-			/* No correction needed. */			\
-			__c = 0;					\
-		}							\
-		/* Now we have a combination of 2 conditions:        */	\
-		/* 1) whether or not we need a correction (__c), and */	\
-		/* 2) whether or not there might be an overflow in   */	\
-		/*    the cross product (__m & ((1<<63) | (1<<31)))  */	\
-		/* Select the best insn combination to perform the   */	\
-		/* actual __m * __n / (__p << 64) operation.         */	\
-		if (!__c) {						\
-			asm (	"umull	%Q0, %R0, %Q1, %Q2\n\t"		\
-				"mov	%Q0, #0"			\
-				: "=&r" (__res)				\
-				: "r" (__m), "r" (__n)			\
-				: "cc" );				\
-		} else if (!(__m & ((1ULL << 63) | (1ULL << 31)))) {	\
-			__res = __m;					\
-			asm (	"umlal	%Q0, %R0, %Q1, %Q2\n\t"		\
-				"mov	%Q0, #0"			\
-				: "+&r" (__res)				\
-				: "r" (__m), "r" (__n)			\
-				: "cc" );				\
-		} else {						\
-			asm (	"umull	%Q0, %R0, %Q1, %Q2\n\t"		\
-				"cmn	%Q0, %Q1\n\t"			\
-				"adcs	%R0, %R0, %R1\n\t"		\
-				"adc	%Q0, %3, #0"			\
-				: "=&r" (__res)				\
-				: "r" (__m), "r" (__n), "r" (__z)	\
-				: "cc" );				\
-		}							\
-		if (!(__m & ((1ULL << 63) | (1ULL << 31)))) {		\
-			asm (	"umlal	%R0, %Q0, %R1, %Q2\n\t"		\
-				"umlal	%R0, %Q0, %Q1, %R2\n\t"		\
-				"mov	%R0, #0\n\t"			\
-				"umlal	%Q0, %R0, %R1, %R2"		\
-				: "+&r" (__res)				\
-				: "r" (__m), "r" (__n)			\
-				: "cc" );				\
-		} else {						\
-			asm (	"umlal	%R0, %Q0, %R2, %Q3\n\t"		\
-				"umlal	%R0, %1, %Q2, %R3\n\t"		\
-				"mov	%R0, #0\n\t"			\
-				"adds	%Q0, %1, %Q0\n\t"		\
-				"adc	%R0, %R0, #0\n\t"		\
-				"umlal	%Q0, %R0, %R2, %R3"		\
-				: "+&r" (__res), "+&r" (__z)		\
-				: "r" (__m), "r" (__n)			\
-				: "cc" );				\
-		}							\
-		__res /= __p;						\
-		/* The reminder can be computed with 32-bit regs     */	\
-		/* only, and gcc is good at that.                    */	\
-		{							\
-			unsigned int __res0 = __res;			\
-			unsigned int __b0 = __b;			\
-			__r -= __res0 * __b0;				\
-		}							\
-		/* BUG_ON(__r >= __b || __res * __b + __r != n); */	\
-		n = __res;						\
-	}								\
-	__r;								\
-})
-
-/* our own fls implementation to make sure constant propagation is fine */
-#define __div64_fls(bits)						\
-({									\
-	unsigned int __left = (bits), __nr = 0;				\
-	if (__left & 0xffff0000) __nr += 16, __left >>= 16;		\
-	if (__left & 0x0000ff00) __nr +=  8, __left >>=  8;		\
-	if (__left & 0x000000f0) __nr +=  4, __left >>=  4;		\
-	if (__left & 0x0000000c) __nr +=  2, __left >>=  2;		\
-	if (__left & 0x00000002) __nr +=  1;				\
-	__nr;								\
-})
+
+#define __div64_const32_is_OK (__GNUC__ >= 4)
+
+static inline uint64_t __arch_xprod_64(uint64_t m, uint64_t n, bool bias)
+{
+	unsigned long long res;
+	unsigned int tmp = 0;
+
+	if (!bias) {
+		asm (	"umull	%Q0, %R0, %Q1, %Q2\n\t"
+			"mov	%Q0, #0"
+			: "=&r" (res)
+			: "r" (m), "r" (n)
+			: "cc");
+	} else if (!(m & ((1ULL << 63) | (1ULL << 31)))) {
+		res = m;
+		asm (	"umlal	%Q0, %R0, %Q1, %Q2\n\t"
+			"mov	%Q0, #0"
+			: "+&r" (res)
+			: "r" (m), "r" (n)
+			: "cc");
+	} else {
+		asm (	"umull	%Q0, %R0, %Q2, %Q3\n\t"
+			"cmn	%Q0, %Q2\n\t"
+			"adcs	%R0, %R0, %R2\n\t"
+			"adc	%Q0, %1, #0"
+			: "=&r" (res), "+&r" (tmp)
+			: "r" (m), "r" (n)
+			: "cc");
+	}
+
+	if (!(m & ((1ULL << 63) | (1ULL << 31)))) {
+		asm (	"umlal	%R0, %Q0, %R1, %Q2\n\t"
+			"umlal	%R0, %Q0, %Q1, %R2\n\t"
+			"mov	%R0, #0\n\t"
+			"umlal	%Q0, %R0, %R1, %R2"
+			: "+&r" (res)
+			: "r" (m), "r" (n)
+			: "cc");
+	} else {
+		asm (	"umlal	%R0, %Q0, %R2, %Q3\n\t"
+			"umlal	%R0, %1, %Q2, %R3\n\t"
+			"mov	%R0, #0\n\t"
+			"adds	%Q0, %1, %Q0\n\t"
+			"adc	%R0, %R0, #0\n\t"
+			"umlal	%Q0, %R0, %R2, %R3"
+			: "+&r" (res), "+&r" (tmp)
+			: "r" (m), "r" (n)
+			: "cc");
+	}
+
+	return res;
+}
+#define __arch_xprod_64 __arch_xprod_64
+
+#include <asm-generic/div64.h>
 
 #endif
 
-- 
2.4.3


^ permalink raw reply related	[flat|nested] 22+ messages in thread

* Re: [PATCH 5/5] ARM: asm/div64.h: adjust to generic codde
  2015-11-02 22:33 ` [PATCH 5/5] ARM: asm/div64.h: adjust to generic codde Nicolas Pitre
  2015-11-02 22:33   ` Nicolas Pitre
@ 2015-11-03  1:25   ` kbuild test robot
  2015-11-03  4:03     ` Nicolas Pitre
  2015-11-03 21:39   ` kbuild test robot
  2015-11-19 16:36   ` Måns Rullgård
  3 siblings, 1 reply; 22+ messages in thread
From: kbuild test robot @ 2015-11-03  1:25 UTC (permalink / raw)
  To: Nicolas Pitre
  Cc: kbuild-all, Alexey Brodkin, Måns Rullgård,
	Arnd Bergmann, rmk+kernel, linux-arch, linux-kernel

[-- Attachment #1: Type: text/plain, Size: 20753 bytes --]

Hi Nicolas,

[auto build test WARNING on asm-generic/master -- if it's inappropriate base, please suggest rules for selecting the more suitable base]

url:    https://github.com/0day-ci/linux/commits/Nicolas-Pitre/div64-h-optimize-do_div-for-power-of-two-constant-divisors/20151103-065348
config: arm-multi_v7_defconfig (attached as .config)
reproduce:
        wget https://git.kernel.org/cgit/linux/kernel/git/wfg/lkp-tests.git/plain/sbin/make.cross -O ~/bin/make.cross
        chmod +x ~/bin/make.cross
        # save the attached .config to linux build tree
        make.cross ARCH=arm 

All warnings (new ones prefixed by >>):

   In file included from arch/arm/include/asm/div64.h:126:0,
                    from include/linux/kernel.h:136,
                    from include/asm-generic/bug.h:13,
                    from arch/arm/include/asm/bug.h:62,
                    from include/linux/bug.h:4,
                    from include/linux/io.h:23,
                    from include/linux/clk-provider.h:14,
                    from drivers/clk/imx/clk-pllv1.c:1:
   drivers/clk/imx/clk-pllv1.c: In function 'clk_pllv1_recalc_rate':
   include/asm-generic/div64.h:217:28: warning: comparison of distinct pointer types lacks a cast
     (void)(((typeof((n)) *)0) == ((uint64_t *)0)); \
                               ^
>> drivers/clk/imx/clk-pllv1.c:99:2: note: in expansion of macro 'do_div'
     do_div(ll, mfd + 1);
     ^
--
   In file included from arch/arm/include/asm/div64.h:126:0,
                    from include/linux/kernel.h:136,
                    from drivers/clk/imx/clk-pllv2.c:1:
   drivers/clk/imx/clk-pllv2.c: In function '__clk_pllv2_recalc_rate':
   include/asm-generic/div64.h:217:28: warning: comparison of distinct pointer types lacks a cast
     (void)(((typeof((n)) *)0) == ((uint64_t *)0)); \
                               ^
>> drivers/clk/imx/clk-pllv2.c:103:2: note: in expansion of macro 'do_div'
     do_div(temp, mfd + 1);
     ^
   drivers/clk/imx/clk-pllv2.c: In function '__clk_pllv2_set_rate':
   include/asm-generic/div64.h:217:28: warning: comparison of distinct pointer types lacks a cast
     (void)(((typeof((n)) *)0) == ((uint64_t *)0)); \
                               ^
   drivers/clk/imx/clk-pllv2.c:145:2: note: in expansion of macro 'do_div'
     do_div(temp64, quad_parent_rate / 1000000);
     ^
--
   In file included from arch/arm/include/asm/div64.h:126:0,
                    from include/linux/kernel.h:136,
                    from drivers/clk/tegra/clk-divider.c:17:
   drivers/clk/tegra/clk-divider.c: In function 'get_div':
   include/asm-generic/div64.h:217:28: warning: comparison of distinct pointer types lacks a cast
     (void)(((typeof((n)) *)0) == ((uint64_t *)0)); \
                               ^
>> drivers/clk/tegra/clk-divider.c:50:2: note: in expansion of macro 'do_div'
     do_div(divider_ux1, rate);
     ^
--
   In file included from arch/arm/include/asm/div64.h:126:0,
                    from include/linux/kernel.h:136,
                    from drivers/clk/ti/clkt_dpll.c:17:
   drivers/clk/ti/clkt_dpll.c: In function 'omap2_get_dpll_rate':
   include/asm-generic/div64.h:217:28: warning: comparison of distinct pointer types lacks a cast
     (void)(((typeof((n)) *)0) == ((uint64_t *)0)); \
                               ^
>> drivers/clk/ti/clkt_dpll.c:266:2: note: in expansion of macro 'do_div'
     do_div(dpll_clk, dpll_div + 1);
     ^
--
   In file included from arch/arm/include/asm/div64.h:126:0,
                    from include/linux/kernel.h:136,
                    from include/linux/clk.h:16,
                    from drivers/clk/ti/fapll.c:12:
   drivers/clk/ti/fapll.c: In function 'ti_fapll_recalc_rate':
   include/asm-generic/div64.h:217:28: warning: comparison of distinct pointer types lacks a cast
     (void)(((typeof((n)) *)0) == ((uint64_t *)0)); \
                               ^
>> drivers/clk/ti/fapll.c:182:3: note: in expansion of macro 'do_div'
      do_div(rate, fapll_p);
      ^
   drivers/clk/ti/fapll.c: In function 'ti_fapll_synth_recalc_rate':
   include/asm-generic/div64.h:217:28: warning: comparison of distinct pointer types lacks a cast
     (void)(((typeof((n)) *)0) == ((uint64_t *)0)); \
                               ^
   drivers/clk/ti/fapll.c:346:3: note: in expansion of macro 'do_div'
      do_div(rate, synth_div_freq);
      ^
--
   In file included from arch/arm/include/asm/div64.h:126:0,
                    from include/linux/kernel.h:136,
                    from include/linux/list.h:8,
                    from include/linux/preempt.h:10,
                    from include/linux/spinlock.h:50,
                    from include/linux/mmzone.h:7,
                    from include/linux/gfp.h:5,
                    from include/linux/slab.h:14,
                    from drivers/gpu/drm/nouveau/include/nvif/os.h:5,
                    from drivers/gpu/drm/nouveau/include/nvkm/core/os.h:3,
                    from drivers/gpu/drm/nouveau/include/nvkm/core/event.h:3,
                    from drivers/gpu/drm/nouveau/include/nvkm/core/device.h:3,
                    from drivers/gpu/drm/nouveau/include/nvkm/core/subdev.h:3,
                    from drivers/gpu/drm/nouveau/include/nvkm/subdev/clk.h:3,
                    from drivers/gpu/drm/nouveau/nvkm/subdev/clk/priv.h:4,
                    from drivers/gpu/drm/nouveau/nvkm/subdev/clk/gk20a.c:26:
   drivers/gpu/drm/nouveau/nvkm/subdev/clk/gk20a.c: In function 'gk20a_pllg_calc_rate':
   include/asm-generic/div64.h:217:28: warning: comparison of distinct pointer types lacks a cast
     (void)(((typeof((n)) *)0) == ((uint64_t *)0)); \
                               ^
   drivers/gpu/drm/nouveau/nvkm/subdev/clk/gk20a.c:144:2: note: in expansion of macro 'do_div'
     do_div(rate, divider);
     ^
   drivers/gpu/drm/nouveau/nvkm/subdev/clk/gk20a.c:144:2: warning: right shift count >= width of type
   In file included from include/linux/kernel.h:136:0,
                    from include/linux/list.h:8,
                    from include/linux/preempt.h:10,
                    from include/linux/spinlock.h:50,
                    from include/linux/mmzone.h:7,
                    from include/linux/gfp.h:5,
                    from include/linux/slab.h:14,
                    from drivers/gpu/drm/nouveau/include/nvif/os.h:5,
                    from drivers/gpu/drm/nouveau/include/nvkm/core/os.h:3,
                    from drivers/gpu/drm/nouveau/include/nvkm/core/event.h:3,
                    from drivers/gpu/drm/nouveau/include/nvkm/core/device.h:3,
                    from drivers/gpu/drm/nouveau/include/nvkm/core/subdev.h:3,
                    from drivers/gpu/drm/nouveau/include/nvkm/subdev/clk.h:3,
                    from drivers/gpu/drm/nouveau/nvkm/subdev/clk/priv.h:4,
                    from drivers/gpu/drm/nouveau/nvkm/subdev/clk/gk20a.c:26:
>> arch/arm/include/asm/div64.h:49:20: warning: passing argument 1 of '__div64_32' from incompatible pointer type
    #define __div64_32 __div64_32
                       ^
>> include/asm-generic/div64.h:235:11: note: in expansion of macro '__div64_32'
      __rem = __div64_32(&(n), __base); \
              ^
   drivers/gpu/drm/nouveau/nvkm/subdev/clk/gk20a.c:144:2: note: in expansion of macro 'do_div'
     do_div(rate, divider);
     ^
   arch/arm/include/asm/div64.h:32:24: note: expected 'uint64_t *' but argument is of type 'u32 *'
    static inline uint32_t __div64_32(uint64_t *n, uint32_t base)
                           ^
--
   In file included from arch/arm/include/asm/div64.h:126:0,
                    from include/linux/kernel.h:136,
                    from include/linux/list.h:8,
                    from include/linux/preempt.h:10,
                    from include/linux/spinlock.h:50,
                    from include/linux/mmzone.h:7,
                    from include/linux/gfp.h:5,
                    from include/linux/slab.h:14,
                    from drivers/gpu/drm//nouveau/include/nvif/os.h:5,
                    from drivers/gpu/drm//nouveau/include/nvkm/core/os.h:3,
                    from drivers/gpu/drm//nouveau/include/nvkm/core/event.h:3,
                    from drivers/gpu/drm//nouveau/include/nvkm/core/device.h:3,
                    from drivers/gpu/drm//nouveau/include/nvkm/core/subdev.h:3,
                    from drivers/gpu/drm//nouveau/include/nvkm/subdev/clk.h:3,
                    from drivers/gpu/drm//nouveau/nvkm/subdev/clk/priv.h:4,
                    from drivers/gpu/drm//nouveau/nvkm/subdev/clk/gk20a.c:26:
   drivers/gpu/drm//nouveau/nvkm/subdev/clk/gk20a.c: In function 'gk20a_pllg_calc_rate':
   include/asm-generic/div64.h:217:28: warning: comparison of distinct pointer types lacks a cast
     (void)(((typeof((n)) *)0) == ((uint64_t *)0)); \
                               ^
   drivers/gpu/drm//nouveau/nvkm/subdev/clk/gk20a.c:144:2: note: in expansion of macro 'do_div'
     do_div(rate, divider);
     ^
   drivers/gpu/drm//nouveau/nvkm/subdev/clk/gk20a.c:144:2: warning: right shift count >= width of type
   In file included from include/linux/kernel.h:136:0,
                    from include/linux/list.h:8,
                    from include/linux/preempt.h:10,
                    from include/linux/spinlock.h:50,
                    from include/linux/mmzone.h:7,
                    from include/linux/gfp.h:5,
                    from include/linux/slab.h:14,
                    from drivers/gpu/drm//nouveau/include/nvif/os.h:5,
                    from drivers/gpu/drm//nouveau/include/nvkm/core/os.h:3,
                    from drivers/gpu/drm//nouveau/include/nvkm/core/event.h:3,
                    from drivers/gpu/drm//nouveau/include/nvkm/core/device.h:3,
                    from drivers/gpu/drm//nouveau/include/nvkm/core/subdev.h:3,
                    from drivers/gpu/drm//nouveau/include/nvkm/subdev/clk.h:3,
                    from drivers/gpu/drm//nouveau/nvkm/subdev/clk/priv.h:4,
                    from drivers/gpu/drm//nouveau/nvkm/subdev/clk/gk20a.c:26:
>> arch/arm/include/asm/div64.h:49:20: warning: passing argument 1 of '__div64_32' from incompatible pointer type
    #define __div64_32 __div64_32
                       ^
>> include/asm-generic/div64.h:235:11: note: in expansion of macro '__div64_32'
      __rem = __div64_32(&(n), __base); \
              ^
   drivers/gpu/drm//nouveau/nvkm/subdev/clk/gk20a.c:144:2: note: in expansion of macro 'do_div'
     do_div(rate, divider);
     ^
   arch/arm/include/asm/div64.h:32:24: note: expected 'uint64_t *' but argument is of type 'u32 *'
    static inline uint32_t __div64_32(uint64_t *n, uint32_t base)
                           ^

vim +/do_div +99 drivers/clk/imx/clk-pllv1.c

2af9e6db arch/arm/mach-imx/clk-pllv1.c Sascha Hauer     2012-03-09   @1  #include <linux/clk-provider.h>
2af9e6db arch/arm/mach-imx/clk-pllv1.c Sascha Hauer     2012-03-09    2  #include <linux/io.h>
2af9e6db arch/arm/mach-imx/clk-pllv1.c Sascha Hauer     2012-03-09    3  #include <linux/slab.h>
2af9e6db arch/arm/mach-imx/clk-pllv1.c Sascha Hauer     2012-03-09    4  #include <linux/kernel.h>
2af9e6db arch/arm/mach-imx/clk-pllv1.c Sascha Hauer     2012-03-09    5  #include <linux/err.h>
3a84d17b arch/arm/mach-imx/clk-pllv1.c Sascha Hauer     2012-09-11    6  
2af9e6db arch/arm/mach-imx/clk-pllv1.c Sascha Hauer     2012-03-09    7  #include "clk.h"
2af9e6db arch/arm/mach-imx/clk-pllv1.c Sascha Hauer     2012-03-09    8  
2af9e6db arch/arm/mach-imx/clk-pllv1.c Sascha Hauer     2012-03-09    9  /**
2af9e6db arch/arm/mach-imx/clk-pllv1.c Sascha Hauer     2012-03-09   10   * pll v1
2af9e6db arch/arm/mach-imx/clk-pllv1.c Sascha Hauer     2012-03-09   11   *
2af9e6db arch/arm/mach-imx/clk-pllv1.c Sascha Hauer     2012-03-09   12   * @clk_hw	clock source
2af9e6db arch/arm/mach-imx/clk-pllv1.c Sascha Hauer     2012-03-09   13   * @parent	the parent clock name
2af9e6db arch/arm/mach-imx/clk-pllv1.c Sascha Hauer     2012-03-09   14   * @base	base address of pll registers
2af9e6db arch/arm/mach-imx/clk-pllv1.c Sascha Hauer     2012-03-09   15   *
2af9e6db arch/arm/mach-imx/clk-pllv1.c Sascha Hauer     2012-03-09   16   * PLL clock version 1, found on i.MX1/21/25/27/31/35
2af9e6db arch/arm/mach-imx/clk-pllv1.c Sascha Hauer     2012-03-09   17   */
a5947903 arch/arm/mach-imx/clk-pllv1.c Alexander Shiyan 2013-11-10   18  
a5947903 arch/arm/mach-imx/clk-pllv1.c Alexander Shiyan 2013-11-10   19  #define MFN_BITS	(10)
a5947903 arch/arm/mach-imx/clk-pllv1.c Alexander Shiyan 2013-11-10   20  #define MFN_SIGN	(BIT(MFN_BITS - 1))
a5947903 arch/arm/mach-imx/clk-pllv1.c Alexander Shiyan 2013-11-10   21  #define MFN_MASK	(MFN_SIGN - 1)
a5947903 arch/arm/mach-imx/clk-pllv1.c Alexander Shiyan 2013-11-10   22  
2af9e6db arch/arm/mach-imx/clk-pllv1.c Sascha Hauer     2012-03-09   23  struct clk_pllv1 {
2af9e6db arch/arm/mach-imx/clk-pllv1.c Sascha Hauer     2012-03-09   24  	struct clk_hw	hw;
2af9e6db arch/arm/mach-imx/clk-pllv1.c Sascha Hauer     2012-03-09   25  	void __iomem	*base;
3bec5f81 arch/arm/mach-imx/clk-pllv1.c Shawn Guo        2015-04-26   26  	enum imx_pllv1_type type;
2af9e6db arch/arm/mach-imx/clk-pllv1.c Sascha Hauer     2012-03-09   27  };
2af9e6db arch/arm/mach-imx/clk-pllv1.c Sascha Hauer     2012-03-09   28  
2af9e6db arch/arm/mach-imx/clk-pllv1.c Sascha Hauer     2012-03-09   29  #define to_clk_pllv1(clk) (container_of(clk, struct clk_pllv1, clk))
2af9e6db arch/arm/mach-imx/clk-pllv1.c Sascha Hauer     2012-03-09   30  
3bec5f81 arch/arm/mach-imx/clk-pllv1.c Shawn Guo        2015-04-26   31  static inline bool is_imx1_pllv1(struct clk_pllv1 *pll)
a5947903 arch/arm/mach-imx/clk-pllv1.c Alexander Shiyan 2013-11-10   32  {
3bec5f81 arch/arm/mach-imx/clk-pllv1.c Shawn Guo        2015-04-26   33  	return pll->type == IMX_PLLV1_IMX1;
3bec5f81 arch/arm/mach-imx/clk-pllv1.c Shawn Guo        2015-04-26   34  }
3bec5f81 arch/arm/mach-imx/clk-pllv1.c Shawn Guo        2015-04-26   35  
3bec5f81 arch/arm/mach-imx/clk-pllv1.c Shawn Guo        2015-04-26   36  static inline bool is_imx21_pllv1(struct clk_pllv1 *pll)
3bec5f81 arch/arm/mach-imx/clk-pllv1.c Shawn Guo        2015-04-26   37  {
3bec5f81 arch/arm/mach-imx/clk-pllv1.c Shawn Guo        2015-04-26   38  	return pll->type == IMX_PLLV1_IMX21;
3bec5f81 arch/arm/mach-imx/clk-pllv1.c Shawn Guo        2015-04-26   39  }
3bec5f81 arch/arm/mach-imx/clk-pllv1.c Shawn Guo        2015-04-26   40  
3bec5f81 arch/arm/mach-imx/clk-pllv1.c Shawn Guo        2015-04-26   41  static inline bool is_imx27_pllv1(struct clk_pllv1 *pll)
3bec5f81 arch/arm/mach-imx/clk-pllv1.c Shawn Guo        2015-04-26   42  {
3bec5f81 arch/arm/mach-imx/clk-pllv1.c Shawn Guo        2015-04-26   43  	return pll->type == IMX_PLLV1_IMX27;
3bec5f81 arch/arm/mach-imx/clk-pllv1.c Shawn Guo        2015-04-26   44  }
3bec5f81 arch/arm/mach-imx/clk-pllv1.c Shawn Guo        2015-04-26   45  
3bec5f81 arch/arm/mach-imx/clk-pllv1.c Shawn Guo        2015-04-26   46  static inline bool mfn_is_negative(struct clk_pllv1 *pll, unsigned int mfn)
3bec5f81 arch/arm/mach-imx/clk-pllv1.c Shawn Guo        2015-04-26   47  {
3bec5f81 arch/arm/mach-imx/clk-pllv1.c Shawn Guo        2015-04-26   48  	return !is_imx1_pllv1(pll) && !is_imx21_pllv1(pll) && (mfn & MFN_SIGN);
a5947903 arch/arm/mach-imx/clk-pllv1.c Alexander Shiyan 2013-11-10   49  }
a5947903 arch/arm/mach-imx/clk-pllv1.c Alexander Shiyan 2013-11-10   50  
2af9e6db arch/arm/mach-imx/clk-pllv1.c Sascha Hauer     2012-03-09   51  static unsigned long clk_pllv1_recalc_rate(struct clk_hw *hw,
2af9e6db arch/arm/mach-imx/clk-pllv1.c Sascha Hauer     2012-03-09   52  		unsigned long parent_rate)
2af9e6db arch/arm/mach-imx/clk-pllv1.c Sascha Hauer     2012-03-09   53  {
2af9e6db arch/arm/mach-imx/clk-pllv1.c Sascha Hauer     2012-03-09   54  	struct clk_pllv1 *pll = to_clk_pllv1(hw);
a6dd3c81 arch/arm/mach-imx/clk-pllv1.c Sascha Hauer     2012-09-11   55  	long long ll;
a6dd3c81 arch/arm/mach-imx/clk-pllv1.c Sascha Hauer     2012-09-11   56  	int mfn_abs;
a6dd3c81 arch/arm/mach-imx/clk-pllv1.c Sascha Hauer     2012-09-11   57  	unsigned int mfi, mfn, mfd, pd;
a6dd3c81 arch/arm/mach-imx/clk-pllv1.c Sascha Hauer     2012-09-11   58  	u32 reg;
a6dd3c81 arch/arm/mach-imx/clk-pllv1.c Sascha Hauer     2012-09-11   59  	unsigned long rate;
2af9e6db arch/arm/mach-imx/clk-pllv1.c Sascha Hauer     2012-03-09   60  
a6dd3c81 arch/arm/mach-imx/clk-pllv1.c Sascha Hauer     2012-09-11   61  	reg = readl(pll->base);
a6dd3c81 arch/arm/mach-imx/clk-pllv1.c Sascha Hauer     2012-09-11   62  
a6dd3c81 arch/arm/mach-imx/clk-pllv1.c Sascha Hauer     2012-09-11   63  	/*
a6dd3c81 arch/arm/mach-imx/clk-pllv1.c Sascha Hauer     2012-09-11   64  	 * Get the resulting clock rate from a PLL register value and the input
a6dd3c81 arch/arm/mach-imx/clk-pllv1.c Sascha Hauer     2012-09-11   65  	 * frequency. PLLs with this register layout can be found on i.MX1,
a6dd3c81 arch/arm/mach-imx/clk-pllv1.c Sascha Hauer     2012-09-11   66  	 * i.MX21, i.MX27 and i,MX31
a6dd3c81 arch/arm/mach-imx/clk-pllv1.c Sascha Hauer     2012-09-11   67  	 *
a6dd3c81 arch/arm/mach-imx/clk-pllv1.c Sascha Hauer     2012-09-11   68  	 *                  mfi + mfn / (mfd + 1)
a6dd3c81 arch/arm/mach-imx/clk-pllv1.c Sascha Hauer     2012-09-11   69  	 *  f = 2 * f_ref * --------------------
a6dd3c81 arch/arm/mach-imx/clk-pllv1.c Sascha Hauer     2012-09-11   70  	 *                        pd + 1
a6dd3c81 arch/arm/mach-imx/clk-pllv1.c Sascha Hauer     2012-09-11   71  	 */
a6dd3c81 arch/arm/mach-imx/clk-pllv1.c Sascha Hauer     2012-09-11   72  
a6dd3c81 arch/arm/mach-imx/clk-pllv1.c Sascha Hauer     2012-09-11   73  	mfi = (reg >> 10) & 0xf;
a6dd3c81 arch/arm/mach-imx/clk-pllv1.c Sascha Hauer     2012-09-11   74  	mfn = reg & 0x3ff;
a6dd3c81 arch/arm/mach-imx/clk-pllv1.c Sascha Hauer     2012-09-11   75  	mfd = (reg >> 16) & 0x3ff;
a6dd3c81 arch/arm/mach-imx/clk-pllv1.c Sascha Hauer     2012-09-11   76  	pd =  (reg >> 26) & 0xf;
a6dd3c81 arch/arm/mach-imx/clk-pllv1.c Sascha Hauer     2012-09-11   77  
a6dd3c81 arch/arm/mach-imx/clk-pllv1.c Sascha Hauer     2012-09-11   78  	mfi = mfi <= 5 ? 5 : mfi;
a6dd3c81 arch/arm/mach-imx/clk-pllv1.c Sascha Hauer     2012-09-11   79  
a6dd3c81 arch/arm/mach-imx/clk-pllv1.c Sascha Hauer     2012-09-11   80  	mfn_abs = mfn;
a6dd3c81 arch/arm/mach-imx/clk-pllv1.c Sascha Hauer     2012-09-11   81  
a6dd3c81 arch/arm/mach-imx/clk-pllv1.c Sascha Hauer     2012-09-11   82  	/*
a6dd3c81 arch/arm/mach-imx/clk-pllv1.c Sascha Hauer     2012-09-11   83  	 * On all i.MXs except i.MX1 and i.MX21 mfn is a 10bit
a5947903 arch/arm/mach-imx/clk-pllv1.c Alexander Shiyan 2013-11-10   84  	 * 2's complements number.
a5947903 arch/arm/mach-imx/clk-pllv1.c Alexander Shiyan 2013-11-10   85  	 * On i.MX27 the bit 9 is the sign bit.
a6dd3c81 arch/arm/mach-imx/clk-pllv1.c Sascha Hauer     2012-09-11   86  	 */
3bec5f81 arch/arm/mach-imx/clk-pllv1.c Shawn Guo        2015-04-26   87  	if (mfn_is_negative(pll, mfn)) {
3bec5f81 arch/arm/mach-imx/clk-pllv1.c Shawn Guo        2015-04-26   88  		if (is_imx27_pllv1(pll))
a5947903 arch/arm/mach-imx/clk-pllv1.c Alexander Shiyan 2013-11-10   89  			mfn_abs = mfn & MFN_MASK;
a5947903 arch/arm/mach-imx/clk-pllv1.c Alexander Shiyan 2013-11-10   90  		else
a5947903 arch/arm/mach-imx/clk-pllv1.c Alexander Shiyan 2013-11-10   91  			mfn_abs = BIT(MFN_BITS) - mfn;
a5947903 arch/arm/mach-imx/clk-pllv1.c Alexander Shiyan 2013-11-10   92  	}
a6dd3c81 arch/arm/mach-imx/clk-pllv1.c Sascha Hauer     2012-09-11   93  
a6dd3c81 arch/arm/mach-imx/clk-pllv1.c Sascha Hauer     2012-09-11   94  	rate = parent_rate * 2;
a6dd3c81 arch/arm/mach-imx/clk-pllv1.c Sascha Hauer     2012-09-11   95  	rate /= pd + 1;
a6dd3c81 arch/arm/mach-imx/clk-pllv1.c Sascha Hauer     2012-09-11   96  
a6dd3c81 arch/arm/mach-imx/clk-pllv1.c Sascha Hauer     2012-09-11   97  	ll = (unsigned long long)rate * mfn_abs;
a6dd3c81 arch/arm/mach-imx/clk-pllv1.c Sascha Hauer     2012-09-11   98  
a6dd3c81 arch/arm/mach-imx/clk-pllv1.c Sascha Hauer     2012-09-11  @99  	do_div(ll, mfd + 1);
a6dd3c81 arch/arm/mach-imx/clk-pllv1.c Sascha Hauer     2012-09-11  100  
3bec5f81 arch/arm/mach-imx/clk-pllv1.c Shawn Guo        2015-04-26  101  	if (mfn_is_negative(pll, mfn))
a6dd3c81 arch/arm/mach-imx/clk-pllv1.c Sascha Hauer     2012-09-11  102  		ll = -ll;

:::::: The code at line 99 was first introduced by commit
:::::: a6dd3c812e774b876d440c1a9ec1bd0fd5659390 ARM: i.MX clk pllv1: move mxc_decode_pll code to its user

:::::: TO: Sascha Hauer <s.hauer@pengutronix.de>
:::::: CC: Sascha Hauer <s.hauer@pengutronix.de>

---
0-DAY kernel test infrastructure                Open Source Technology Center
https://lists.01.org/pipermail/kbuild-all                   Intel Corporation

[-- Attachment #2: .config.gz --]
[-- Type: application/octet-stream, Size: 34547 bytes --]

^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: [PATCH 5/5] ARM: asm/div64.h: adjust to generic codde
  2015-11-03  1:25   ` kbuild test robot
@ 2015-11-03  4:03     ` Nicolas Pitre
  0 siblings, 0 replies; 22+ messages in thread
From: Nicolas Pitre @ 2015-11-03  4:03 UTC (permalink / raw)
  To: Michael Turquette, linux-clk, David Airlie, dri-devel
  Cc: kbuild test robot, kbuild-all, Alexey Brodkin,
	Måns Rullgård, Arnd Bergmann, rmk+kernel, linux-arch,
	linux-kernel

[added Mike/linux-clk and David/dri-devel]

A patch I produced is now highlighting existing bugs in the drivers 
listed below.

On Tue, 3 Nov 2015, kbuild test robot wrote:

> Hi Nicolas,
> 
> [auto build test WARNING on asm-generic/master -- if it's inappropriate base, please suggest rules for selecting the more suitable base]
> 
> url:    https://github.com/0day-ci/linux/commits/Nicolas-Pitre/div64-h-optimize-do_div-for-power-of-two-constant-divisors/20151103-065348
> config: arm-multi_v7_defconfig (attached as .config)
> reproduce:
>         wget https://git.kernel.org/cgit/linux/kernel/git/wfg/lkp-tests.git/plain/sbin/make.cross -O ~/bin/make.cross
>         chmod +x ~/bin/make.cross
>         # save the attached .config to linux build tree
>         make.cross ARCH=arm 
> 
> All warnings (new ones prefixed by >>):
> 
>    In file included from arch/arm/include/asm/div64.h:126:0,
>                     from include/linux/kernel.h:136,
>                     from include/asm-generic/bug.h:13,
>                     from arch/arm/include/asm/bug.h:62,
>                     from include/linux/bug.h:4,
>                     from include/linux/io.h:23,
>                     from include/linux/clk-provider.h:14,
>                     from drivers/clk/imx/clk-pllv1.c:1:
>    drivers/clk/imx/clk-pllv1.c: In function 'clk_pllv1_recalc_rate':
>    include/asm-generic/div64.h:217:28: warning: comparison of distinct pointer types lacks a cast
>      (void)(((typeof((n)) *)0) == ((uint64_t *)0)); \
>                                ^
> >> drivers/clk/imx/clk-pllv1.c:99:2: note: in expansion of macro 'do_div'
>      do_div(ll, mfd + 1);
>      ^

Here the problem is in clk-pllv1.c where the ll variable is declared as 
a long long. It should be an unsigned long long, or better yet an 
uint64_t or u64.

> --
>    In file included from arch/arm/include/asm/div64.h:126:0,
>                     from include/linux/kernel.h:136,
>                     from drivers/clk/imx/clk-pllv2.c:1:
>    drivers/clk/imx/clk-pllv2.c: In function '__clk_pllv2_recalc_rate':
>    include/asm-generic/div64.h:217:28: warning: comparison of distinct pointer types lacks a cast
>      (void)(((typeof((n)) *)0) == ((uint64_t *)0)); \
>                                ^
> >> drivers/clk/imx/clk-pllv2.c:103:2: note: in expansion of macro 'do_div'
>      do_div(temp, mfd + 1);
>      ^

Same thing: temp is declared as a s64. It should be u64.

>    drivers/clk/imx/clk-pllv2.c: In function '__clk_pllv2_set_rate':
>    include/asm-generic/div64.h:217:28: warning: comparison of distinct pointer types lacks a cast
>      (void)(((typeof((n)) *)0) == ((uint64_t *)0)); \
>                                ^
>    drivers/clk/imx/clk-pllv2.c:145:2: note: in expansion of macro 'do_div'
>      do_div(temp64, quad_parent_rate / 1000000);
>      ^

Ditto here.

> --
>    In file included from arch/arm/include/asm/div64.h:126:0,
>                     from include/linux/kernel.h:136,
>                     from drivers/clk/tegra/clk-divider.c:17:
>    drivers/clk/tegra/clk-divider.c: In function 'get_div':
>    include/asm-generic/div64.h:217:28: warning: comparison of distinct pointer types lacks a cast
>      (void)(((typeof((n)) *)0) == ((uint64_t *)0)); \
>                                ^
> >> drivers/clk/tegra/clk-divider.c:50:2: note: in expansion of macro 'do_div'
>      do_div(divider_ux1, rate);
>      ^

Ditto here.

> --
>    In file included from arch/arm/include/asm/div64.h:126:0,
>                     from include/linux/kernel.h:136,
>                     from drivers/clk/ti/clkt_dpll.c:17:
>    drivers/clk/ti/clkt_dpll.c: In function 'omap2_get_dpll_rate':
>    include/asm-generic/div64.h:217:28: warning: comparison of distinct pointer types lacks a cast
>      (void)(((typeof((n)) *)0) == ((uint64_t *)0)); \
>                                ^
> >> drivers/clk/ti/clkt_dpll.c:266:2: note: in expansion of macro 'do_div'
>      do_div(dpll_clk, dpll_div + 1);
>      ^

Ditto here.

> --
>    In file included from arch/arm/include/asm/div64.h:126:0,
>                     from include/linux/kernel.h:136,
>                     from include/linux/clk.h:16,
>                     from drivers/clk/ti/fapll.c:12:
>    drivers/clk/ti/fapll.c: In function 'ti_fapll_recalc_rate':
>    include/asm-generic/div64.h:217:28: warning: comparison of distinct pointer types lacks a cast
>      (void)(((typeof((n)) *)0) == ((uint64_t *)0)); \
>                                ^
> >> drivers/clk/ti/fapll.c:182:3: note: in expansion of macro 'do_div'
>       do_div(rate, fapll_p);
>       ^

Ditto here.

>    drivers/clk/ti/fapll.c: In function 'ti_fapll_synth_recalc_rate':
>    include/asm-generic/div64.h:217:28: warning: comparison of distinct pointer types lacks a cast
>      (void)(((typeof((n)) *)0) == ((uint64_t *)0)); \
>                                ^
>    drivers/clk/ti/fapll.c:346:3: note: in expansion of macro 'do_div'
>       do_div(rate, synth_div_freq);
>       ^

Ditto here.

> --
>    In file included from arch/arm/include/asm/div64.h:126:0,
>                     from include/linux/kernel.h:136,
>                     from include/linux/list.h:8,
>                     from include/linux/preempt.h:10,
>                     from include/linux/spinlock.h:50,
>                     from include/linux/mmzone.h:7,
>                     from include/linux/gfp.h:5,
>                     from include/linux/slab.h:14,
>                     from drivers/gpu/drm/nouveau/include/nvif/os.h:5,
>                     from drivers/gpu/drm/nouveau/include/nvkm/core/os.h:3,
>                     from drivers/gpu/drm/nouveau/include/nvkm/core/event.h:3,
>                     from drivers/gpu/drm/nouveau/include/nvkm/core/device.h:3,
>                     from drivers/gpu/drm/nouveau/include/nvkm/core/subdev.h:3,
>                     from drivers/gpu/drm/nouveau/include/nvkm/subdev/clk.h:3,
>                     from drivers/gpu/drm/nouveau/nvkm/subdev/clk/priv.h:4,
>                     from drivers/gpu/drm/nouveau/nvkm/subdev/clk/gk20a.c:26:
>    drivers/gpu/drm/nouveau/nvkm/subdev/clk/gk20a.c: In function 'gk20a_pllg_calc_rate':
>    include/asm-generic/div64.h:217:28: warning: comparison of distinct pointer types lacks a cast
>      (void)(((typeof((n)) *)0) == ((uint64_t *)0)); \
>                                ^
>    drivers/gpu/drm/nouveau/nvkm/subdev/clk/gk20a.c:144:2: note: in expansion of macro 'do_div'
>      do_div(rate, divider);
>      ^
>    drivers/gpu/drm/nouveau/nvkm/subdev/clk/gk20a.c:144:2: warning: right shift count >= width of type
>    In file included from include/linux/kernel.h:136:0,
>                     from include/linux/list.h:8,
>                     from include/linux/preempt.h:10,
>                     from include/linux/spinlock.h:50,
>                     from include/linux/mmzone.h:7,
>                     from include/linux/gfp.h:5,
>                     from include/linux/slab.h:14,
>                     from drivers/gpu/drm/nouveau/include/nvif/os.h:5,
>                     from drivers/gpu/drm/nouveau/include/nvkm/core/os.h:3,
>                     from drivers/gpu/drm/nouveau/include/nvkm/core/event.h:3,
>                     from drivers/gpu/drm/nouveau/include/nvkm/core/device.h:3,
>                     from drivers/gpu/drm/nouveau/include/nvkm/core/subdev.h:3,
>                     from drivers/gpu/drm/nouveau/include/nvkm/subdev/clk.h:3,
>                     from drivers/gpu/drm/nouveau/nvkm/subdev/clk/priv.h:4,
>                     from drivers/gpu/drm/nouveau/nvkm/subdev/clk/gk20a.c:26:
> >> arch/arm/include/asm/div64.h:49:20: warning: passing argument 1 of '__div64_32' from incompatible pointer type
>     #define __div64_32 __div64_32
>                        ^
> >> include/asm-generic/div64.h:235:11: note: in expansion of macro '__div64_32'
>       __rem = __div64_32(&(n), __base); \
>               ^
>    drivers/gpu/drm/nouveau/nvkm/subdev/clk/gk20a.c:144:2: note: in expansion of macro 'do_div'
>      do_div(rate, divider);
>      ^
>    arch/arm/include/asm/div64.h:32:24: note: expected 'uint64_t *' but argument is of type 'u32 *'
>     static inline uint32_t __div64_32(uint64_t *n, uint32_t base)
>                            ^

This one is a plain bug.

> --
>    In file included from arch/arm/include/asm/div64.h:126:0,
>                     from include/linux/kernel.h:136,
>                     from include/linux/list.h:8,
>                     from include/linux/preempt.h:10,
>                     from include/linux/spinlock.h:50,
>                     from include/linux/mmzone.h:7,
>                     from include/linux/gfp.h:5,
>                     from include/linux/slab.h:14,
>                     from drivers/gpu/drm//nouveau/include/nvif/os.h:5,
>                     from drivers/gpu/drm//nouveau/include/nvkm/core/os.h:3,
>                     from drivers/gpu/drm//nouveau/include/nvkm/core/event.h:3,
>                     from drivers/gpu/drm//nouveau/include/nvkm/core/device.h:3,
>                     from drivers/gpu/drm//nouveau/include/nvkm/core/subdev.h:3,
>                     from drivers/gpu/drm//nouveau/include/nvkm/subdev/clk.h:3,
>                     from drivers/gpu/drm//nouveau/nvkm/subdev/clk/priv.h:4,
>                     from drivers/gpu/drm//nouveau/nvkm/subdev/clk/gk20a.c:26:
>    drivers/gpu/drm//nouveau/nvkm/subdev/clk/gk20a.c: In function 'gk20a_pllg_calc_rate':
>    include/asm-generic/div64.h:217:28: warning: comparison of distinct pointer types lacks a cast
>      (void)(((typeof((n)) *)0) == ((uint64_t *)0)); \
>                                ^

This one is a driver bug too.

>    drivers/gpu/drm//nouveau/nvkm/subdev/clk/gk20a.c:144:2: note: in expansion of macro 'do_div'
>      do_div(rate, divider);
>      ^
>    drivers/gpu/drm//nouveau/nvkm/subdev/clk/gk20a.c:144:2: warning: right shift count >= width of type
>    In file included from include/linux/kernel.h:136:0,
>                     from include/linux/list.h:8,
>                     from include/linux/preempt.h:10,
>                     from include/linux/spinlock.h:50,
>                     from include/linux/mmzone.h:7,
>                     from include/linux/gfp.h:5,
>                     from include/linux/slab.h:14,
>                     from drivers/gpu/drm//nouveau/include/nvif/os.h:5,
>                     from drivers/gpu/drm//nouveau/include/nvkm/core/os.h:3,
>                     from drivers/gpu/drm//nouveau/include/nvkm/core/event.h:3,
>                     from drivers/gpu/drm//nouveau/include/nvkm/core/device.h:3,
>                     from drivers/gpu/drm//nouveau/include/nvkm/core/subdev.h:3,
>                     from drivers/gpu/drm//nouveau/include/nvkm/subdev/clk.h:3,
>                     from drivers/gpu/drm//nouveau/nvkm/subdev/clk/priv.h:4,
>                     from drivers/gpu/drm//nouveau/nvkm/subdev/clk/gk20a.c:26:
> >> arch/arm/include/asm/div64.h:49:20: warning: passing argument 1 of '__div64_32' from incompatible pointer type
>     #define __div64_32 __div64_32
>                        ^
> >> include/asm-generic/div64.h:235:11: note: in expansion of macro '__div64_32'
>       __rem = __div64_32(&(n), __base); \
>               ^
>    drivers/gpu/drm//nouveau/nvkm/subdev/clk/gk20a.c:144:2: note: in expansion of macro 'do_div'
>      do_div(rate, divider);
>      ^
>    arch/arm/include/asm/div64.h:32:24: note: expected 'uint64_t *' but argument is of type 'u32 *'
>     static inline uint32_t __div64_32(uint64_t *n, uint32_t base)
>                            ^
> 
> vim +/do_div +99 drivers/clk/imx/clk-pllv1.c
> 
> 2af9e6db arch/arm/mach-imx/clk-pllv1.c Sascha Hauer     2012-03-09   @1  #include <linux/clk-provider.h>
> 2af9e6db arch/arm/mach-imx/clk-pllv1.c Sascha Hauer     2012-03-09    2  #include <linux/io.h>
> 2af9e6db arch/arm/mach-imx/clk-pllv1.c Sascha Hauer     2012-03-09    3  #include <linux/slab.h>
> 2af9e6db arch/arm/mach-imx/clk-pllv1.c Sascha Hauer     2012-03-09    4  #include <linux/kernel.h>
> 2af9e6db arch/arm/mach-imx/clk-pllv1.c Sascha Hauer     2012-03-09    5  #include <linux/err.h>
> 3a84d17b arch/arm/mach-imx/clk-pllv1.c Sascha Hauer     2012-09-11    6  
> 2af9e6db arch/arm/mach-imx/clk-pllv1.c Sascha Hauer     2012-03-09    7  #include "clk.h"
> 2af9e6db arch/arm/mach-imx/clk-pllv1.c Sascha Hauer     2012-03-09    8  
> 2af9e6db arch/arm/mach-imx/clk-pllv1.c Sascha Hauer     2012-03-09    9  /**
> 2af9e6db arch/arm/mach-imx/clk-pllv1.c Sascha Hauer     2012-03-09   10   * pll v1
> 2af9e6db arch/arm/mach-imx/clk-pllv1.c Sascha Hauer     2012-03-09   11   *
> 2af9e6db arch/arm/mach-imx/clk-pllv1.c Sascha Hauer     2012-03-09   12   * @clk_hw	clock source
> 2af9e6db arch/arm/mach-imx/clk-pllv1.c Sascha Hauer     2012-03-09   13   * @parent	the parent clock name
> 2af9e6db arch/arm/mach-imx/clk-pllv1.c Sascha Hauer     2012-03-09   14   * @base	base address of pll registers
> 2af9e6db arch/arm/mach-imx/clk-pllv1.c Sascha Hauer     2012-03-09   15   *
> 2af9e6db arch/arm/mach-imx/clk-pllv1.c Sascha Hauer     2012-03-09   16   * PLL clock version 1, found on i.MX1/21/25/27/31/35
> 2af9e6db arch/arm/mach-imx/clk-pllv1.c Sascha Hauer     2012-03-09   17   */
> a5947903 arch/arm/mach-imx/clk-pllv1.c Alexander Shiyan 2013-11-10   18  
> a5947903 arch/arm/mach-imx/clk-pllv1.c Alexander Shiyan 2013-11-10   19  #define MFN_BITS	(10)
> a5947903 arch/arm/mach-imx/clk-pllv1.c Alexander Shiyan 2013-11-10   20  #define MFN_SIGN	(BIT(MFN_BITS - 1))
> a5947903 arch/arm/mach-imx/clk-pllv1.c Alexander Shiyan 2013-11-10   21  #define MFN_MASK	(MFN_SIGN - 1)
> a5947903 arch/arm/mach-imx/clk-pllv1.c Alexander Shiyan 2013-11-10   22  
> 2af9e6db arch/arm/mach-imx/clk-pllv1.c Sascha Hauer     2012-03-09   23  struct clk_pllv1 {
> 2af9e6db arch/arm/mach-imx/clk-pllv1.c Sascha Hauer     2012-03-09   24  	struct clk_hw	hw;
> 2af9e6db arch/arm/mach-imx/clk-pllv1.c Sascha Hauer     2012-03-09   25  	void __iomem	*base;
> 3bec5f81 arch/arm/mach-imx/clk-pllv1.c Shawn Guo        2015-04-26   26  	enum imx_pllv1_type type;
> 2af9e6db arch/arm/mach-imx/clk-pllv1.c Sascha Hauer     2012-03-09   27  };
> 2af9e6db arch/arm/mach-imx/clk-pllv1.c Sascha Hauer     2012-03-09   28  
> 2af9e6db arch/arm/mach-imx/clk-pllv1.c Sascha Hauer     2012-03-09   29  #define to_clk_pllv1(clk) (container_of(clk, struct clk_pllv1, clk))
> 2af9e6db arch/arm/mach-imx/clk-pllv1.c Sascha Hauer     2012-03-09   30  
> 3bec5f81 arch/arm/mach-imx/clk-pllv1.c Shawn Guo        2015-04-26   31  static inline bool is_imx1_pllv1(struct clk_pllv1 *pll)
> a5947903 arch/arm/mach-imx/clk-pllv1.c Alexander Shiyan 2013-11-10   32  {
> 3bec5f81 arch/arm/mach-imx/clk-pllv1.c Shawn Guo        2015-04-26   33  	return pll->type == IMX_PLLV1_IMX1;
> 3bec5f81 arch/arm/mach-imx/clk-pllv1.c Shawn Guo        2015-04-26   34  }
> 3bec5f81 arch/arm/mach-imx/clk-pllv1.c Shawn Guo        2015-04-26   35  
> 3bec5f81 arch/arm/mach-imx/clk-pllv1.c Shawn Guo        2015-04-26   36  static inline bool is_imx21_pllv1(struct clk_pllv1 *pll)
> 3bec5f81 arch/arm/mach-imx/clk-pllv1.c Shawn Guo        2015-04-26   37  {
> 3bec5f81 arch/arm/mach-imx/clk-pllv1.c Shawn Guo        2015-04-26   38  	return pll->type == IMX_PLLV1_IMX21;
> 3bec5f81 arch/arm/mach-imx/clk-pllv1.c Shawn Guo        2015-04-26   39  }
> 3bec5f81 arch/arm/mach-imx/clk-pllv1.c Shawn Guo        2015-04-26   40  
> 3bec5f81 arch/arm/mach-imx/clk-pllv1.c Shawn Guo        2015-04-26   41  static inline bool is_imx27_pllv1(struct clk_pllv1 *pll)
> 3bec5f81 arch/arm/mach-imx/clk-pllv1.c Shawn Guo        2015-04-26   42  {
> 3bec5f81 arch/arm/mach-imx/clk-pllv1.c Shawn Guo        2015-04-26   43  	return pll->type == IMX_PLLV1_IMX27;
> 3bec5f81 arch/arm/mach-imx/clk-pllv1.c Shawn Guo        2015-04-26   44  }
> 3bec5f81 arch/arm/mach-imx/clk-pllv1.c Shawn Guo        2015-04-26   45  
> 3bec5f81 arch/arm/mach-imx/clk-pllv1.c Shawn Guo        2015-04-26   46  static inline bool mfn_is_negative(struct clk_pllv1 *pll, unsigned int mfn)
> 3bec5f81 arch/arm/mach-imx/clk-pllv1.c Shawn Guo        2015-04-26   47  {
> 3bec5f81 arch/arm/mach-imx/clk-pllv1.c Shawn Guo        2015-04-26   48  	return !is_imx1_pllv1(pll) && !is_imx21_pllv1(pll) && (mfn & MFN_SIGN);
> a5947903 arch/arm/mach-imx/clk-pllv1.c Alexander Shiyan 2013-11-10   49  }
> a5947903 arch/arm/mach-imx/clk-pllv1.c Alexander Shiyan 2013-11-10   50  
> 2af9e6db arch/arm/mach-imx/clk-pllv1.c Sascha Hauer     2012-03-09   51  static unsigned long clk_pllv1_recalc_rate(struct clk_hw *hw,
> 2af9e6db arch/arm/mach-imx/clk-pllv1.c Sascha Hauer     2012-03-09   52  		unsigned long parent_rate)
> 2af9e6db arch/arm/mach-imx/clk-pllv1.c Sascha Hauer     2012-03-09   53  {
> 2af9e6db arch/arm/mach-imx/clk-pllv1.c Sascha Hauer     2012-03-09   54  	struct clk_pllv1 *pll = to_clk_pllv1(hw);
> a6dd3c81 arch/arm/mach-imx/clk-pllv1.c Sascha Hauer     2012-09-11   55  	long long ll;
> a6dd3c81 arch/arm/mach-imx/clk-pllv1.c Sascha Hauer     2012-09-11   56  	int mfn_abs;
> a6dd3c81 arch/arm/mach-imx/clk-pllv1.c Sascha Hauer     2012-09-11   57  	unsigned int mfi, mfn, mfd, pd;
> a6dd3c81 arch/arm/mach-imx/clk-pllv1.c Sascha Hauer     2012-09-11   58  	u32 reg;
> a6dd3c81 arch/arm/mach-imx/clk-pllv1.c Sascha Hauer     2012-09-11   59  	unsigned long rate;
> 2af9e6db arch/arm/mach-imx/clk-pllv1.c Sascha Hauer     2012-03-09   60  
> a6dd3c81 arch/arm/mach-imx/clk-pllv1.c Sascha Hauer     2012-09-11   61  	reg = readl(pll->base);
> a6dd3c81 arch/arm/mach-imx/clk-pllv1.c Sascha Hauer     2012-09-11   62  
> a6dd3c81 arch/arm/mach-imx/clk-pllv1.c Sascha Hauer     2012-09-11   63  	/*
> a6dd3c81 arch/arm/mach-imx/clk-pllv1.c Sascha Hauer     2012-09-11   64  	 * Get the resulting clock rate from a PLL register value and the input
> a6dd3c81 arch/arm/mach-imx/clk-pllv1.c Sascha Hauer     2012-09-11   65  	 * frequency. PLLs with this register layout can be found on i.MX1,
> a6dd3c81 arch/arm/mach-imx/clk-pllv1.c Sascha Hauer     2012-09-11   66  	 * i.MX21, i.MX27 and i,MX31
> a6dd3c81 arch/arm/mach-imx/clk-pllv1.c Sascha Hauer     2012-09-11   67  	 *
> a6dd3c81 arch/arm/mach-imx/clk-pllv1.c Sascha Hauer     2012-09-11   68  	 *                  mfi + mfn / (mfd + 1)
> a6dd3c81 arch/arm/mach-imx/clk-pllv1.c Sascha Hauer     2012-09-11   69  	 *  f = 2 * f_ref * --------------------
> a6dd3c81 arch/arm/mach-imx/clk-pllv1.c Sascha Hauer     2012-09-11   70  	 *                        pd + 1
> a6dd3c81 arch/arm/mach-imx/clk-pllv1.c Sascha Hauer     2012-09-11   71  	 */
> a6dd3c81 arch/arm/mach-imx/clk-pllv1.c Sascha Hauer     2012-09-11   72  
> a6dd3c81 arch/arm/mach-imx/clk-pllv1.c Sascha Hauer     2012-09-11   73  	mfi = (reg >> 10) & 0xf;
> a6dd3c81 arch/arm/mach-imx/clk-pllv1.c Sascha Hauer     2012-09-11   74  	mfn = reg & 0x3ff;
> a6dd3c81 arch/arm/mach-imx/clk-pllv1.c Sascha Hauer     2012-09-11   75  	mfd = (reg >> 16) & 0x3ff;
> a6dd3c81 arch/arm/mach-imx/clk-pllv1.c Sascha Hauer     2012-09-11   76  	pd =  (reg >> 26) & 0xf;
> a6dd3c81 arch/arm/mach-imx/clk-pllv1.c Sascha Hauer     2012-09-11   77  
> a6dd3c81 arch/arm/mach-imx/clk-pllv1.c Sascha Hauer     2012-09-11   78  	mfi = mfi <= 5 ? 5 : mfi;
> a6dd3c81 arch/arm/mach-imx/clk-pllv1.c Sascha Hauer     2012-09-11   79  
> a6dd3c81 arch/arm/mach-imx/clk-pllv1.c Sascha Hauer     2012-09-11   80  	mfn_abs = mfn;
> a6dd3c81 arch/arm/mach-imx/clk-pllv1.c Sascha Hauer     2012-09-11   81  
> a6dd3c81 arch/arm/mach-imx/clk-pllv1.c Sascha Hauer     2012-09-11   82  	/*
> a6dd3c81 arch/arm/mach-imx/clk-pllv1.c Sascha Hauer     2012-09-11   83  	 * On all i.MXs except i.MX1 and i.MX21 mfn is a 10bit
> a5947903 arch/arm/mach-imx/clk-pllv1.c Alexander Shiyan 2013-11-10   84  	 * 2's complements number.
> a5947903 arch/arm/mach-imx/clk-pllv1.c Alexander Shiyan 2013-11-10   85  	 * On i.MX27 the bit 9 is the sign bit.
> a6dd3c81 arch/arm/mach-imx/clk-pllv1.c Sascha Hauer     2012-09-11   86  	 */
> 3bec5f81 arch/arm/mach-imx/clk-pllv1.c Shawn Guo        2015-04-26   87  	if (mfn_is_negative(pll, mfn)) {
> 3bec5f81 arch/arm/mach-imx/clk-pllv1.c Shawn Guo        2015-04-26   88  		if (is_imx27_pllv1(pll))
> a5947903 arch/arm/mach-imx/clk-pllv1.c Alexander Shiyan 2013-11-10   89  			mfn_abs = mfn & MFN_MASK;
> a5947903 arch/arm/mach-imx/clk-pllv1.c Alexander Shiyan 2013-11-10   90  		else
> a5947903 arch/arm/mach-imx/clk-pllv1.c Alexander Shiyan 2013-11-10   91  			mfn_abs = BIT(MFN_BITS) - mfn;
> a5947903 arch/arm/mach-imx/clk-pllv1.c Alexander Shiyan 2013-11-10   92  	}
> a6dd3c81 arch/arm/mach-imx/clk-pllv1.c Sascha Hauer     2012-09-11   93  
> a6dd3c81 arch/arm/mach-imx/clk-pllv1.c Sascha Hauer     2012-09-11   94  	rate = parent_rate * 2;
> a6dd3c81 arch/arm/mach-imx/clk-pllv1.c Sascha Hauer     2012-09-11   95  	rate /= pd + 1;
> a6dd3c81 arch/arm/mach-imx/clk-pllv1.c Sascha Hauer     2012-09-11   96  
> a6dd3c81 arch/arm/mach-imx/clk-pllv1.c Sascha Hauer     2012-09-11   97  	ll = (unsigned long long)rate * mfn_abs;
> a6dd3c81 arch/arm/mach-imx/clk-pllv1.c Sascha Hauer     2012-09-11   98  
> a6dd3c81 arch/arm/mach-imx/clk-pllv1.c Sascha Hauer     2012-09-11  @99  	do_div(ll, mfd + 1);
> a6dd3c81 arch/arm/mach-imx/clk-pllv1.c Sascha Hauer     2012-09-11  100  
> 3bec5f81 arch/arm/mach-imx/clk-pllv1.c Shawn Guo        2015-04-26  101  	if (mfn_is_negative(pll, mfn))
> a6dd3c81 arch/arm/mach-imx/clk-pllv1.c Sascha Hauer     2012-09-11  102  		ll = -ll;
> 
> :::::: The code at line 99 was first introduced by commit
> :::::: a6dd3c812e774b876d440c1a9ec1bd0fd5659390 ARM: i.MX clk pllv1: move mxc_decode_pll code to its user
> 
> :::::: TO: Sascha Hauer <s.hauer@pengutronix.de>
> :::::: CC: Sascha Hauer <s.hauer@pengutronix.de>
> 
> ---
> 0-DAY kernel test infrastructure                Open Source Technology Center
> https://lists.01.org/pipermail/kbuild-all                   Intel Corporation
> 

^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: [PATCH 5/5] ARM: asm/div64.h: adjust to generic codde
  2015-11-02 22:33 ` [PATCH 5/5] ARM: asm/div64.h: adjust to generic codde Nicolas Pitre
  2015-11-02 22:33   ` Nicolas Pitre
  2015-11-03  1:25   ` kbuild test robot
@ 2015-11-03 21:39   ` kbuild test robot
  2015-11-19 16:36   ` Måns Rullgård
  3 siblings, 0 replies; 22+ messages in thread
From: kbuild test robot @ 2015-11-03 21:39 UTC (permalink / raw)
  To: Nicolas Pitre
  Cc: kbuild-all, Alexey Brodkin, Måns Rullgård,
	Arnd Bergmann, rmk+kernel, linux-arch, linux-kernel

[-- Attachment #1: Type: text/plain, Size: 6111 bytes --]

Hi Nicolas,

[auto build test WARNING on asm-generic/master]
[also WARNING on: v4.3 next-20151103]

url:    https://github.com/0day-ci/linux/commits/Nicolas-Pitre/div64-h-optimize-do_div-for-power-of-two-constant-divisors/20151103-065348
base:   https://github.com/0day-ci/linux Nicolas-Pitre/div64-h-optimize-do_div-for-power-of-two-constant-divisors/20151103-065348
config: arm-allyesconfig (attached as .config)
reproduce:
        wget https://git.kernel.org/cgit/linux/kernel/git/wfg/lkp-tests.git/plain/sbin/make.cross -O ~/bin/make.cross
        chmod +x ~/bin/make.cross
        # save the attached .config to linux build tree
        make.cross ARCH=arm 

All warnings (new ones prefixed by >>):

   In file included from arch/arm/include/asm/div64.h:126:0,
                    from include/linux/kernel.h:136,
                    from drivers/cpufreq/s5pv210-cpufreq.c:13:
   drivers/cpufreq/s5pv210-cpufreq.c: In function 's5pv210_set_refresh':
   include/asm-generic/div64.h:217:28: warning: comparison of distinct pointer types lacks a cast
     (void)(((typeof((n)) *)0) == ((uint64_t *)0)); \
                               ^
>> drivers/cpufreq/s5pv210-cpufreq.c:215:2: note: in expansion of macro 'do_div'
     do_div(tmp, freq);
     ^
>> drivers/cpufreq/s5pv210-cpufreq.c:215:2: warning: right shift count >= width of type
   In file included from include/linux/kernel.h:136:0,
                    from drivers/cpufreq/s5pv210-cpufreq.c:13:
   arch/arm/include/asm/div64.h:49:20: warning: passing argument 1 of '__div64_32' from incompatible pointer type
    #define __div64_32 __div64_32
                       ^
   include/asm-generic/div64.h:235:11: note: in expansion of macro '__div64_32'
      __rem = __div64_32(&(n), __base); \
              ^
>> drivers/cpufreq/s5pv210-cpufreq.c:215:2: note: in expansion of macro 'do_div'
     do_div(tmp, freq);
     ^
   arch/arm/include/asm/div64.h:32:24: note: expected 'uint64_t *' but argument is of type 'long unsigned int *'
    static inline uint32_t __div64_32(uint64_t *n, uint32_t base)
                           ^
   In file included from arch/arm/include/asm/div64.h:126:0,
                    from include/linux/kernel.h:136,
                    from drivers/cpufreq/s5pv210-cpufreq.c:13:
   include/asm-generic/div64.h:217:28: warning: comparison of distinct pointer types lacks a cast
     (void)(((typeof((n)) *)0) == ((uint64_t *)0)); \
                               ^
   drivers/cpufreq/s5pv210-cpufreq.c:219:2: note: in expansion of macro 'do_div'
     do_div(tmp1, tmp);
     ^
   drivers/cpufreq/s5pv210-cpufreq.c:219:2: warning: right shift count >= width of type
   In file included from include/linux/kernel.h:136:0,
                    from drivers/cpufreq/s5pv210-cpufreq.c:13:
   arch/arm/include/asm/div64.h:49:20: warning: passing argument 1 of '__div64_32' from incompatible pointer type
    #define __div64_32 __div64_32
                       ^
   include/asm-generic/div64.h:235:11: note: in expansion of macro '__div64_32'
      __rem = __div64_32(&(n), __base); \
              ^
   drivers/cpufreq/s5pv210-cpufreq.c:219:2: note: in expansion of macro 'do_div'
     do_div(tmp1, tmp);
     ^
   arch/arm/include/asm/div64.h:32:24: note: expected 'uint64_t *' but argument is of type 'long unsigned int *'
    static inline uint32_t __div64_32(uint64_t *n, uint32_t base)
                           ^

vim +/do_div +215 drivers/cpufreq/s5pv210-cpufreq.c

83efc743 arch/arm/mach-s5pv210/cpufreq.c   Jaecheol Lee  2010-10-12  199  {
83efc743 arch/arm/mach-s5pv210/cpufreq.c   Jaecheol Lee  2010-10-12  200  	unsigned long tmp, tmp1;
83efc743 arch/arm/mach-s5pv210/cpufreq.c   Jaecheol Lee  2010-10-12  201  	void __iomem *reg = NULL;
83efc743 arch/arm/mach-s5pv210/cpufreq.c   Jaecheol Lee  2010-10-12  202  
d62fa311 arch/arm/mach-s5pv210/cpufreq.c   Jonghwan Choi 2011-05-12  203  	if (ch == DMC0) {
6d4ed0f4 drivers/cpufreq/s5pv210-cpufreq.c Tomasz Figa   2014-07-03  204  		reg = (dmc_base[0] + 0x30);
d62fa311 arch/arm/mach-s5pv210/cpufreq.c   Jonghwan Choi 2011-05-12  205  	} else if (ch == DMC1) {
6d4ed0f4 drivers/cpufreq/s5pv210-cpufreq.c Tomasz Figa   2014-07-03  206  		reg = (dmc_base[1] + 0x30);
d62fa311 arch/arm/mach-s5pv210/cpufreq.c   Jonghwan Choi 2011-05-12  207  	} else {
83efc743 arch/arm/mach-s5pv210/cpufreq.c   Jaecheol Lee  2010-10-12  208  		printk(KERN_ERR "Cannot find DMC port\n");
d62fa311 arch/arm/mach-s5pv210/cpufreq.c   Jonghwan Choi 2011-05-12  209  		return;
d62fa311 arch/arm/mach-s5pv210/cpufreq.c   Jonghwan Choi 2011-05-12  210  	}
83efc743 arch/arm/mach-s5pv210/cpufreq.c   Jaecheol Lee  2010-10-12  211  
83efc743 arch/arm/mach-s5pv210/cpufreq.c   Jaecheol Lee  2010-10-12  212  	/* Find current DRAM frequency */
83efc743 arch/arm/mach-s5pv210/cpufreq.c   Jaecheol Lee  2010-10-12  213  	tmp = s5pv210_dram_conf[ch].freq;
83efc743 arch/arm/mach-s5pv210/cpufreq.c   Jaecheol Lee  2010-10-12  214  
83efc743 arch/arm/mach-s5pv210/cpufreq.c   Jaecheol Lee  2010-10-12 @215  	do_div(tmp, freq);
83efc743 arch/arm/mach-s5pv210/cpufreq.c   Jaecheol Lee  2010-10-12  216  
83efc743 arch/arm/mach-s5pv210/cpufreq.c   Jaecheol Lee  2010-10-12  217  	tmp1 = s5pv210_dram_conf[ch].refresh;
83efc743 arch/arm/mach-s5pv210/cpufreq.c   Jaecheol Lee  2010-10-12  218  
83efc743 arch/arm/mach-s5pv210/cpufreq.c   Jaecheol Lee  2010-10-12  219  	do_div(tmp1, tmp);
83efc743 arch/arm/mach-s5pv210/cpufreq.c   Jaecheol Lee  2010-10-12  220  
83efc743 arch/arm/mach-s5pv210/cpufreq.c   Jaecheol Lee  2010-10-12  221  	__raw_writel(tmp1, reg);
83efc743 arch/arm/mach-s5pv210/cpufreq.c   Jaecheol Lee  2010-10-12  222  }
83efc743 arch/arm/mach-s5pv210/cpufreq.c   Jaecheol Lee  2010-10-12  223  

:::::: The code at line 215 was first introduced by commit
:::::: 83efc7432f881d4f4bcedd6800710c2c4c24c58d ARM: S5PV210: Add support CPUFREQ

:::::: TO: Jaecheol Lee <jc.lee@samsung.com>
:::::: CC: Kukjin Kim <kgene.kim@samsung.com>

---
0-DAY kernel test infrastructure                Open Source Technology Center
https://lists.01.org/pipermail/kbuild-all                   Intel Corporation

[-- Attachment #2: .config.gz --]
[-- Type: application/octet-stream, Size: 53076 bytes --]

^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: [PATCH 5/5] ARM: asm/div64.h: adjust to generic codde
  2015-11-02 22:33 ` [PATCH 5/5] ARM: asm/div64.h: adjust to generic codde Nicolas Pitre
                     ` (2 preceding siblings ...)
  2015-11-03 21:39   ` kbuild test robot
@ 2015-11-19 16:36   ` Måns Rullgård
  2015-11-19 16:42     ` Nicolas Pitre
  3 siblings, 1 reply; 22+ messages in thread
From: Måns Rullgård @ 2015-11-19 16:36 UTC (permalink / raw)
  To: Nicolas Pitre
  Cc: Alexey Brodkin, Arnd Bergmann, rmk+kernel, linux-arch,
	linux-kernel

Nicolas Pitre <nicolas.pitre@linaro.org> writes:

> +static inline uint64_t __arch_xprod_64(uint64_t m, uint64_t n, bool bias)
> +{
> +	unsigned long long res;
> +	unsigned int tmp = 0;
> +
> +	if (!bias) {
> +		asm (	"umull	%Q0, %R0, %Q1, %Q2\n\t"
> +			"mov	%Q0, #0"
> +			: "=&r" (res)
> +			: "r" (m), "r" (n)
> +			: "cc");
> +	} else if (!(m & ((1ULL << 63) | (1ULL << 31)))) {
> +		res = m;
> +		asm (	"umlal	%Q0, %R0, %Q1, %Q2\n\t"
> +			"mov	%Q0, #0"
> +			: "+&r" (res)
> +			: "r" (m), "r" (n)
> +			: "cc");
> +	} else {
> +		asm (	"umull	%Q0, %R0, %Q2, %Q3\n\t"
> +			"cmn	%Q0, %Q2\n\t"
> +			"adcs	%R0, %R0, %R2\n\t"
> +			"adc	%Q0, %1, #0"
> +			: "=&r" (res), "+&r" (tmp)
> +			: "r" (m), "r" (n)

Why is tmp using a +r constraint here?  The register is not written, so
using an input-only operand could/should result in better code.  That is
also what the old code did.

> +			: "cc");
> +	}
> +
> +	if (!(m & ((1ULL << 63) | (1ULL << 31)))) {
> +		asm (	"umlal	%R0, %Q0, %R1, %Q2\n\t"
> +			"umlal	%R0, %Q0, %Q1, %R2\n\t"
> +			"mov	%R0, #0\n\t"
> +			"umlal	%Q0, %R0, %R1, %R2"
> +			: "+&r" (res)
> +			: "r" (m), "r" (n)
> +			: "cc");
> +	} else {
> +		asm (	"umlal	%R0, %Q0, %R2, %Q3\n\t"
> +			"umlal	%R0, %1, %Q2, %R3\n\t"
> +			"mov	%R0, #0\n\t"
> +			"adds	%Q0, %1, %Q0\n\t"
> +			"adc	%R0, %R0, #0\n\t"
> +			"umlal	%Q0, %R0, %R2, %R3"
> +			: "+&r" (res), "+&r" (tmp)
> +			: "r" (m), "r" (n)
> +			: "cc");
> +	}
> +
> +	return res;
> +}

-- 
Måns Rullgård
mans@mansr.com

^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: [PATCH 5/5] ARM: asm/div64.h: adjust to generic codde
  2015-11-19 16:36   ` Måns Rullgård
@ 2015-11-19 16:42     ` Nicolas Pitre
  2015-11-19 16:44       ` Måns Rullgård
  0 siblings, 1 reply; 22+ messages in thread
From: Nicolas Pitre @ 2015-11-19 16:42 UTC (permalink / raw)
  To: Måns Rullgård
  Cc: Alexey Brodkin, Arnd Bergmann, rmk+kernel, linux-arch,
	linux-kernel

[-- Attachment #1: Type: text/plain, Size: 2244 bytes --]

On Thu, 19 Nov 2015, Måns Rullgård wrote:

> Nicolas Pitre <nicolas.pitre@linaro.org> writes:
> 
> > +static inline uint64_t __arch_xprod_64(uint64_t m, uint64_t n, bool bias)
> > +{
> > +	unsigned long long res;
> > +	unsigned int tmp = 0;
> > +
> > +	if (!bias) {
> > +		asm (	"umull	%Q0, %R0, %Q1, %Q2\n\t"
> > +			"mov	%Q0, #0"
> > +			: "=&r" (res)
> > +			: "r" (m), "r" (n)
> > +			: "cc");
> > +	} else if (!(m & ((1ULL << 63) | (1ULL << 31)))) {
> > +		res = m;
> > +		asm (	"umlal	%Q0, %R0, %Q1, %Q2\n\t"
> > +			"mov	%Q0, #0"
> > +			: "+&r" (res)
> > +			: "r" (m), "r" (n)
> > +			: "cc");
> > +	} else {
> > +		asm (	"umull	%Q0, %R0, %Q2, %Q3\n\t"
> > +			"cmn	%Q0, %Q2\n\t"
> > +			"adcs	%R0, %R0, %R2\n\t"
> > +			"adc	%Q0, %1, #0"
> > +			: "=&r" (res), "+&r" (tmp)
> > +			: "r" (m), "r" (n)
> 
> Why is tmp using a +r constraint here?  The register is not written, so
> using an input-only operand could/should result in better code.  That is
> also what the old code did.

No, it is worse. gcc allocates two registers because, somehow, it 
doesn't think that the first one still holds zero after the first usage.  
This way usage of only one temporary register is forced throughout, 
producing better code.

I meant to have this split out in a separate patch but messed it up 
somehow.



> 
> > +			: "cc");
> > +	}
> > +
> > +	if (!(m & ((1ULL << 63) | (1ULL << 31)))) {
> > +		asm (	"umlal	%R0, %Q0, %R1, %Q2\n\t"
> > +			"umlal	%R0, %Q0, %Q1, %R2\n\t"
> > +			"mov	%R0, #0\n\t"
> > +			"umlal	%Q0, %R0, %R1, %R2"
> > +			: "+&r" (res)
> > +			: "r" (m), "r" (n)
> > +			: "cc");
> > +	} else {
> > +		asm (	"umlal	%R0, %Q0, %R2, %Q3\n\t"
> > +			"umlal	%R0, %1, %Q2, %R3\n\t"
> > +			"mov	%R0, #0\n\t"
> > +			"adds	%Q0, %1, %Q0\n\t"
> > +			"adc	%R0, %R0, #0\n\t"
> > +			"umlal	%Q0, %R0, %R2, %R3"
> > +			: "+&r" (res), "+&r" (tmp)
> > +			: "r" (m), "r" (n)
> > +			: "cc");
> > +	}
> > +
> > +	return res;
> > +}
> 
> -- 
> Måns Rullgård
> mans@mansr.com
> --
> To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
> the body of a message to majordomo@vger.kernel.org
> More majordomo info at  http://vger.kernel.org/majordomo-info.html
> Please read the FAQ at  http://www.tux.org/lkml/
> 
> 

^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: [PATCH 5/5] ARM: asm/div64.h: adjust to generic codde
  2015-11-19 16:42     ` Nicolas Pitre
@ 2015-11-19 16:44       ` Måns Rullgård
  0 siblings, 0 replies; 22+ messages in thread
From: Måns Rullgård @ 2015-11-19 16:44 UTC (permalink / raw)
  To: Nicolas Pitre
  Cc: Alexey Brodkin, Arnd Bergmann, rmk+kernel, linux-arch,
	linux-kernel

Nicolas Pitre <nicolas.pitre@linaro.org> writes:

> On Thu, 19 Nov 2015, Måns Rullgård wrote:
>
>> Nicolas Pitre <nicolas.pitre@linaro.org> writes:
>> 
>> > +static inline uint64_t __arch_xprod_64(uint64_t m, uint64_t n, bool bias)
>> > +{
>> > +	unsigned long long res;
>> > +	unsigned int tmp = 0;
>> > +
>> > +	if (!bias) {
>> > +		asm (	"umull	%Q0, %R0, %Q1, %Q2\n\t"
>> > +			"mov	%Q0, #0"
>> > +			: "=&r" (res)
>> > +			: "r" (m), "r" (n)
>> > +			: "cc");
>> > +	} else if (!(m & ((1ULL << 63) | (1ULL << 31)))) {
>> > +		res = m;
>> > +		asm (	"umlal	%Q0, %R0, %Q1, %Q2\n\t"
>> > +			"mov	%Q0, #0"
>> > +			: "+&r" (res)
>> > +			: "r" (m), "r" (n)
>> > +			: "cc");
>> > +	} else {
>> > +		asm (	"umull	%Q0, %R0, %Q2, %Q3\n\t"
>> > +			"cmn	%Q0, %Q2\n\t"
>> > +			"adcs	%R0, %R0, %R2\n\t"
>> > +			"adc	%Q0, %1, #0"
>> > +			: "=&r" (res), "+&r" (tmp)
>> > +			: "r" (m), "r" (n)
>> 
>> Why is tmp using a +r constraint here?  The register is not written, so
>> using an input-only operand could/should result in better code.  That is
>> also what the old code did.
>
> No, it is worse. gcc allocates two registers because, somehow, it 
> doesn't think that the first one still holds zero after the first usage.  
> This way usage of only one temporary register is forced throughout, 
> producing better code.

Makes sense.  Thanks for explaining.

-- 
Måns Rullgård
mans@mansr.com

^ permalink raw reply	[flat|nested] 22+ messages in thread

end of thread, other threads:[~2015-11-19 16:44 UTC | newest]

Thread overview: 22+ messages (download: mbox.gz follow: Atom feed
-- links below jump to the message on this page --
2015-11-02 22:33 [PATCH 0/5] 64-by-32 ddivision optimization for constant divisors on 32-bit machines Nicolas Pitre
2015-11-02 22:33 ` Nicolas Pitre
2015-11-02 22:33 ` [PATCH 1/5] div64.h: optimize do_div() for power-of-two constant divisors Nicolas Pitre
2015-11-02 22:33   ` Nicolas Pitre
2015-11-02 22:33 ` [PATCH 2/5] do_div(): generic optimization for constant divisor on 32-bit machines Nicolas Pitre
2015-11-03  5:32   ` kbuild test robot
2015-11-03  9:15     ` Arnd Bergmann
2015-11-04 21:04       ` Nicolas Pitre
2015-11-04 21:42         ` Måns Rullgård
2015-11-04 21:42           ` Måns Rullgård
2015-11-02 22:33 ` [PATCH 3/5] __div64_const32(): abstract out the actual 128-bit cross product code Nicolas Pitre
2015-11-02 22:33   ` Nicolas Pitre
2015-11-02 22:33 ` [PATCH 4/5] __div64_32(): make it overridable at compile time Nicolas Pitre
2015-11-02 22:33   ` Nicolas Pitre
2015-11-02 22:33 ` [PATCH 5/5] ARM: asm/div64.h: adjust to generic codde Nicolas Pitre
2015-11-02 22:33   ` Nicolas Pitre
2015-11-03  1:25   ` kbuild test robot
2015-11-03  4:03     ` Nicolas Pitre
2015-11-03 21:39   ` kbuild test robot
2015-11-19 16:36   ` Måns Rullgård
2015-11-19 16:42     ` Nicolas Pitre
2015-11-19 16:44       ` Måns Rullgård

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox