linux-arch.vger.kernel.org archive mirror
 help / color / mirror / Atom feed
* [PATCH 1/3] X86: Optimise fls(), ffs() and fls64()
@ 2010-01-13 19:39 David Howells
  2010-01-13 19:39 ` [PATCH 2/3] Adjust the comment on get_order() to describe the size==0 case David Howells
                   ` (3 more replies)
  0 siblings, 4 replies; 24+ messages in thread
From: David Howells @ 2010-01-13 19:39 UTC (permalink / raw)
  To: linux-arch; +Cc: dhowells

fls(N), ffs(N) and fls64(N) can be optimised on x86/x86_64.  Currently they
perform checks against N being 0 before invoking the BSR/BSF instruction, or
use a CMOV instruction afterwards.  Either the check involves a conditional
jump which we'd like to avoid, or a CMOV, which we'd also quite like to avoid.

Instead, we can make use of the fact that BSR/BSF doesn't modify its output
register if its input is 0.  By preloading the output with -1 and incrementing
the result, we achieve the desired result without the need for a conditional
check.

The 32-bit version of fls64() can also have all its conditional checks removed
by chaining BSRs with a subtraction in the middle.  However, this may be
suboptimal on CPUs for which BSR/BSF is really slow (i486 and below for
example).

Signed-off-by: David Howells <dhowells@redhat.com>
---

 arch/x86/include/asm/bitops.h      |   36 ++++++++++++------------------------
 include/asm-generic/bitops/fls64.h |   21 +++++++++++++++------
 2 files changed, 27 insertions(+), 30 deletions(-)

diff --git a/arch/x86/include/asm/bitops.h b/arch/x86/include/asm/bitops.h
index 02b47a6..112d623 100644
--- a/arch/x86/include/asm/bitops.h
+++ b/arch/x86/include/asm/bitops.h
@@ -394,18 +394,12 @@ static inline unsigned long __fls(unsigned long word)
  */
 static inline int ffs(int x)
 {
-	int r;
-#ifdef CONFIG_X86_CMOV
-	asm("bsfl %1,%0\n\t"
-	    "cmovzl %2,%0"
-	    : "=r" (r) : "rm" (x), "r" (-1));
-#else
-	asm("bsfl %1,%0\n\t"
-	    "jnz 1f\n\t"
-	    "movl $-1,%0\n"
-	    "1:" : "=r" (r) : "rm" (x));
-#endif
-	return r + 1;
+	int bitpos = -1;
+
+	asm("bsfl %1,%0"
+	    : "+r" (bitpos)
+	    : "rm" (x));
+	return bitpos + 1;
 }
 
 /**
@@ -421,18 +415,12 @@ static inline int ffs(int x)
  */
 static inline int fls(int x)
 {
-	int r;
-#ifdef CONFIG_X86_CMOV
-	asm("bsrl %1,%0\n\t"
-	    "cmovzl %2,%0"
-	    : "=&r" (r) : "rm" (x), "rm" (-1));
-#else
-	asm("bsrl %1,%0\n\t"
-	    "jnz 1f\n\t"
-	    "movl $-1,%0\n"
-	    "1:" : "=r" (r) : "rm" (x));
-#endif
-	return r + 1;
+	int bitpos = -1;
+
+	asm("bsrl %1,%0"
+	    : "+r" (bitpos)
+	    : "rm" (x));
+	return bitpos + 1;
 }
 #endif /* __KERNEL__ */
 
diff --git a/include/asm-generic/bitops/fls64.h b/include/asm-generic/bitops/fls64.h
index b097cf8..f81e04c 100644
--- a/include/asm-generic/bitops/fls64.h
+++ b/include/asm-generic/bitops/fls64.h
@@ -18,16 +18,25 @@
 static __always_inline int fls64(__u64 x)
 {
 	__u32 h = x >> 32;
-	if (h)
-		return fls(h) + 32;
-	return fls(x);
+	int bitpos = -1;
+
+	asm("bsrl	%1,%0	\n"
+	    "subl	%2,%0	\n"
+	    "bsrl	%3,%0	\n"
+	    : "+r" (bitpos)
+	    : "rm" (x), "i"(32), "rm" (h));
+
+	return bitpos + 33;
 }
 #elif BITS_PER_LONG == 64
 static __always_inline int fls64(__u64 x)
 {
-	if (x == 0)
-		return 0;
-	return __fls(x) + 1;
+	long bitpos = -1;
+
+	asm("bsrq %1,%0"
+	    : "+r" (bitpos)
+	    : "rm" (x));
+	return bitpos + 1;
 }
 #else
 #error BITS_PER_LONG not 32 or 64

^ permalink raw reply related	[flat|nested] 24+ messages in thread

* [PATCH 2/3] Adjust the comment on get_order() to describe the size==0 case
  2010-01-13 19:39 [PATCH 1/3] X86: Optimise fls(), ffs() and fls64() David Howells
@ 2010-01-13 19:39 ` David Howells
  2010-01-13 19:39 ` [PATCH 3/3] Optimise get_order() David Howells
                   ` (2 subsequent siblings)
  3 siblings, 0 replies; 24+ messages in thread
From: David Howells @ 2010-01-13 19:39 UTC (permalink / raw)
  To: linux-arch; +Cc: dhowells

Adjust the comment on get_order() to note that the result of passing a size of
0 results in an undefined value.

Signed-off-by: David Howells <dhowells@redhat.com>
---

 include/asm-generic/getorder.h |   23 ++++++++++++++++++++++-
 1 files changed, 22 insertions(+), 1 deletions(-)

diff --git a/include/asm-generic/getorder.h b/include/asm-generic/getorder.h
index 67e7245..76e9687 100644
--- a/include/asm-generic/getorder.h
+++ b/include/asm-generic/getorder.h
@@ -5,7 +5,28 @@
 
 #include <linux/compiler.h>
 
-/* Pure 2^n version of get_order */
+/**
+ * get_order - Determine the allocation order of a memory size
+ * @size: The size for which to get the order
+ *
+ * Determine the allocation order of a particular sized block of memory.  This
+ * is on a logarithmic scale, where:
+ *
+ *	0 -> 2^0 * PAGE_SIZE and below
+ *	1 -> 2^1 * PAGE_SIZE to 2^0 * PAGE_SIZE + 1
+ *	2 -> 2^2 * PAGE_SIZE to 2^1 * PAGE_SIZE + 1
+ *	3 -> 2^3 * PAGE_SIZE to 2^2 * PAGE_SIZE + 1
+ *	4 -> 2^4 * PAGE_SIZE to 2^3 * PAGE_SIZE + 1
+ *	...
+ *
+ * The order returned is used to find the smallest allocation granule required
+ * to hold an object of the specified size.
+ *
+ * The result is undefined if the size is 0.
+ *
+ * This function may be used to initialise variables with compile time
+ * evaluations of constants.
+ */
 static inline __attribute_const__ int get_order(unsigned long size)
 {
 	int order;

^ permalink raw reply related	[flat|nested] 24+ messages in thread

* [PATCH 3/3] Optimise get_order()
  2010-01-13 19:39 [PATCH 1/3] X86: Optimise fls(), ffs() and fls64() David Howells
  2010-01-13 19:39 ` [PATCH 2/3] Adjust the comment on get_order() to describe the size==0 case David Howells
@ 2010-01-13 19:39 ` David Howells
  2010-01-13 20:15 ` [PATCH 1/3] X86: Optimise fls(), ffs() and fls64() Geert Uytterhoeven
  2010-01-13 21:59 ` David Howells
  3 siblings, 0 replies; 24+ messages in thread
From: David Howells @ 2010-01-13 19:39 UTC (permalink / raw)
  To: linux-arch; +Cc: dhowells

Optimise get_order() to use bit scanning instructions if such exist rather than
a loop.  Also, make it possible to use get_order() in static initialisations
too by building it on top of ilog2() in the constant parameter case.

This has been tested for i386 and x86_64 using the following userspace program,
and for FRV by making appropriate substitutions for fls() and fls64().  It will
abort if the case for get_order() deviates from the original except for the
order of 0, for which get_order() produces an undefined result.  This program
tests both dynamic and static parameters.

	#include <stdlib.h>
	#include <stdio.h>

	#ifdef __x86_64__
	#define BITS_PER_LONG 64
	#else
	#define BITS_PER_LONG 32
	#endif

	#define PAGE_SHIFT 12

	typedef unsigned long long __u64, u64;
	typedef unsigned int __u32, u32;
	#define noinline	__attribute__((noinline))

	static inline int fls(int x)
	{
		int bitpos = -1;

		asm("bsrl %1,%0"
		    : "+r" (bitpos)
		    : "rm" (x));
		return bitpos + 1;
	}

	static __always_inline int fls64(__u64 x)
	{
	#if BITS_PER_LONG == 64
		long bitpos = -1;

		asm("bsrq %1,%0"
		    : "+r" (bitpos)
		    : "rm" (x));
		return bitpos + 1;
	#else
		__u32 h = x >> 32, l = x;
		int bitpos = -1;

		asm("bsrl	%1,%0	\n"
		    "subl	%2,%0	\n"
		    "bsrl	%3,%0	\n"
		    : "+r" (bitpos)
		    : "rm" (l), "i"(32), "rm" (h));

		return bitpos + 33;
	#endif
	}

	static inline __attribute__((const))
	int __ilog2_u32(u32 n)
	{
		return fls(n) - 1;
	}

	static inline __attribute__((const))
	int __ilog2_u64(u64 n)
	{
		return fls64(n) - 1;
	}

	extern __attribute__((const, noreturn))
	int ____ilog2_NaN(void);

	#define ilog2(n)				\
	(						\
		__builtin_constant_p(n) ? (		\
			(n) < 1 ? ____ilog2_NaN() :	\
			(n) & (1ULL << 63) ? 63 :	\
			(n) & (1ULL << 62) ? 62 :	\
			(n) & (1ULL << 61) ? 61 :	\
			(n) & (1ULL << 60) ? 60 :	\
			(n) & (1ULL << 59) ? 59 :	\
			(n) & (1ULL << 58) ? 58 :	\
			(n) & (1ULL << 57) ? 57 :	\
			(n) & (1ULL << 56) ? 56 :	\
			(n) & (1ULL << 55) ? 55 :	\
			(n) & (1ULL << 54) ? 54 :	\
			(n) & (1ULL << 53) ? 53 :	\
			(n) & (1ULL << 52) ? 52 :	\
			(n) & (1ULL << 51) ? 51 :	\
			(n) & (1ULL << 50) ? 50 :	\
			(n) & (1ULL << 49) ? 49 :	\
			(n) & (1ULL << 48) ? 48 :	\
			(n) & (1ULL << 47) ? 47 :	\
			(n) & (1ULL << 46) ? 46 :	\
			(n) & (1ULL << 45) ? 45 :	\
			(n) & (1ULL << 44) ? 44 :	\
			(n) & (1ULL << 43) ? 43 :	\
			(n) & (1ULL << 42) ? 42 :	\
			(n) & (1ULL << 41) ? 41 :	\
			(n) & (1ULL << 40) ? 40 :	\
			(n) & (1ULL << 39) ? 39 :	\
			(n) & (1ULL << 38) ? 38 :	\
			(n) & (1ULL << 37) ? 37 :	\
			(n) & (1ULL << 36) ? 36 :	\
			(n) & (1ULL << 35) ? 35 :	\
			(n) & (1ULL << 34) ? 34 :	\
			(n) & (1ULL << 33) ? 33 :	\
			(n) & (1ULL << 32) ? 32 :	\
			(n) & (1ULL << 31) ? 31 :	\
			(n) & (1ULL << 30) ? 30 :	\
			(n) & (1ULL << 29) ? 29 :	\
			(n) & (1ULL << 28) ? 28 :	\
			(n) & (1ULL << 27) ? 27 :	\
			(n) & (1ULL << 26) ? 26 :	\
			(n) & (1ULL << 25) ? 25 :	\
			(n) & (1ULL << 24) ? 24 :	\
			(n) & (1ULL << 23) ? 23 :	\
			(n) & (1ULL << 22) ? 22 :	\
			(n) & (1ULL << 21) ? 21 :	\
			(n) & (1ULL << 20) ? 20 :	\
			(n) & (1ULL << 19) ? 19 :	\
			(n) & (1ULL << 18) ? 18 :	\
			(n) & (1ULL << 17) ? 17 :	\
			(n) & (1ULL << 16) ? 16 :	\
			(n) & (1ULL << 15) ? 15 :	\
			(n) & (1ULL << 14) ? 14 :	\
			(n) & (1ULL << 13) ? 13 :	\
			(n) & (1ULL << 12) ? 12 :	\
			(n) & (1ULL << 11) ? 11 :	\
			(n) & (1ULL << 10) ? 10 :	\
			(n) & (1ULL <<  9) ?  9 :	\
			(n) & (1ULL <<  8) ?  8 :	\
			(n) & (1ULL <<  7) ?  7 :	\
			(n) & (1ULL <<  6) ?  6 :	\
			(n) & (1ULL <<  5) ?  5 :	\
			(n) & (1ULL <<  4) ?  4 :	\
			(n) & (1ULL <<  3) ?  3 :	\
			(n) & (1ULL <<  2) ?  2 :	\
			(n) & (1ULL <<  1) ?  1 :	\
			(n) & (1ULL <<  0) ?  0 :	\
			____ilog2_NaN()			\
					   ) :		\
		(sizeof(n) <= 4) ?			\
		__ilog2_u32(n) :			\
		__ilog2_u64(n)				\
	 )

	static noinline __attribute__((const))
	int old_get_order(unsigned long size)
	{
		int order;

		size = (size - 1) >> (PAGE_SHIFT - 1);
		order = -1;
		do {
			size >>= 1;
			order++;
		} while (size);
		return order;
	}

	static noinline __attribute__((const))
	int __get_order(unsigned long size)
	{
		int order;
		size--;
		size >>= PAGE_SHIFT;
	#if BITS_PER_LONG == 32
		order = fls(size);
	#else
		order = fls64(size);
	#endif
		return order;
	}

	#define get_order(n)						\
	(								\
		__builtin_constant_p(n) ? (				\
			(n == 0UL) ? BITS_PER_LONG - PAGE_SHIFT :	\
			((n < (1UL << PAGE_SHIFT)) ? 0 :		\
			 ilog2((n) - 1) - PAGE_SHIFT + 1)		\
		) :							\
		__get_order(n)						\
	)

	#define order(N) \
		{ (1UL << N) - 1,	get_order((1UL << N) - 1)	},	\
		{ (1UL << N),		get_order((1UL << N))		},	\
		{ (1UL << N) + 1,	get_order((1UL << N) + 1)	}

	struct order {
		unsigned long n, order;
	};

	static const struct order order_table[] = {
		order(0),
		order(1),
		order(2),
		order(3),
		order(4),
		order(5),
		order(6),
		order(7),
		order(8),
		order(9),
		order(10),
		order(11),
		order(12),
		order(13),
		order(14),
		order(15),
		order(16),
		order(17),
		order(18),
		order(19),
		order(20),
		order(21),
		order(22),
		order(23),
		order(24),
		order(25),
		order(26),
		order(27),
		order(28),
		order(29),
		order(30),
		order(31),
	#if BITS_PER_LONG == 64
		order(32),
		order(33),
		order(34),
		order(35),
	#endif
		{ 0x2929 }
	};

	void check(int loop, unsigned long n)
	{
		unsigned long old, new;

		printf("[%2d]: %09lx | ", loop, n);

		old = old_get_order(n);
		new = get_order(n);

		printf("%3ld, %3ld\n", old, new);
		if (n != 0 && old != new)
			abort();
	}

	int main(int argc, char **argv)
	{
		const struct order *p;
		unsigned long n;
		int loop;

		for (loop = 0; loop <= BITS_PER_LONG - 1; loop++) {
			n = 1UL << loop;
			check(loop, n - 1);
			check(loop, n);
			check(loop, n + 1);
		}

		for (p = order_table; p->n != 0x2929; p++) {
			unsigned long old, new;

			old = old_get_order(p->n);
			new = p->order;
			printf("%09lx\t%3ld, %3ld\n", p->n, old, new);
			if (p->n != 0 && old != new)
				abort();
		}

		return 0;
	}

Disassembling the x86_64 version of the above code shows:

	0000000000400510 <old_get_order>:
	  400510:       48 83 ef 01             sub    $0x1,%rdi
	  400514:       b8 ff ff ff ff          mov    $0xffffffff,%eax
	  400519:       48 c1 ef 0b             shr    $0xb,%rdi
	  40051d:       0f 1f 00                nopl   (%rax)
	  400520:       83 c0 01                add    $0x1,%eax
	  400523:       48 d1 ef                shr    %rdi
	  400526:       75 f8                   jne    400520 <old_get_order+0x10>
	  400528:       f3 c3                   repz retq
	  40052a:       66 0f 1f 44 00 00       nopw   0x0(%rax,%rax,1)

	0000000000400530 <__get_order>:
	  400530:       48 83 ef 01             sub    $0x1,%rdi
	  400534:       48 c7 c0 ff ff ff ff    mov    $0xffffffffffffffff,%rax
	  40053b:       48 c1 ef 0c             shr    $0xc,%rdi
	  40053f:       48 0f bd c7             bsr    %rdi,%rax
	  400543:       83 c0 01                add    $0x1,%eax
	  400546:       c3                      retq
	  400547:       66 0f 1f 84 00 00 00    nopw   0x0(%rax,%rax,1)
	  40054e:       00 00

As can be seen, the new __get_order() function is simpler than the
old_get_order() function.

Signed-off-by: David Howells <dhowells@redhat.com>
---

 include/asm-generic/getorder.h |   40 ++++++++++++++++++++++++++++------------
 1 files changed, 28 insertions(+), 12 deletions(-)

diff --git a/include/asm-generic/getorder.h b/include/asm-generic/getorder.h
index 76e9687..e0fb4bf 100644
--- a/include/asm-generic/getorder.h
+++ b/include/asm-generic/getorder.h
@@ -4,6 +4,25 @@
 #ifndef __ASSEMBLY__
 
 #include <linux/compiler.h>
+#include <linux/log2.h>
+
+/*
+ * Runtime evaluation of get_order()
+ */
+static inline __attribute_const__
+int __get_order(unsigned long size)
+{
+	int order;
+
+	size--;
+	size >>= PAGE_SHIFT;
+#if BITS_PER_LONG == 32
+	order = fls(size);
+#else
+	order = fls64(size);
+#endif
+	return order;
+}
 
 /**
  * get_order - Determine the allocation order of a memory size
@@ -27,18 +46,15 @@
  * This function may be used to initialise variables with compile time
  * evaluations of constants.
  */
-static inline __attribute_const__ int get_order(unsigned long size)
-{
-	int order;
-
-	size = (size - 1) >> (PAGE_SHIFT - 1);
-	order = -1;
-	do {
-		size >>= 1;
-		order++;
-	} while (size);
-	return order;
-}
+#define get_order(n)						\
+(								\
+	__builtin_constant_p(n) ? (				\
+		(n == 0UL) ? BITS_PER_LONG - PAGE_SHIFT :	\
+		((n < (1UL << PAGE_SHIFT)) ? 0 :		\
+		 ilog2((n) - 1) - PAGE_SHIFT + 1)		\
+	) :							\
+	__get_order(n)						\
+)
 
 #endif	/* __ASSEMBLY__ */
 

^ permalink raw reply related	[flat|nested] 24+ messages in thread

* Re: [PATCH 1/3] X86: Optimise fls(), ffs() and fls64()
  2010-01-13 19:39 [PATCH 1/3] X86: Optimise fls(), ffs() and fls64() David Howells
  2010-01-13 19:39 ` [PATCH 2/3] Adjust the comment on get_order() to describe the size==0 case David Howells
  2010-01-13 19:39 ` [PATCH 3/3] Optimise get_order() David Howells
@ 2010-01-13 20:15 ` Geert Uytterhoeven
  2010-01-13 21:59 ` David Howells
  3 siblings, 0 replies; 24+ messages in thread
From: Geert Uytterhoeven @ 2010-01-13 20:15 UTC (permalink / raw)
  To: David Howells; +Cc: linux-arch

On Wed, Jan 13, 2010 at 20:39, David Howells <dhowells@redhat.com> wrote:
> diff --git a/include/asm-generic/bitops/fls64.h b/include/asm-generic/bitops/fls64.h
> index b097cf8..f81e04c 100644
> --- a/include/asm-generic/bitops/fls64.h
> +++ b/include/asm-generic/bitops/fls64.h
> @@ -18,16 +18,25 @@
>  static __always_inline int fls64(__u64 x)
>  {
>        __u32 h = x >> 32;
> -       if (h)
> -               return fls(h) + 32;
> -       return fls(x);
> +       int bitpos = -1;
> +
> +       asm("bsrl       %1,%0   \n"
> +           "subl       %2,%0   \n"
> +           "bsrl       %3,%0   \n"
> +           : "+r" (bitpos)
> +           : "rm" (x), "i"(32), "rm" (h));
> +
> +       return bitpos + 33;
>  }
>  #elif BITS_PER_LONG == 64
>  static __always_inline int fls64(__u64 x)
>  {
> -       if (x == 0)
> -               return 0;
> -       return __fls(x) + 1;
> +       long bitpos = -1;
> +
> +       asm("bsrq %1,%0"
> +           : "+r" (bitpos)
> +           : "rm" (x));
> +       return bitpos + 1;
>  }
>  #else
>  #error BITS_PER_LONG not 32 or 64

Why do you put x86 assembler in asm-generic? Those files are supposed
to be platform-agnostic.

Gr{oetje,eeting}s,

						Geert

--
Geert Uytterhoeven -- There's lots of Linux beyond ia32 -- geert@linux-m68k.org

In personal conversations with technical people, I call myself a hacker. But
when I'm talking to journalists I just say "programmer" or something like that.
							    -- Linus Torvalds

^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: [PATCH 1/3] X86: Optimise fls(), ffs() and fls64()
  2010-01-13 19:39 [PATCH 1/3] X86: Optimise fls(), ffs() and fls64() David Howells
                   ` (2 preceding siblings ...)
  2010-01-13 20:15 ` [PATCH 1/3] X86: Optimise fls(), ffs() and fls64() Geert Uytterhoeven
@ 2010-01-13 21:59 ` David Howells
  3 siblings, 0 replies; 24+ messages in thread
From: David Howells @ 2010-01-13 21:59 UTC (permalink / raw)
  To: Geert Uytterhoeven; +Cc: dhowells, linux-arch

Geert Uytterhoeven <geert@linux-m68k.org> wrote:

> Why do you put x86 assembler in asm-generic? Those files are supposed
> to be platform-agnostic.

Sorry about that.  I assumed that was an x86_64-specific header.  One of the
downside of going places with etags.

David

^ permalink raw reply	[flat|nested] 24+ messages in thread

* [PATCH 1/3] X86: Optimise fls(), ffs() and fls64()
@ 2010-03-26 14:42 ` David Howells
  2010-03-26 17:23   ` Linus Torvalds
                     ` (2 more replies)
  0 siblings, 3 replies; 24+ messages in thread
From: David Howells @ 2010-03-26 14:42 UTC (permalink / raw)
  To: torvalds, mingo, tglx; +Cc: linux-arch, linux-kernel, David Howells

fls(N), ffs(N) and fls64(N) can be optimised on x86/x86_64.  Currently they
perform checks against N being 0 before invoking the BSR/BSF instruction, or
use a CMOV instruction afterwards.  Either the check involves a conditional
jump which we'd like to avoid, or a CMOV, which we'd also quite like to avoid.

Instead, we can make use of the fact that BSR/BSF doesn't modify its output
register if its input is 0.  By preloading the output with -1 and incrementing
the result, we achieve the desired result without the need for a conditional
check.

The 32-bit version of fls64() can also have all its conditional checks removed
by chaining BSRs with a subtraction in the middle.  However, this may be
suboptimal on CPUs for which BSR/BSF is really slow (i486 and below for
example).

Signed-off-by: David Howells <dhowells@redhat.com>
---

 arch/x86/include/asm/bitops.h |   79 ++++++++++++++++++++++++++---------------
 1 files changed, 50 insertions(+), 29 deletions(-)

diff --git a/arch/x86/include/asm/bitops.h b/arch/x86/include/asm/bitops.h
index 02b47a6..a8e85ab 100644
--- a/arch/x86/include/asm/bitops.h
+++ b/arch/x86/include/asm/bitops.h
@@ -394,18 +394,12 @@ static inline unsigned long __fls(unsigned long word)
  */
 static inline int ffs(int x)
 {
-	int r;
-#ifdef CONFIG_X86_CMOV
-	asm("bsfl %1,%0\n\t"
-	    "cmovzl %2,%0"
-	    : "=r" (r) : "rm" (x), "r" (-1));
-#else
-	asm("bsfl %1,%0\n\t"
-	    "jnz 1f\n\t"
-	    "movl $-1,%0\n"
-	    "1:" : "=r" (r) : "rm" (x));
-#endif
-	return r + 1;
+	int bitpos = -1;
+
+	asm("bsfl %1,%0"
+	    : "+r" (bitpos)
+	    : "rm" (x));
+	return bitpos + 1;
 }
 
 /**
@@ -421,19 +415,52 @@ static inline int ffs(int x)
  */
 static inline int fls(int x)
 {
-	int r;
-#ifdef CONFIG_X86_CMOV
-	asm("bsrl %1,%0\n\t"
-	    "cmovzl %2,%0"
-	    : "=&r" (r) : "rm" (x), "rm" (-1));
+	int bitpos = -1;
+
+	asm("bsrl %1,%0"
+	    : "+r" (bitpos)
+	    : "rm" (x));
+	return bitpos + 1;
+}
+
+/**
+ * fls64 - find last set bit in a 64-bit word
+ * @x: the word to search
+ *
+ * This is defined in a similar way as the libc and compiler builtin
+ * ffsll, but returns the position of the most significant set bit.
+ *
+ * fls64(value) returns 0 if value is 0 or the position of the last
+ * set bit if value is nonzero. The last (most significant) bit is
+ * at position 64.
+ */
+#if BITS_PER_LONG == 32
+static __always_inline int fls64(__u64 x)
+{
+	__u32 h = x >> 32;
+	int bitpos = -1;
+
+	asm("bsrl	%1,%0	\n"
+	    "subl	%2,%0	\n"
+	    "bsrl	%3,%0	\n"
+	    : "+r" (bitpos)
+	    : "rm" (x), "i"(32), "rm" (h));
+
+	return bitpos + 33;
+}
+#elif BITS_PER_LONG == 64
+static __always_inline int fls64(__u64 x)
+{
+	long bitpos = -1;
+
+	asm("bsrq %1,%0"
+	    : "+r" (bitpos)
+	    : "rm" (x));
+	return bitpos + 1;
+}
 #else
-	asm("bsrl %1,%0\n\t"
-	    "jnz 1f\n\t"
-	    "movl $-1,%0\n"
-	    "1:" : "=r" (r) : "rm" (x));
+#error BITS_PER_LONG not 32 or 64
 #endif
-	return r + 1;
-}
 #endif /* __KERNEL__ */
 
 #undef ADDR
@@ -446,12 +473,6 @@ static inline int fls(int x)
 
 #include <asm-generic/bitops/hweight.h>
 
-#endif /* __KERNEL__ */
-
-#include <asm-generic/bitops/fls64.h>
-
-#ifdef __KERNEL__
-
 #include <asm-generic/bitops/ext2-non-atomic.h>
 
 #define ext2_set_bit_atomic(lock, nr, addr)			\

^ permalink raw reply related	[flat|nested] 24+ messages in thread

* Re: [PATCH 1/3] X86: Optimise fls(), ffs() and fls64()
  2010-03-26 14:42 ` David Howells
@ 2010-03-26 17:23   ` Linus Torvalds
  2010-03-26 17:37     ` Scott Lurndal
  2010-03-26 17:42   ` David Howells
  2010-04-14 13:13   ` David Howells
  2 siblings, 1 reply; 24+ messages in thread
From: Linus Torvalds @ 2010-03-26 17:23 UTC (permalink / raw)
  To: David Howells; +Cc: mingo, tglx, linux-arch, linux-kernel



On Fri, 26 Mar 2010, David Howells wrote:
>
> fls(N), ffs(N) and fls64(N) can be optimised on x86/x86_64.  Currently they
> perform checks against N being 0 before invoking the BSR/BSF instruction, or
> use a CMOV instruction afterwards.  Either the check involves a conditional
> jump which we'd like to avoid, or a CMOV, which we'd also quite like to avoid.
> 
> Instead, we can make use of the fact that BSR/BSF doesn't modify its output
> register if its input is 0.  By preloading the output with -1 and incrementing
> the result, we achieve the desired result without the need for a conditional
> check.

This is totally incorrect.

Where did you find that "doesn't modify its output" thing? It's not true. 
The truth is that the destination is undefined. Just read the dang Intel 
documentation, it's very clearly stated right there.

If you can show otherwise, feel free. But I'm pretty sure there are 
actually x86 chips out there that _do_ modify the destination. I have a 
pretty strong memory of us trying this at some point, and it not working.

		Linus

^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: [PATCH 1/3] X86: Optimise fls(), ffs() and fls64()
  2010-03-26 17:23   ` Linus Torvalds
@ 2010-03-26 17:37     ` Scott Lurndal
  2010-03-26 17:42       ` Linus Torvalds
  0 siblings, 1 reply; 24+ messages in thread
From: Scott Lurndal @ 2010-03-26 17:37 UTC (permalink / raw)
  To: Linus Torvalds; +Cc: David Howells, mingo, tglx, linux-arch, linux-kernel

On Fri, Mar 26, 2010 at 10:23:46AM -0700, Linus Torvalds wrote:
> 
> 
> On Fri, 26 Mar 2010, David Howells wrote:
> >
> > fls(N), ffs(N) and fls64(N) can be optimised on x86/x86_64.  Currently they
> > perform checks against N being 0 before invoking the BSR/BSF instruction, or
> > use a CMOV instruction afterwards.  Either the check involves a conditional
> > jump which we'd like to avoid, or a CMOV, which we'd also quite like to avoid.
> > 
> > Instead, we can make use of the fact that BSR/BSF doesn't modify its output
> > register if its input is 0.  By preloading the output with -1 and incrementing
> > the result, we achieve the desired result without the need for a conditional
> > check.
> 
> This is totally incorrect.
> 
> Where did you find that "doesn't modify its output" thing? It's not true. 
> The truth is that the destination is undefined. Just read the dang Intel 
> documentation, it's very clearly stated right there.

While this is true for the current (253666-031US) Intel documentation,
the AMD documentation (rev 3.14) for the same instruction states that the
destination register is unchanged (as opposed to Intel's undefined).

I wonder if Intel's EM64 stuff makes this more deterministic, perhaps
David's implementation would work for x86_64 only?

scott

^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: [PATCH 1/3] X86: Optimise fls(), ffs() and fls64()
  2010-03-26 14:42 ` David Howells
  2010-03-26 17:23   ` Linus Torvalds
@ 2010-03-26 17:42   ` David Howells
  2010-03-26 17:45     ` Linus Torvalds
  2010-03-26 17:52     ` Matthew Wilcox
  2010-04-14 13:13   ` David Howells
  2 siblings, 2 replies; 24+ messages in thread
From: David Howells @ 2010-03-26 17:42 UTC (permalink / raw)
  To: Linus Torvalds; +Cc: dhowells, mingo, tglx, linux-arch, linux-kernel

Linus Torvalds <torvalds@linux-foundation.org> wrote:

> Where did you find that "doesn't modify its output" thing? It's not true. 
> The truth is that the destination is undefined. Just read the dang Intel 
> documentation, it's very clearly stated right there.

Hmmm...  My ancient Borland Assembler dead-tree manual doesn't mention that.

David

^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: [PATCH 1/3] X86: Optimise fls(), ffs() and fls64()
  2010-03-26 17:37     ` Scott Lurndal
@ 2010-03-26 17:42       ` Linus Torvalds
  2010-04-06 13:57         ` Jamie Lokier
  0 siblings, 1 reply; 24+ messages in thread
From: Linus Torvalds @ 2010-03-26 17:42 UTC (permalink / raw)
  To: Scott Lurndal; +Cc: David Howells, mingo, tglx, linux-arch, linux-kernel



On Fri, 26 Mar 2010, Scott Lurndal wrote:
> 
> I wonder if Intel's EM64 stuff makes this more deterministic, perhaps
> David's implementation would work for x86_64 only?

Limiting it to x86-64 would certainly remove all the worries about all the 
historical x86 clones.

I'd still worry about it for future Intel chips, though. I absolutely 
_detest_ relying on undocumented features - it pretty much always ends up 
biting you eventually. And conditional writeback is actually pretty nasty 
from a microarchitectural standpoint.

			Linus

^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: [PATCH 1/3] X86: Optimise fls(), ffs() and fls64()
  2010-03-26 17:42   ` David Howells
@ 2010-03-26 17:45     ` Linus Torvalds
  2010-03-26 17:58       ` Ralf Baechle
  2010-03-26 17:52     ` Matthew Wilcox
  1 sibling, 1 reply; 24+ messages in thread
From: Linus Torvalds @ 2010-03-26 17:45 UTC (permalink / raw)
  To: David Howells; +Cc: mingo, tglx, linux-arch, linux-kernel



On Fri, 26 Mar 2010, David Howells wrote:
> 
> Hmmm...  My ancient Borland Assembler dead-tree manual doesn't mention that.

I went back and checked the old Intel 386 docs from -92 or something, and 
it was "undefined" in there too. So at least Intel seems to have been very 
consistent on this.

That said, maybe all implementations actually do the "don't touch" thing. 

But I do have this memory of us doing this ten+ years ago, though, and 
having to check the ZF after all. Which is why I reacted to the patch in 
the first place and checked the documentation.

			Linus

^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: [PATCH 1/3] X86: Optimise fls(), ffs() and fls64()
  2010-03-26 17:42   ` David Howells
  2010-03-26 17:45     ` Linus Torvalds
@ 2010-03-26 17:52     ` Matthew Wilcox
  1 sibling, 0 replies; 24+ messages in thread
From: Matthew Wilcox @ 2010-03-26 17:52 UTC (permalink / raw)
  To: David Howells; +Cc: Linus Torvalds, mingo, tglx, linux-arch, linux-kernel

On Fri, Mar 26, 2010 at 05:42:05PM +0000, David Howells wrote:
> Linus Torvalds <torvalds@linux-foundation.org> wrote:
> 
> > Where did you find that "doesn't modify its output" thing? It's not true. 
> > The truth is that the destination is undefined. Just read the dang Intel 
> > documentation, it's very clearly stated right there.
> 
> Hmmm...  My ancient Borland Assembler dead-tree manual doesn't mention that.

The only x86 manuals I have are encased in the Itanium SDM from 2002,
and they agree with Linus that BSF and BSR undefine the contents of the
destination operand if the source operand contains zero.

-- 
Matthew Wilcox				Intel Open Source Technology Centre
"Bill, look, we understand that you're interested in selling us this
operating system, but compare it to ours.  We can't possibly take such
a retrograde step."

^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: [PATCH 1/3] X86: Optimise fls(), ffs() and fls64()
  2010-03-26 17:45     ` Linus Torvalds
@ 2010-03-26 17:58       ` Ralf Baechle
  2010-03-26 18:03         ` Linus Torvalds
  0 siblings, 1 reply; 24+ messages in thread
From: Ralf Baechle @ 2010-03-26 17:58 UTC (permalink / raw)
  To: Linus Torvalds; +Cc: David Howells, mingo, tglx, linux-arch, linux-kernel

On Fri, Mar 26, 2010 at 10:45:05AM -0700, Linus Torvalds wrote:

> I went back and checked the old Intel 386 docs from -92 or something, and 
> it was "undefined" in there too. So at least Intel seems to have been very 
> consistent on this.
> 
> That said, maybe all implementations actually do the "don't touch" thing. 
> 
> But I do have this memory of us doing this ten+ years ago, though, and 
> having to check the ZF after all. Which is why I reacted to the patch in 
> the first place and checked the documentation.

My trusty old 486 book [1] in the remarks about the BSF instruction:

"The documentation on the 80386 and 80486 states that op1 is undefined if
op2 is 0.  In reality the 80386 will leave the value in op1 unchanged.
The first versions of the 80486 will change op1 to an undefined value.
Later version again will leave it unchanged."

[1] Die Intel Familie in German language, by Robert Hummel, 1992

^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: [PATCH 1/3] X86: Optimise fls(), ffs() and fls64()
  2010-03-26 17:58       ` Ralf Baechle
@ 2010-03-26 18:03         ` Linus Torvalds
  2010-03-26 18:16           ` Matthew Wilcox
                             ` (2 more replies)
  0 siblings, 3 replies; 24+ messages in thread
From: Linus Torvalds @ 2010-03-26 18:03 UTC (permalink / raw)
  To: Ralf Baechle; +Cc: David Howells, mingo, tglx, linux-arch, linux-kernel



On Fri, 26 Mar 2010, Ralf Baechle wrote:
> 
> My trusty old 486 book [1] in the remarks about the BSF instruction:
> 
> "The documentation on the 80386 and 80486 states that op1 is undefined if
> op2 is 0.  In reality the 80386 will leave the value in op1 unchanged.
> The first versions of the 80486 will change op1 to an undefined value.
> Later version again will leave it unchanged."
> 
> [1] Die Intel Familie in German language, by Robert Hummel, 1992

Ok, that explains my memory of us having tried this, at least.

But I do wonder if any of the people working for Intel could ask the CPU 
architects whether we could depend on the "don't write" for 64-bit mode. 
If AMD already documents the don't-touch semantics, and if Intel were to 
be ok with documenting it for their 64-bit capable CPU's, we wouldn't then 
need to rely on undefined behavior.

		Linus

^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: [PATCH 1/3] X86: Optimise fls(), ffs() and fls64()
  2010-03-26 18:03         ` Linus Torvalds
@ 2010-03-26 18:16           ` Matthew Wilcox
  2010-04-06 13:30           ` Matthew Wilcox
  2010-04-14 11:49           ` David Howells
  2 siblings, 0 replies; 24+ messages in thread
From: Matthew Wilcox @ 2010-03-26 18:16 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Ralf Baechle, David Howells, mingo, tglx, linux-arch,
	linux-kernel

On Fri, Mar 26, 2010 at 11:03:09AM -0700, Linus Torvalds wrote:
> On Fri, 26 Mar 2010, Ralf Baechle wrote:
> > 
> > My trusty old 486 book [1] in the remarks about the BSF instruction:
> > 
> > "The documentation on the 80386 and 80486 states that op1 is undefined if
> > op2 is 0.  In reality the 80386 will leave the value in op1 unchanged.
> > The first versions of the 80486 will change op1 to an undefined value.
> > Later version again will leave it unchanged."
> > 
> > [1] Die Intel Familie in German language, by Robert Hummel, 1992
> 
> Ok, that explains my memory of us having tried this, at least.
> 
> But I do wonder if any of the people working for Intel could ask the CPU 
> architects whether we could depend on the "don't write" for 64-bit mode. 
> If AMD already documents the don't-touch semantics, and if Intel were to 
> be ok with documenting it for their 64-bit capable CPU's, we wouldn't then 
> need to rely on undefined behavior.

I'll drop one of them a note.

-- 
Matthew Wilcox				Intel Open Source Technology Centre
"Bill, look, we understand that you're interested in selling us this
operating system, but compare it to ours.  We can't possibly take such
a retrograde step."

^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: [PATCH 1/3] X86: Optimise fls(), ffs() and fls64()
  2010-03-26 18:03         ` Linus Torvalds
  2010-03-26 18:16           ` Matthew Wilcox
@ 2010-04-06 13:30           ` Matthew Wilcox
  2010-04-14 11:49           ` David Howells
  2 siblings, 0 replies; 24+ messages in thread
From: Matthew Wilcox @ 2010-04-06 13:30 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Ralf Baechle, David Howells, mingo, tglx, linux-arch,
	linux-kernel

On Fri, Mar 26, 2010 at 11:03:09AM -0700, Linus Torvalds wrote:
> On Fri, 26 Mar 2010, Ralf Baechle wrote:
> > "The documentation on the 80386 and 80486 states that op1 is undefined if
> > op2 is 0.  In reality the 80386 will leave the value in op1 unchanged.
> > The first versions of the 80486 will change op1 to an undefined value.
> > Later version again will leave it unchanged."
> > 
> > [1] Die Intel Familie in German language, by Robert Hummel, 1992
> 
> Ok, that explains my memory of us having tried this, at least.
> 
> But I do wonder if any of the people working for Intel could ask the CPU 
> architects whether we could depend on the "don't write" for 64-bit mode. 
> If AMD already documents the don't-touch semantics, and if Intel were to 
> be ok with documenting it for their 64-bit capable CPU's, we wouldn't then 
> need to rely on undefined behavior.

I don't know whether we can get it /documented/, but the architect I
asked said "We'll never get away with reverting to the older behavior,
so in essence the architecture is set to not overwrite."

-- 
Matthew Wilcox				Intel Open Source Technology Centre
"Bill, look, we understand that you're interested in selling us this
operating system, but compare it to ours.  We can't possibly take such
a retrograde step."

^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: [PATCH 1/3] X86: Optimise fls(), ffs() and fls64()
  2010-03-26 17:42       ` Linus Torvalds
@ 2010-04-06 13:57         ` Jamie Lokier
  2010-04-06 14:40           ` Linus Torvalds
  0 siblings, 1 reply; 24+ messages in thread
From: Jamie Lokier @ 2010-04-06 13:57 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Scott Lurndal, David Howells, mingo, tglx, linux-arch,
	linux-kernel

Linus Torvalds wrote:
> On Fri, 26 Mar 2010, Scott Lurndal wrote:
> > 
> > I wonder if Intel's EM64 stuff makes this more deterministic, perhaps
> > David's implementation would work for x86_64 only?
> 
> Limiting it to x86-64 would certainly remove all the worries about all the 
> historical x86 clones.
> 
> I'd still worry about it for future Intel chips, though. I absolutely 
> _detest_ relying on undocumented features - it pretty much always ends up 
> biting you eventually. And conditional writeback is actually pretty nasty 
> from a microarchitectural standpoint.

On the same subject of relying on undocumented features:

  /* If SMP and !X86_PPRO_FENCE. */
  #define smp_rmb()      barrier()

I've seen documentation, links posted to lkml ages ago, which implies
this is fine on 64-bit for both Intel and AMD.

But it appears to be relying on undocumented behaviour on 32-bit...

Are you sure it is ok?  Has anyone from Intel/AMD ever confirmed it is
ok?  Has it been tested?  Clones?

-- Jamie

^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: [PATCH 1/3] X86: Optimise fls(), ffs() and fls64()
  2010-04-06 13:57         ` Jamie Lokier
@ 2010-04-06 14:40           ` Linus Torvalds
  0 siblings, 0 replies; 24+ messages in thread
From: Linus Torvalds @ 2010-04-06 14:40 UTC (permalink / raw)
  To: Jamie Lokier
  Cc: Scott Lurndal, David Howells, mingo, tglx, linux-arch,
	linux-kernel



On Tue, 6 Apr 2010, Jamie Lokier wrote:
> 
> On the same subject of relying on undocumented features:
> 
>   /* If SMP and !X86_PPRO_FENCE. */
>   #define smp_rmb()      barrier()
> 
> I've seen documentation, links posted to lkml ages ago, which implies
> this is fine on 64-bit for both Intel and AMD.
> 
> But it appears to be relying on undocumented behaviour on 32-bit...

That memory ordering whitepaper is very much supposed to cover all the 
32-bit CPU's too. The people involved were convinced that neither AMD nor 
Intel had ever produced anything that would do anything that broke the 
rules. 

In fact, at least the Intel "memory ordering whitepaper" doesn't even 
exist any more. Go to intel.com and search, and you'll find:

  "Intel® 64 Architecture Memory Ordering White Paper

   This document has been merged into Volume 3A of Intel 64 and IA-32 
   Architectures Software Developers Manual."

which makes it pretty clear that it's not a 64-bit vs 32-bit issue.

> Are you sure it is ok?  Has anyone from Intel/AMD ever confirmed it is
> ok?  Has it been tested?  Clones?

No clones need apply - nobody ever did very aggressive memory re-ordering, 
and clones generally never did SMP either.

There is a VIA chip (I think) that had some relaxed cache mode, but that 
needed a cr4 bit enable or similar, and since it wasn't SMP it only 
mattered for DMA (and possibly nontemporal stores).

Anyway, it all boils down to: yes, we can depend on the memory ordering.

		Linus

^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: [PATCH 1/3] X86: Optimise fls(), ffs() and fls64()
  2010-03-26 18:03         ` Linus Torvalds
  2010-03-26 18:16           ` Matthew Wilcox
  2010-04-06 13:30           ` Matthew Wilcox
@ 2010-04-14 11:49           ` David Howells
  2010-04-14 14:30             ` Avi Kivity
  2010-04-15  8:48             ` David Howells
  2 siblings, 2 replies; 24+ messages in thread
From: David Howells @ 2010-04-14 11:49 UTC (permalink / raw)
  To: Linus Torvalds, Matthew Wilcox
  Cc: dhowells, Ralf Baechle, mingo, tglx, linux-arch, linux-kernel

Matthew Wilcox <matthew@wil.cx> wrote:

> I don't know whether we can get it /documented/, but the architect I
> asked said "We'll never get away with reverting to the older behavior,
> so in essence the architecture is set to not overwrite."

Does that mean we can rely on it?  Linus?

David

^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: [PATCH 1/3] X86: Optimise fls(), ffs() and fls64()
  2010-03-26 14:42 ` David Howells
  2010-03-26 17:23   ` Linus Torvalds
  2010-03-26 17:42   ` David Howells
@ 2010-04-14 13:13   ` David Howells
  2 siblings, 0 replies; 24+ messages in thread
From: David Howells @ 2010-04-14 13:13 UTC (permalink / raw)
  To: P; +Cc: torvalds, matthew, arjan, linux-arch, linux-kernel, dhowells

Pádraig Brady <P@draigBrady.com> wrote:

> Benchmarks would be useful for this patch set.

Okay.

Using the attached test program:

	warthog>time ./get_order 
	real    1m37.191s
	user    1m36.313s
	sys     0m0.861s
	warthog>time ./get_order x
	real    0m16.892s
	user    0m16.586s
	sys     0m0.287s
	warthog>time ./get_order x x
	real    0m7.731s
	user    0m7.727s
	sys     0m0.002s

Using the current upstream fls64() as a basis for an inlined get_order() [the
second result above] is much faster than using the current out-of-line
loop-based get_order() [the first result above].

Using my optimised inline fls64()-based get_order() [the third result above]
is even faster still.

I ran the above on my Core2 desktop box running x86_64 Fedora 12.

Also note that I compiled the test program with -O3, so I had to do things to
prevent gcc from optimising the call to fls64() or get_order() away, such as
adding up the results and sticking them in a global variable, and not having
too few values passed to get_order(), lest gcc calculate them in advance.

So it would be useful to decide if we can optimise fls() and fls64() for
x86_64.  Certainly it would be useful to replace the out-of-line get_order()
for x86_64.

David
---
#include <stdlib.h>
#include <stdio.h>

#ifndef __x86_64__
#error
#endif

#define BITS_PER_LONG 64

#define PAGE_SHIFT 12

typedef unsigned long long __u64, u64;
typedef unsigned int __u32, u32;
#define noinline	__attribute__((noinline))

static __always_inline int fls64(__u64 x)
{
	long bitpos = -1;

	asm("bsrq %1,%0"
	    : "+r" (bitpos)
	    : "rm" (x));
	return bitpos + 1;
}

static inline unsigned long __fls(unsigned long word)
{
	asm("bsr %1,%0"
	    : "=r" (word)
	    : "rm" (word));
	return word;
}
static __always_inline int old_fls64(__u64 x)
{
	if (x == 0)
		return 0;
	return __fls(x) + 1;
}

static noinline // __attribute__((const))
int old_get_order(unsigned long size)
{
	int order;

	size = (size - 1) >> (PAGE_SHIFT - 1);
	order = -1;
	do {
		size >>= 1;
		order++;
	} while (size);
	return order;
}

static inline __attribute__((const))
int __get_order_old_fls64(unsigned long size)
{
	int order;
	size--;
	size >>= PAGE_SHIFT;
	order = old_fls64(size);
	return order;
}

static inline __attribute__((const))
int __get_order(unsigned long size)
{
	int order;
	size--;
	size >>= PAGE_SHIFT;
	order = fls64(size);
	return order;
}

#define get_order_old_fls64(n)						\
	(								\
		__get_order_old_fls64(n)				\
	)

#define get_order(n)							\
	(								\
		__get_order(n)						\
	)

unsigned long prevent_optimise_out;

static noinline unsigned long test_old_get_order(void)
{
	unsigned long n, total = 0;
	long rep, loop;

	for (rep = 1000000; rep > 0; rep--) {
		for (loop = 0; loop <= 16384; loop += 4) {
			n = 1UL << loop;
			total += old_get_order(n);
		}
	}
	return total;
}

static noinline unsigned long test_get_order_old_fls64(void)
{
	unsigned long n, total = 0;
	long rep, loop;

	for (rep = 1000000; rep > 0; rep--) {
		for (loop = 0; loop <= 16384; loop += 4) {
			n = 1UL << loop;
			total += get_order_old_fls64(n);
		}
	}
	return total;
}

static noinline unsigned long test_get_order(void)
{
	unsigned long n, total = 0;
	long rep, loop;

	for (rep = 1000000; rep > 0; rep--) {
		for (loop = 0; loop <= 16384; loop += 4) {
			n = 1UL << loop;
			total += get_order(n);
		}
	}
	return total;
}

int main(int argc, char **argv)
{
	unsigned long total;

	switch (argc) {
	case 1:  total = test_old_get_order();		break;
	case 2:  total = test_get_order_old_fls64();	break;
	default: total = test_get_order();		break;
	}
	prevent_optimise_out = total;
	return 0;
}

^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: [PATCH 1/3] X86: Optimise fls(), ffs() and fls64()
  2010-04-14 11:49           ` David Howells
@ 2010-04-14 14:30             ` Avi Kivity
  2010-04-15  8:48             ` David Howells
  1 sibling, 0 replies; 24+ messages in thread
From: Avi Kivity @ 2010-04-14 14:30 UTC (permalink / raw)
  To: David Howells
  Cc: Linus Torvalds, Matthew Wilcox, Ralf Baechle, mingo, tglx,
	linux-arch, linux-kernel

On 04/14/2010 02:49 PM, David Howells wrote:
> Matthew Wilcox<matthew@wil.cx>  wrote:
>
>    
>> I don't know whether we can get it /documented/, but the architect I
>> asked said "We'll never get away with reverting to the older behavior,
>> so in essence the architecture is set to not overwrite."
>>      
> Does that mean we can rely on it?  Linus?
>    

Even if Intel processors behave that way, other processors (real and 
emulated) use those manuals as a specification.  Emulated processors are 
unlikely to touch an undefined register, but real processors may.

(qemu tcg appears not to touch the output)

-- 
error compiling committee.c: too many arguments to function

^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: [PATCH 1/3] X86: Optimise fls(), ffs() and fls64()
  2010-04-14 11:49           ` David Howells
  2010-04-14 14:30             ` Avi Kivity
@ 2010-04-15  8:48             ` David Howells
  2010-04-15  8:49               ` Avi Kivity
  1 sibling, 1 reply; 24+ messages in thread
From: David Howells @ 2010-04-15  8:48 UTC (permalink / raw)
  To: Avi Kivity
  Cc: dhowells, Linus Torvalds, Matthew Wilcox, Ralf Baechle, mingo,
	tglx, linux-arch, linux-kernel

Avi Kivity <avi@redhat.com> wrote:

> Even if Intel processors behave that way, other processors (real and
> emulated) use those manuals as a specification.  Emulated processors are
> unlikely to touch an undefined register, but real processors may.
> 
> (qemu tcg appears not to touch the output)

Possibly because the AMD64 spec specifies that the destination will be
unchanged if the source was 0.

David

^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: [PATCH 1/3] X86: Optimise fls(), ffs() and fls64()
  2010-04-15  8:48             ` David Howells
@ 2010-04-15  8:49               ` Avi Kivity
  2010-04-15 11:41                 ` Jamie Lokier
  0 siblings, 1 reply; 24+ messages in thread
From: Avi Kivity @ 2010-04-15  8:49 UTC (permalink / raw)
  To: David Howells
  Cc: Linus Torvalds, Matthew Wilcox, Ralf Baechle, mingo, tglx,
	linux-arch, linux-kernel

On 04/15/2010 11:48 AM, David Howells wrote:
> Avi Kivity<avi@redhat.com>  wrote:
>
>    
>> Even if Intel processors behave that way, other processors (real and
>> emulated) use those manuals as a specification.  Emulated processors are
>> unlikely to touch an undefined register, but real processors may.
>>
>> (qemu tcg appears not to touch the output)
>>      
> Possibly because the AMD64 spec specifies that the destination will be
> unchanged if the source was 0.
>    

Likely.  But we haven't tested all current and future x86 clones, and 
they may be based off the Intel documentation instead of the AMD 
documentation.

-- 
I have a truly marvellous patch that fixes the bug which this
signature is too narrow to contain.

^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: [PATCH 1/3] X86: Optimise fls(), ffs() and fls64()
  2010-04-15  8:49               ` Avi Kivity
@ 2010-04-15 11:41                 ` Jamie Lokier
  0 siblings, 0 replies; 24+ messages in thread
From: Jamie Lokier @ 2010-04-15 11:41 UTC (permalink / raw)
  To: Avi Kivity
  Cc: David Howells, Linus Torvalds, Matthew Wilcox, Ralf Baechle,
	mingo, tglx, linux-arch, linux-kernel

Avi Kivity wrote:
> Likely.  But we haven't tested all current and future x86 clones, and 
> they may be based off the Intel documentation instead of the AMD 
> documentation.

I wonder about that too.  I got the impression Transmeta did lots of
testing real x86s in all sorts of corner cases, because the manuals
don't cover everything that the broad base of software depends on in
practice.  Clone makers have to do it to a much higher standard than
emulators because you can't generally release patches...

I think Via (including whatever the CPU line was formerly called)
have been bitten a few times by not quite matching software
expectations.

Even Intel was caught on x86_64 at the beginning by slight differences
when they cloned AMD's design :-)

-- Jamie

^ permalink raw reply	[flat|nested] 24+ messages in thread

end of thread, other threads:[~2010-04-15 11:42 UTC | newest]

Thread overview: 24+ messages (download: mbox.gz follow: Atom feed
-- links below jump to the message on this page --
2010-01-13 19:39 [PATCH 1/3] X86: Optimise fls(), ffs() and fls64() David Howells
2010-01-13 19:39 ` [PATCH 2/3] Adjust the comment on get_order() to describe the size==0 case David Howells
2010-01-13 19:39 ` [PATCH 3/3] Optimise get_order() David Howells
2010-01-13 20:15 ` [PATCH 1/3] X86: Optimise fls(), ffs() and fls64() Geert Uytterhoeven
2010-01-13 21:59 ` David Howells
     [not found] <4BACCB4E.7010108@draigBrady.com>
2010-03-26 14:42 ` David Howells
2010-03-26 17:23   ` Linus Torvalds
2010-03-26 17:37     ` Scott Lurndal
2010-03-26 17:42       ` Linus Torvalds
2010-04-06 13:57         ` Jamie Lokier
2010-04-06 14:40           ` Linus Torvalds
2010-03-26 17:42   ` David Howells
2010-03-26 17:45     ` Linus Torvalds
2010-03-26 17:58       ` Ralf Baechle
2010-03-26 18:03         ` Linus Torvalds
2010-03-26 18:16           ` Matthew Wilcox
2010-04-06 13:30           ` Matthew Wilcox
2010-04-14 11:49           ` David Howells
2010-04-14 14:30             ` Avi Kivity
2010-04-15  8:48             ` David Howells
2010-04-15  8:49               ` Avi Kivity
2010-04-15 11:41                 ` Jamie Lokier
2010-03-26 17:52     ` Matthew Wilcox
2010-04-14 13:13   ` David Howells

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).