public inbox for linux-kernel@vger.kernel.org
* [PATCH] speed up on find_first_bit for i386 (let compiler do the work)
  2005-07-28  3:51           ` Nick Piggin
@ 2005-07-28 11:43             ` Steven Rostedt
  2005-07-28 12:45               ` Steven Rostedt
                                 ` (2 more replies)
  0 siblings, 3 replies; 23+ messages in thread
From: Steven Rostedt @ 2005-07-28 11:43 UTC (permalink / raw)
  To: Nick Piggin
  Cc: Ingo Molnar, Andrew Morton, Linus Torvalds, LKML, Daniel Walker

[-- Attachment #1: Type: text/plain, Size: 5133 bytes --]

In the thread "[RFC][PATCH] Make MAX_RT_PRIO and MAX_USER_RT_PRIO
configurable" I discovered that a C version of find_first_bit is faster
than the asm version now when compiled against gcc 3.3.6 and gcc 4.0.1
(both from versions of Debian unstable).  I wrote a benchmark (attached)
that runs the code 1,000,000 times. First I do it with no bits set,
followed by the last bit set, a middle bit set and then the first bit
set.  Only when no bits are set is the asm version of find_first_bit
faster and that is only when I ran it with gcc 4.0.1 (is gcc at fault
here?). I haven't spent any time actually looking at what gcc produces.
I only looked at the measurements.

I compiled this with "gcc -O2 -o ffb ffb.c".  And here's the output.

On an AMD 2.2GHz SMP machine (SMP shouldn't affect the result here),
running gcc 4.0.1:

/* comments embedded */

/* ticks match speed */
clock speed = 00000000:7f31d02a 2133970986 ticks per second

/* generic ffb (or just ffb) is the code that is currently in the system
 * my ffb (or just my) is my new version that I'm submitting.
 * my ffb 2 (or just my2) is my version playing with unlikely around a
 *   condition.
 */
no bit set
ffb=320  my=320 my2=320
generic ffb: 00000000:02e33f90
time: 0.022702922us
my ffb: 00000000:02ee66e7
time: 0.023045461us
my ffb 2: 00000000:032d63b9
time: 0.024979860us
/*
 * The above shows that the original beats my version when no bit is
 * set.
 */

last bit set
ffb=319  my=319 my2=319
generic ffb: 00000000:0382a116
time: 0.027597643us
my ffb: 00000000:0204c4a9
time: 0.015870375us
my ffb 2: 00000000:03244a1b
time: 0.024700391us
/*
 * Here we see that there's quite an improvement over normal ffb when
 * the last bit is set.
 */

middle bit set
ffb=159  my=159 my2=159
generic ffb: 00000000:02ce2b78
time: 0.022055584us
my ffb: 00000000:01241c5b
time: 0.008970962us
my ffb 2: 00000000:016171ff
time: 0.010854596us
/*
 * Again, there's quite an improvement when a middle bit is set.
 */

first bit set
ffb=0  my=0 my2=0
generic ffb: 00000000:0232456a
time: 0.017267808us
my ffb: 00000000:003dd354
time: 0.001898712us
my ffb 2: 00000000:009d1f74
time: 0.004825372us
/*
 * When the first bit is set, there's an even greater improvement.
 */


Now for the results on my laptop with a Pentium 4 HT 3.3GHz, running
gcc 3.3.6:

clock speed = 00000000:c5de80ef 3319693551 ticks per second

no bit set
ffb=320  my=320 my2=320
generic ffb: 00000000:0aba64db
time: 0.054218162us
my ffb: 00000000:055f6c73
time: 0.027153036us
my ffb 2: 00000000:052e753e
time: 0.026186379us
/*
 * Now we see that even when no bits are set, my version beats the asm one.
 */

last bit set
ffb=319  my=319 my2=319
generic ffb: 00000000:0b69c638
time: 0.057680447us
my ffb: 00000000:050a27fb
time: 0.025469722us
my ffb 2: 00000000:04d32d78
time: 0.024384359us

middle bit set
ffb=159  my=159 my2=159
generic ffb: 00000000:0a1bc81f
time: 0.051086903us
my ffb: 00000000:020f3d7d
time: 0.010408554us
my ffb 2: 00000000:0324112a
time: 0.015873555us

first bit set
ffb=0  my=0 my2=0
generic ffb: 00000000:095a794d
time: 0.047270700us
my ffb: 00000000:005af2d0
time: 0.001795467us
my ffb 2: 00000000:005a0537
time: 0.001777144us


With this evidence, I present my patch against the 2.6.12.2 kernel.

Signed-off-by:  Steven Rostedt <rostedt@goodmis.org>

Index: vanilla_kernel/include/asm-i386/bitops.h
===================================================================
--- vanilla_kernel/include/asm-i386/bitops.h	(revision 263)
+++ vanilla_kernel/include/asm-i386/bitops.h	(working copy)
@@ -311,6 +311,20 @@
 int find_next_zero_bit(const unsigned long *addr, int size, int offset);
 
 /**
+ * __ffs - find first bit in word.
+ * @word: The word to search
+ *
+ * Undefined if no bit exists, so code should check against 0 first.
+ */
+static inline unsigned long __ffs(unsigned long word)
+{
+	__asm__("bsfl %1,%0"
+		:"=r" (word)
+		:"rm" (word));
+	return word;
+}
+
+/**
  * find_first_bit - find the first set bit in a memory region
  * @addr: The address to start the search at
  * @size: The maximum size to search
@@ -320,22 +334,16 @@
  */
 static inline int find_first_bit(const unsigned long *addr, unsigned size)
 {
-	int d0, d1;
-	int res;
-
-	/* This looks at memory. Mark it volatile to tell gcc not to move it around */
-	__asm__ __volatile__(
-		"xorl %%eax,%%eax\n\t"
-		"repe; scasl\n\t"
-		"jz 1f\n\t"
-		"leal -4(%%edi),%%edi\n\t"
-		"bsfl (%%edi),%%eax\n"
-		"1:\tsubl %%ebx,%%edi\n\t"
-		"shll $3,%%edi\n\t"
-		"addl %%edi,%%eax"
-		:"=a" (res), "=&c" (d0), "=&D" (d1)
-		:"1" ((size + 31) >> 5), "2" (addr), "b" (addr) : "memory");
-	return res;
+	int x = 0;
+	do {
+		if (*addr)
+			return __ffs(*addr) + x;
+		addr++;
+		if (x >= size)
+			break;
+		x += 32;
+	} while (1);
+	return x;
 }
 
 /**
@@ -360,20 +368,6 @@
 	return word;
 }
 
-/**
- * __ffs - find first bit in word.
- * @word: The word to search
- *
- * Undefined if no bit exists, so code should check against 0 first.
- */
-static inline unsigned long __ffs(unsigned long word)
-{
-	__asm__("bsfl %1,%0"
-		:"=r" (word)
-		:"rm" (word));
-	return word;
-}
-
 /*
  * fls: find last bit set.
  */


[-- Attachment #2: ffb.c --]
[-- Type: text/x-csrc, Size: 3303 bytes --]

#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <unistd.h>

#define unlikely(x)	__builtin_expect(!!(x), 0)

static inline int find_first_bit(const unsigned long *addr, unsigned size)
{
	int d0, d1;
	int res;

	/* This looks at memory. Mark it volatile to tell gcc not to move it around */
	__asm__ __volatile__(
		"xorl %%eax,%%eax\n\t"
		"repe; scasl\n\t"
		"jz 1f\n\t"
		"leal -4(%%edi),%%edi\n\t"
		"bsfl (%%edi),%%eax\n"
		"1:\tsubl %%ebx,%%edi\n\t"
		"shll $3,%%edi\n\t"
		"addl %%edi,%%eax"
		:"=a" (res), "=&c" (d0), "=&D" (d1)
		:"1" ((size + 31) >> 5), "2" (addr), "b" (addr) : "memory");
	return res;
}

static inline unsigned long __ffs(unsigned long word)
{
	__asm__("bsfl %1,%0"
		:"=r" (word)
		:"rm" (word));
	return word;
}

static inline int my_find_first_bit(const unsigned long *b, unsigned size)
{
	int x = 0;
	do {
		if (*b)
			return __ffs(*b) + x;
		b++;
		if (x >= size)
			break;
		x += 32;
	} while (1);
	return x;
}

static inline int my_find_first_bit2(const unsigned long *b, unsigned size)
{
	int x = 0;
	do {
		if (unlikely(*b))
			return __ffs(*b) + x;
		b++;
		if (x >= size)
			break;
		x += 32;
	} while (1);
	return x;
}
#define rdtscll(val) \
	     __asm__ __volatile__("rdtsc" : "=A" (val))

#define rdtsc(low,high) \
	     __asm__ __volatile__("rdtsc" : "=a" (low), "=d" (high))

#define BITSIZE 310
static unsigned long array[((BITSIZE)>>5)+1];

#define ITER 1000000 /* 1,000,000 times */

void testit(unsigned long *array, unsigned long long clock)
{
	unsigned long long s;
	unsigned long long e;
	unsigned long long t;
	double f;
	int i;
	int x;
	
	/*
	 * Since ITER is 1,000,000 the times will be in us.
	 */

	/*
	 * Make sure that the output is correct.
	 */
	printf("ffb=%d  my=%d my2=%d\n",
			find_first_bit(array,BITSIZE),
			my_find_first_bit(array,BITSIZE),
			my_find_first_bit2(array,BITSIZE));

	rdtscll(s);
	for (i=0; i < ITER; i++) 
		x = find_first_bit(array,BITSIZE);
	rdtscll(e);
	t = e - s;
	f = (float)t / (float)clock;
	printf("generic ffb: %08lx:%08lx\n",
			(unsigned long)(t>>32),(unsigned long)t);
	printf("time: %.09fus\n",f);

	rdtscll(s);
	for (i=0; i < ITER; i++) 
		x = my_find_first_bit(array,BITSIZE);
	rdtscll(e);
	t = e - s;
	f = (float)t / (float)clock;
	printf("my ffb: %08lx:%08lx\n",
			(unsigned long)(t>>32),(unsigned long)t);
	printf("time: %.09fus\n",f);

	rdtscll(s);
	for (i=0; i < ITER; i++) 
		x = my_find_first_bit2(array,BITSIZE);
	rdtscll(e);
	t = e - s;
	f = (float)t / (float)clock;
	printf("my ffb 2: %08lx:%08lx\n",
			(unsigned long)(t>>32),(unsigned long)t);
	printf("time: %.09fus\n",f);
}

int main(int argc, char **argv)
{
	unsigned long long s;
	unsigned long long e;
	unsigned long long t;
	unsigned long long clock;

	/*
	 * Calculate BS time, just to get an 
	 * idea of the tsc speed.
	 */
	rdtscll(s);
	sleep(8);
	rdtscll(e);
	
	t = e - s;
	t >>= 3;
	printf("clock speed = %08lx:%08lx %llu ticks per second\n",
			(unsigned long)(t>>32),(unsigned long)t,
			t);
	clock = t;

	printf("\nno bit set\n");
	testit(array,clock);

	array[BITSIZE>>5] = 0x80000000;
	printf("\nlast bit set\n");
	testit(array,clock);

	array[4] = 0x80000000;
	printf("\nmiddle bit set\n");
	testit(array,clock);

	array[0] = 0x00000001;
	printf("\nfirst bit set\n");
	testit(array,clock);
	
	exit(0);
}

^ permalink raw reply	[flat|nested] 23+ messages in thread

* Re: [PATCH] speed up on find_first_bit for i386 (let compiler do the work)
  2005-07-28 11:43             ` [PATCH] speed up on find_first_bit for i386 (let compiler do the work) Steven Rostedt
@ 2005-07-28 12:45               ` Steven Rostedt
  2005-07-28 15:31                 ` Linus Torvalds
  2005-07-28 15:30               ` Linus Torvalds
  2005-07-28 17:52               ` Mitchell Blank Jr
  2 siblings, 1 reply; 23+ messages in thread
From: Steven Rostedt @ 2005-07-28 12:45 UTC (permalink / raw)
  To: Nick Piggin
  Cc: Daniel Walker, LKML, Linus Torvalds, Andrew Morton, Ingo Molnar

[snip]
>  static inline int find_first_bit(const unsigned long *addr, unsigned size)
>  {
[snip]
> +	int x = 0;
> +	do {
> +		if (*addr)
> +			return __ffs(*addr) + x;
> +		addr++;
> +		if (x >= size)
> +			break;
> +		x += 32;
The 32 looks like it may be problematic.  Are there any i386 64-bit
machines?  Or is hard-coding 32 OK?

> +	} while (1);
> +	return x;
>  }
>  

Just in case, I've updated the patch to use (sizeof(*addr)<<3)

Signed-off-by: Steven Rostedt <rostedt@goodmis.org>

Index: vanilla_kernel/include/asm-i386/bitops.h
===================================================================
--- vanilla_kernel/include/asm-i386/bitops.h	(revision 263)
+++ vanilla_kernel/include/asm-i386/bitops.h	(working copy)
@@ -311,6 +311,20 @@
 int find_next_zero_bit(const unsigned long *addr, int size, int offset);
 
 /**
+ * __ffs - find first bit in word.
+ * @word: The word to search
+ *
+ * Undefined if no bit exists, so code should check against 0 first.
+ */
+static inline unsigned long __ffs(unsigned long word)
+{
+	__asm__("bsfl %1,%0"
+		:"=r" (word)
+		:"rm" (word));
+	return word;
+}
+
+/**
  * find_first_bit - find the first set bit in a memory region
  * @addr: The address to start the search at
  * @size: The maximum size to search
@@ -320,22 +334,16 @@
  */
 static inline int find_first_bit(const unsigned long *addr, unsigned size)
 {
-	int d0, d1;
-	int res;
-
-	/* This looks at memory. Mark it volatile to tell gcc not to move it around */
-	__asm__ __volatile__(
-		"xorl %%eax,%%eax\n\t"
-		"repe; scasl\n\t"
-		"jz 1f\n\t"
-		"leal -4(%%edi),%%edi\n\t"
-		"bsfl (%%edi),%%eax\n"
-		"1:\tsubl %%ebx,%%edi\n\t"
-		"shll $3,%%edi\n\t"
-		"addl %%edi,%%eax"
-		:"=a" (res), "=&c" (d0), "=&D" (d1)
-		:"1" ((size + 31) >> 5), "2" (addr), "b" (addr) : "memory");
-	return res;
+	int x = 0;
+	do {
+		if (*addr)
+			return __ffs(*addr) + x;
+		addr++;
+		if (x >= size)
+			break;
+		x += (sizeof(*addr)<<3);
+	} while (1);
+	return x;
 }
 
 /**
@@ -360,20 +368,6 @@
 	return word;
 }
 
-/**
- * __ffs - find first bit in word.
- * @word: The word to search
- *
- * Undefined if no bit exists, so code should check against 0 first.
- */
-static inline unsigned long __ffs(unsigned long word)
-{
-	__asm__("bsfl %1,%0"
-		:"=r" (word)
-		:"rm" (word));
-	return word;
-}
-
 /*
  * fls: find last bit set.
  */




* Re: [PATCH] speed up on find_first_bit for i386 (let compiler do the work)
  2005-07-28 11:43             ` [PATCH] speed up on find_first_bit for i386 (let compiler do the work) Steven Rostedt
  2005-07-28 12:45               ` Steven Rostedt
@ 2005-07-28 15:30               ` Linus Torvalds
  2005-07-28 15:47                 ` Steven Rostedt
  2005-07-28 17:52               ` Mitchell Blank Jr
  2 siblings, 1 reply; 23+ messages in thread
From: Linus Torvalds @ 2005-07-28 15:30 UTC (permalink / raw)
  To: Steven Rostedt
  Cc: Nick Piggin, Ingo Molnar, Andrew Morton, LKML, Daniel Walker



On Thu, 28 Jul 2005, Steven Rostedt wrote:
>
> In the thread "[RFC][PATCH] Make MAX_RT_PRIO and MAX_USER_RT_PRIO
> configurable" I discovered that a C version of find_first_bit is faster
> than the asm version now when compiled against gcc 3.3.6 and gcc 4.0.1
> (both from versions of Debian unstable).  I wrote a benchmark (attached)
> that runs the code 1,000,000 times.

I suspect the old "rep scas" has always been slower than 
compiler-generated code, at least under your test conditions. Many of the 
old asm's are actually _very_ old, and some of them come from pre-0.01 
days and are more about me learning the i386 (and gcc inline asm).

That said, I don't much like your benchmarking methodology. I suspect that 
quite often, the code in question runs from L2 cache, not in a tight loop, 
and so that "run a million times" approach is not necessarily the best 
one.

I'll apply this one as obvious: I doubt the compiler generates bigger code
or has any real downsides, but I just wanted to say that in general I just
wish people didn't always time the hot-cache case ;)

		Linus


* Re: [PATCH] speed up on find_first_bit for i386 (let compiler do the work)
  2005-07-28 12:45               ` Steven Rostedt
@ 2005-07-28 15:31                 ` Linus Torvalds
  0 siblings, 0 replies; 23+ messages in thread
From: Linus Torvalds @ 2005-07-28 15:31 UTC (permalink / raw)
  To: Steven Rostedt
  Cc: Nick Piggin, Daniel Walker, LKML, Andrew Morton, Ingo Molnar



On Thu, 28 Jul 2005, Steven Rostedt wrote:
>
> The 32 looks like it may be problematic.  Are there any i386 64-bit
> machines?  Or is hard-coding 32 OK?

We have BITS_PER_LONG exactly for this usage, but the sizeof also works. 
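
By way of illustration, the loop from the patch rewritten around
BITS_PER_LONG might look like the sketch below.  The helper name is made
up, and __builtin_ctzl() is used only as a portable stand-in for the
kernel's bsfl-based __ffs(); neither comes from the thread itself.

```c
#include <limits.h>

#define BITS_PER_LONG ((int)(CHAR_BIT * sizeof(long)))	/* 32 on i386 */

/* Sketch only: same structure as the patched find_first_bit(), with the
 * word width spelled BITS_PER_LONG instead of 32 or sizeof(*addr)<<3. */
static inline int find_first_bit_sketch(const unsigned long *addr,
					unsigned size)
{
	int x = 0;

	do {
		if (*addr)
			return __builtin_ctzl(*addr) + x;
		addr++;
		if ((unsigned)x >= size)
			break;
		x += BITS_PER_LONG;
	} while (1);
	return x;
}
```

Because the constant is derived from sizeof(long), the same source would
compile unchanged on a 64-bit port.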

		Linus


* Re: [PATCH] speed up on find_first_bit for i386 (let compiler do the work)
  2005-07-28 15:30               ` Linus Torvalds
@ 2005-07-28 15:47                 ` Steven Rostedt
  2005-07-28 16:34                   ` Maciej W. Rozycki
  0 siblings, 1 reply; 23+ messages in thread
From: Steven Rostedt @ 2005-07-28 15:47 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Nick Piggin, Ingo Molnar, Andrew Morton, LKML, Daniel Walker

On Thu, 2005-07-28 at 08:30 -0700, Linus Torvalds wrote:
> 
> I suspect the old "rep scas" has always been slower than 
> compiler-generated code, at least under your test conditions. Many of the 
> old asm's are actually _very_ old, and some of them come from pre-0.01 
> days and are more about me learning the i386 (and gcc inline asm).

I've been playing with different approaches, (still all hot cache
though), and inspecting the generated code. It's not that the gcc
generated code is always better for the normal case. But since it sees
more and everything is not hidden in asm, it can optimise what is being
used, and how it's used.

> 
> That said, I don't much like your benchmarking methodology. I suspect that 
> quite often, the code in question runs from L2 cache, not in a tight loop, 
> and so that "run a million times" approach is not necessarily the best 
> one.

Well, I never said I was a benchmark writer :-).  If you know of a
better way to benchmark these, then let me know.  I also thought that
having it all in a hot cache could help show the differences.  But I
guess I would need to test this in other ways.
> 
> I'll apply this one as obvious: I doubt the compiler generates bigger code
> or has any real downsides, but I just wanted to say that in general I just
> wish people didn't always time the hot-cache case ;)

I've just finished a version of find_first_zero_bit too.  It shows the
same kind of improvement as find_first_bit, though not as drastic.  Do
you want this too?  If so, should it be a separate patch on top of the
first one, or against 2.6.12.2 (the kernel I'm working with right now),
or should I submit a new patch with both changes?

-- Steve




* Re: [PATCH] speed up on find_first_bit for i386 (let compiler do the work)
  2005-07-28 15:47                 ` Steven Rostedt
@ 2005-07-28 16:34                   ` Maciej W. Rozycki
  2005-07-28 16:57                     ` Steven Rostedt
                                       ` (2 more replies)
  0 siblings, 3 replies; 23+ messages in thread
From: Maciej W. Rozycki @ 2005-07-28 16:34 UTC (permalink / raw)
  To: Steven Rostedt
  Cc: Linus Torvalds, Nick Piggin, Ingo Molnar, Andrew Morton, LKML,
	Daniel Walker

On Thu, 28 Jul 2005, Steven Rostedt wrote:

> I've been playing with different approaches, (still all hot cache
> though), and inspecting the generated code. It's not that the gcc
> generated code is always better for the normal case. But since it sees
> more and everything is not hidden in asm, it can optimise what is being
> used, and how it's used.

 Since you're considering GCC-generated code for ffs(), ffz() and friends, 
how about trying __builtin_ffs(), __builtin_clz() and __builtin_ctz() as 
appropriate?  Reasonably recent GCC may actually be good enough to use the 
fastest code depending on the processor submodel selected.

  Maciej


* Re: [PATCH] speed up on find_first_bit for i386 (let compiler do the work)
  2005-07-28 16:34                   ` Maciej W. Rozycki
@ 2005-07-28 16:57                     ` Steven Rostedt
  2005-07-28 17:25                       ` Linus Torvalds
  2005-07-28 17:17                     ` Linus Torvalds
  2005-07-28 18:25                     ` Steven Rostedt
  2 siblings, 1 reply; 23+ messages in thread
From: Steven Rostedt @ 2005-07-28 16:57 UTC (permalink / raw)
  To: Maciej W. Rozycki
  Cc: Linus Torvalds, Nick Piggin, Ingo Molnar, Andrew Morton, LKML,
	Daniel Walker

On Thu, 2005-07-28 at 17:34 +0100, Maciej W. Rozycki wrote:

>  Since you're considering GCC-generated code for ffs(), ffz() and friends, 
> how about trying __builtin_ffs(), __builtin_clz() and __builtin_ctz() as 
> appropriate?  Reasonably recent GCC may actually be good enough to use the 
> fastest code depending on the processor submodel selected.

I can change find_first_bit to use __builtin_ffs, but how would you
implement ffz?  The clz and ctz builtins only count the number of
leading or trailing zeros respectively; they don't find the first zero.
Of course, __builtin_ctz(~x) would, but that might take longer than what
we already have.  I'll go ahead and try it and see.  But I still don't
have a decent benchmark for this.  I'll start looking into the kernel to
see how it's used, and see if I can find a proper benchmark.
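
A quick sketch of that ctz(~x) idea (the function name is mine, not the
kernel's):

```c
/* Sketch: find the first zero bit of a 32-bit word via __builtin_ctz(~x).
 * Like the kernel's ffz(), the result is undefined when there is no
 * zero bit (x == ~0U), so callers must check for that case first. */
static inline unsigned int ffz_sketch(unsigned int x)
{
	return (unsigned int)__builtin_ctz(~x);
}
```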

-- Steve




* Re: [PATCH] speed up on find_first_bit for i386 (let compiler do the work)
  2005-07-28 16:34                   ` Maciej W. Rozycki
  2005-07-28 16:57                     ` Steven Rostedt
@ 2005-07-28 17:17                     ` Linus Torvalds
  2005-07-29 15:09                       ` Maciej W. Rozycki
  2005-07-28 18:25                     ` Steven Rostedt
  2 siblings, 1 reply; 23+ messages in thread
From: Linus Torvalds @ 2005-07-28 17:17 UTC (permalink / raw)
  To: Maciej W. Rozycki
  Cc: Steven Rostedt, Nick Piggin, Ingo Molnar, Andrew Morton, LKML,
	Daniel Walker



On Thu, 28 Jul 2005, Maciej W. Rozycki wrote:
> 
>  Since you're considering GCC-generated code for ffs(), ffz() and friends, 
> how about trying __builtin_ffs(), __builtin_clz() and __builtin_ctz() as 
> appropriate?

Please don't. Try again in three years when everybody has them.

		Linus


* Re: [PATCH] speed up on find_first_bit for i386 (let compiler do the work)
  2005-07-28 16:57                     ` Steven Rostedt
@ 2005-07-28 17:25                       ` Linus Torvalds
  2005-07-29 10:03                         ` David Woodhouse
  2005-07-29 14:39                         ` Maciej W. Rozycki
  0 siblings, 2 replies; 23+ messages in thread
From: Linus Torvalds @ 2005-07-28 17:25 UTC (permalink / raw)
  To: Steven Rostedt
  Cc: Maciej W. Rozycki, Nick Piggin, Ingo Molnar, Andrew Morton, LKML,
	Daniel Walker



On Thu, 28 Jul 2005, Steven Rostedt wrote:
> 
> I can change the find_first_bit to use __builtin_ffs, but how would you
> implement the ffz?

The thing is, there are basically _zero_ upsides to using the __builtin_xx 
functions on x86.

There may be more upsides on other architectures (*cough*ia64*cough*) that 
have strange scheduling issues and other complexities, but on x86 in 
particular, the __builtin_xxx() functions tend to be a lot more pain than 
they are worth. Not only do they have strange limitations (on selection of 
opcodes but also for compiler versions), but they aren't well documented, 
and semantics aren't clear.

For example, if you use the "bsfl" inline assembly instruction, you know 
what the semantics are and what the generated code is like: Intel 
documents it, and you know what code you generated. So the special cases 
like "what happens if the input is zero" are well-defined.
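
An ffs()-style wrapper, for example, can pin the zero case down
explicitly around bsfl.  The sketch below follows the shape of the i386
code of the era (x86 target required; the name is changed here to avoid
clashing with the libc ffs()):

```c
/* Sketch: ffs() built on bsfl with an explicit zero-input check.
 * bsfl sets ZF when the source is zero, so the jnz skips the -1
 * fixup whenever a bit was found.  Returns 0 for x == 0, otherwise
 * the 1-based index of the least significant set bit. */
static inline int ffs_sketch(int x)
{
	int r;

	__asm__("bsfl %1,%0\n\t"
		"jnz 1f\n\t"
		"movl $-1,%0\n"
		"1:"
		: "=r" (r)
		: "rm" (x));
	return r + 1;
}
```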

In contrast, the gcc builtins probably match some standard that is not 
only harder to find, but also has some _other_ definition for what happens 
for the zero case, so the builtins automatically end up having problems 
due to semantic mis-match between the CPU and the standard.

Basic rule: inline assembly is _better_ than random compiler extensions. 
It's better to have _one_ well-documented extension that is very generic 
than it is to have a thousand specialized extensions.

		Linus


* Re: [PATCH] speed up on find_first_bit for i386 (let compiler do the work)
  2005-07-28 11:43             ` [PATCH] speed up on find_first_bit for i386 (let compiler do the work) Steven Rostedt
  2005-07-28 12:45               ` Steven Rostedt
  2005-07-28 15:30               ` Linus Torvalds
@ 2005-07-28 17:52               ` Mitchell Blank Jr
  2 siblings, 0 replies; 23+ messages in thread
From: Mitchell Blank Jr @ 2005-07-28 17:52 UTC (permalink / raw)
  To: Steven Rostedt; +Cc: LKML

Steven Rostedt wrote:
> In the thread "[RFC][PATCH] Make MAX_RT_PRIO and MAX_USER_RT_PRIO
> configurable" I discovered that a C version of find_first_bit is faster
> than the asm version

There are probably other cases of this in asm-i386/bitops.h.  For
instance, I think the "btl" instruction is pretty slow on modern CPUs,
so constant_test_bit() will probably outperform variable_test_bit() even
if you feed it a non-constant "nr".  I'd be happy to be proven wrong,
though :-)

When testing these optimizations you should also probably check the resulting
vmlinux size.
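
For reference, the C side of that comparison is just an index, shift,
and mask, roughly as below.  This is a sketch assuming i386's 32-bit
unsigned long (the hard-coded 5 and 31 break on 64-bit longs), not a
copy of the header:

```c
/* Sketch of a constant_test_bit()-style helper: pick the word holding
 * bit "nr", shift it down, and mask.  With a compile-time-constant nr,
 * gcc can fold this to a single test against an immediate. */
static inline int test_bit_sketch(int nr, const unsigned long *addr)
{
	return (int)((addr[nr >> 5] >> (nr & 31)) & 1);
}
```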

-Mitch


* Re: [PATCH] speed up on find_first_bit for i386 (let compiler do the work)
  2005-07-28 16:34                   ` Maciej W. Rozycki
  2005-07-28 16:57                     ` Steven Rostedt
  2005-07-28 17:17                     ` Linus Torvalds
@ 2005-07-28 18:25                     ` Steven Rostedt
  2005-07-28 18:56                       ` Linus Torvalds
  2 siblings, 1 reply; 23+ messages in thread
From: Steven Rostedt @ 2005-07-28 18:25 UTC (permalink / raw)
  To: Maciej W. Rozycki
  Cc: Mitchell Blank Jr, Linus Torvalds, Nick Piggin, Ingo Molnar,
	Andrew Morton, LKML, Daniel Walker

On Thu, 2005-07-28 at 17:34 +0100, Maciej W. Rozycki wrote:
> On Thu, 28 Jul 2005, Steven Rostedt wrote:
> 
> > I've been playing with different approaches, (still all hot cache
> > though), and inspecting the generated code. It's not that the gcc
> > generated code is always better for the normal case. But since it sees
> > more and everything is not hidden in asm, it can optimise what is being
> > used, and how it's used.
> 
>  Since you're considering GCC-generated code for ffs(), ffz() and friends, 
> how about trying __builtin_ffs(), __builtin_clz() and __builtin_ctz() as 
> appropriate?  Reasonably recent GCC may actually be good enough to use the 
> fastest code depending on the processor submodel selected.
> 

OK, I guess when I get some time, I'll start testing all the i386 bitop
functions, comparing the asm with the gcc versions.  Now could someone
explain to me what's wrong with testing hot cache code. Can one
instruction retrieve from memory better than others?  I was trying to
see which was slower in the CPU, but if an algorithm aligns with the cache
or something that is faster, my hot cache testing will not catch that.
But I don't know how to write a test that exercises, over and over
again, something that is not in cache.  It would seem that I would have
to find a way to flush the L1 and L2 caches each time, but that still
seems to add too many variables to the equation to get meaningful
benchmarks.
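
One crude approach is to walk a buffer larger than L2 between timed
runs so the data of interest is evicted.  The sketch below uses an
assumed eviction size and line size (the constants and names are made
up, not from the thread):

```c
#include <stddef.h>

#define EVICT_SIZE (4 * 1024 * 1024)	/* assumed bigger than L2; tune per CPU */

static unsigned char evict_buf[EVICT_SIZE];

/* Touch one byte per cache line of a large buffer so previously cached
 * data is (mostly) evicted before the next timed measurement. */
static void flush_caches_sketch(void)
{
	volatile unsigned long sum = 0;
	size_t i;

	for (i = 0; i < EVICT_SIZE; i += 64)	/* assume 64-byte lines */
		sum += evict_buf[i];
	(void)sum;
}
```

Calling this before each rdtscll() pair would approximate a cold-cache
run, at the cost of much noisier numbers.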

-- Steve




* Re: [PATCH] speed up on find_first_bit for i386 (let compiler do the work)
  2005-07-28 18:25                     ` Steven Rostedt
@ 2005-07-28 18:56                       ` Linus Torvalds
  0 siblings, 0 replies; 23+ messages in thread
From: Linus Torvalds @ 2005-07-28 18:56 UTC (permalink / raw)
  To: Steven Rostedt
  Cc: Maciej W. Rozycki, Mitchell Blank Jr, Nick Piggin, Ingo Molnar,
	Andrew Morton, LKML, Daniel Walker



On Thu, 28 Jul 2005, Steven Rostedt wrote:
> 
> OK, I guess when I get some time, I'll start testing all the i386 bitop
> functions, comparing the asm with the gcc versions.  Now could someone
> explain to me what's wrong with testing hot cache code. Can one
> instruction retrieve from memory better than others?

There's a few issues:

 - trivially: code/data size. Being smaller automatically means faster if
   you're cold-cache. If you do cycle tweaking of something that is 
   possibly commonly in the L2 cache or further away, you migt as well
   consider one byte of code-space to be equivalent to one cycle (a L1 I$ 
   miss can easily take 50+ cycles - the L1 fill cost may be just a small 
   part of that, but the pipeline problem it causes can be deadly).

 - branch prediction: cold-cache is _different_ from hot-cache. Hot-cache
   predicts the stuff dynamically, cold-cache has different rules (and it
   is _usually_ "forward predicts not-taken, backwards predicts taken", 
   although you can add static hints if you want to on most architectures).

   So hot-cache may look very different indeed - the "normal" case might 
   be that you mispredict all the time because the static prediction is 
   wrong, but then a hot-cache benchmark will predict perfectly.

 - access patterns. This only matters if you look at algorithmic changes. 
   Hashes have atrocious locality, but on the other hand, if you know that 
   the access pattern is cold, a hash will often have a minimum number of 
   accesses. 

but no, you don't have "some instructions are better at reading from 
memory" for regular integer code (FP often has other issues, like reading 
directly from L2 without polluting L1, and then there are obviously 
prefetch hints).

Now, in the case of your "rep scas" conversion, the reason I applied it
was that it was obviously a clear win (rep scas is known bad, and has
register allocation issues too), so I'm _not_ claiming that the above
issues were true in that case. I just wanted to say that in general it's 
nice (but often quite hard) if you can give cold-cache numbers too (for 
example, using the cycle counter and being clever can actually give that).

		Linus


* Re: [PATCH] speed up on find_first_bit for i386 (let compiler do the work)
  2005-07-28 17:25                       ` Linus Torvalds
@ 2005-07-29 10:03                         ` David Woodhouse
  2005-07-29 14:41                           ` Maciej W. Rozycki
  2005-07-29 16:23                           ` Linus Torvalds
  2005-07-29 14:39                         ` Maciej W. Rozycki
  1 sibling, 2 replies; 23+ messages in thread
From: David Woodhouse @ 2005-07-29 10:03 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Steven Rostedt, Maciej W. Rozycki, Nick Piggin, Ingo Molnar,
	Andrew Morton, LKML, Daniel Walker

On Thu, 2005-07-28 at 10:25 -0700, Linus Torvalds wrote:
> Basic rule: inline assembly is _better_ than random compiler extensions. 
> It's better to have _one_ well-documented extension that is very generic 
> than it is to have a thousand specialized extensions.

Counterexample: FR-V and its __builtin_read8() et al. For FR-V you have
to issue a memory barrier before or after certain I/O instructions, but
in some circumstances you can omit them. The compiler knows this and can
omit the membar instructions as appropriate -- but doing the same
optimisations in inline assembly would be fairly much impossible.

Builtins can also allow the compiler more visibility into what's going
on and more opportunity to optimise. They can also set condition
registers, which you can't do from inline assembly -- if you want to
perform a test in inline asm, you have to put the result in a register
and then test the contents of that register. (You can't just branch from
the inline asm either, although we used to try).

Builtins are more portable and their implementation will improve to
match developments in the target CPU. Inline assembly, as we have seen,
remains the same for years while the technology moves on.

Although it's often the case that inline assembly _is_ better,
especially in code which is arch-specific in the first place, I wouldn't
necessarily assume that it's always the case.

-- 
dwmw2




* Re: [PATCH] speed up on find_first_bit for i386 (let compiler do the work)
@ 2005-07-29 14:37 linux
  2005-07-29 15:08 ` linux-os (Dick Johnson)
  0 siblings, 1 reply; 23+ messages in thread
From: linux @ 2005-07-29 14:37 UTC (permalink / raw)
  To: linux-kernel; +Cc: rostedt

> OK, I guess when I get some time, I'll start testing all the i386 bitop
> functions, comparing the asm with the gcc versions.  Now could someone
> explain to me what's wrong with testing hot cache code. Can one
> instruction retrieve from memory better than others?

To add one to Linus' list, note that all current AMD & Intel chips
record instruction boundaries in L1 cache, either predecoding on
L1 cache load, or marking the boundaries on first execution.

The P4 takes it to an extreme, but P3 and K7/K8 do it too.

The result is that there are additional instruction decode limits
that apply to cold-cache code.


* Re: [PATCH] speed up on find_first_bit for i386 (let compiler do the work)
  2005-07-28 17:25                       ` Linus Torvalds
  2005-07-29 10:03                         ` David Woodhouse
@ 2005-07-29 14:39                         ` Maciej W. Rozycki
  2005-07-29 16:29                           ` Linus Torvalds
  1 sibling, 1 reply; 23+ messages in thread
From: Maciej W. Rozycki @ 2005-07-29 14:39 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Steven Rostedt, Nick Piggin, Ingo Molnar, Andrew Morton, LKML,
	Daniel Walker

On Thu, 28 Jul 2005, Linus Torvalds wrote:

> There may be more upsides on other architectures (*cough*ia64*cough*) that 
> have strange scheduling issues and other complexities, but on x86 in 
> particular, the __builtin_xxx() functions tend to be a lot more pain than 
> they are worth. Not only do they have strange limitations (on selection of 

 They can be buggy, sure, just like any code.  If they indeed are, we may 
want to avoid them rather than requiring a GCC upgrade, of course.

> opcodes but also for compiler versions), but they aren't well documented, 
> and semantics aren't clear.

 Hmm, that's what's in the GCC info pages for the relevant functions 
(I've omitted the "l" and "ll" variants):

"-- Built-in Function: int __builtin_ffs (unsigned int x)
     Returns one plus the index of the least significant 1-bit of X, or
     if X is zero, returns zero.

 -- Built-in Function: int __builtin_clz (unsigned int x)
     Returns the number of leading 0-bits in X, starting at the most
     significant bit position.  If X is 0, the result is undefined.

 -- Built-in Function: int __builtin_ctz (unsigned int x)
     Returns the number of trailing 0-bits in X, starting at the least
     significant bit position.  If X is 0, the result is undefined."

If that's not enough, then what would be?  I'm serious -- if you find it 
inadequate, then perhaps it could be improved.

> In contrast, the gcc builtins probably match some standard that is not 
> only harder to find, but also has some _other_ definition for what happens 
> for the zero case, so the builtins automatically end up having problems 
> due to semantic mis-match between the CPU and the standard.

 GCC should know the semantics of underlying CPU instructions used and be 
able to optimize expressions for common cases, e.g. like:

	return x == 0 ? 32 : __builtin_ctz(x);

when the CPU provides a "ctz" operation that returns 32 for 0.

> Basic rule: inline assembly is _better_ than random compiler extensions. 
> It's better to have _one_ well-documented extension that is very generic 
> than it is to have a thousand specialized extensions.

 It depends on how many submodel-specific variants of inline assembly you 
need, how many reloads are required for constraints, possibly defeating 
the gain, etc.

 In this particular case "bsf" and "bsr" are notoriously slow for some 
i386 submodels, so using the generic O(log n) algorithm may result in 
better performance for them.  E.g. the execution time for "bsf" for the 
original i386 is 11 + 3n clock ticks (n refers to the resulting bit 
index), "bsr" for the i486 is 7 - 104 ticks (so again 3n), for Pentium -- 
7 - 72 ticks (2n, then).  It does not immediately mean they are worse, but 
they are slow enough in the pessimistic case that checking alternatives is 
not unreasonable.

  Maciej

^ permalink raw reply	[flat|nested] 23+ messages in thread

* Re: [PATCH] speed up on find_first_bit for i386 (let compiler do the work)
  2005-07-29 10:03                         ` David Woodhouse
@ 2005-07-29 14:41                           ` Maciej W. Rozycki
  2005-07-29 16:23                           ` Linus Torvalds
  1 sibling, 0 replies; 23+ messages in thread
From: Maciej W. Rozycki @ 2005-07-29 14:41 UTC (permalink / raw)
  To: David Woodhouse
  Cc: Linus Torvalds, Steven Rostedt, Nick Piggin, Ingo Molnar,
	Andrew Morton, LKML, Daniel Walker

On Fri, 29 Jul 2005, David Woodhouse wrote:

> Builtins are more portable and their implementation will improve to
> match developments in the target CPU. Inline assembly, as we have seen,
> remains the same for years while the technology moves on.
> 
> Although it's often the case that inline assembly _is_ better,
> especially in code which is arch-specific in the first place, I wouldn't
> necessarily assume that it's always the case.

 Well, if some inline assembly is found to be better, then perhaps it 
should be contributed (not necessarily as is, but as a concept) to GCC for 
improvement.

  Maciej

^ permalink raw reply	[flat|nested] 23+ messages in thread

* Re: [PATCH] speed up on find_first_bit for i386 (let compiler do the work)
  2005-07-29 14:37 linux
@ 2005-07-29 15:08 ` linux-os (Dick Johnson)
  0 siblings, 0 replies; 23+ messages in thread
From: linux-os (Dick Johnson) @ 2005-07-29 15:08 UTC (permalink / raw)
  To: linux; +Cc: linux-kernel, rostedt


On Fri, 29 Jul 2005 linux@horizon.com wrote:

>> OK, I guess when I get some time, I'll start testing all the i386 bitop
>> functions, comparing the asm with the gcc versions.  Now could someone
>> explain to me what's wrong with testing hot cache code. Can one
>> instruction retrieve from memory better than others?
>

Yes! Intel has more than 'load' and 'store' instructions. If
memory is in the cache, the following memory operations are
shown fastest to slowest...

 	movl	(%ebx), %eax		# Index-register indirect. Note that
 					# ebx needs to be loaded so the overall
 					# access might be slower. Also some
 					# index registers are faster on
 					# some CPUs (486-> eax is fastest)
 	movl	(mem), %eax		# Direct from memory into register
 	movl	0x04(%ebx), %eax	# Index-register plus displacement
 	movl	(%esi, %ebx), %eax	# Two register indirect
 	movl	0x04(%esi, %ebx), %eax	# Two register plus displacement

When using 'movl (mem), %eax', "mem" is a 32-bit word that is fetched
from the instruction stream while 'movl (%ebx), %eax' is only 2 bytes.
Therefore, if an index register can remain loaded with the correct offset
or manipulated with 'lea', then single-register indirect memory
access is fastest on current ix86 processors.

> To add one to Linus' list, note that all current AMD & Intel chips
> record instruction boundaries in L1 cache, either predecoding on
> L1 cache load, or marking the boundaries on first execution.
>
> The P4 takes it to an extreme, but P3 and K7/K8 do it too.
>
> The result is that there are additional instruction decode limits
> that apply to cold-cache code.
> -
> To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
> the body of a message to majordomo@vger.kernel.org
> More majordomo info at  http://vger.kernel.org/majordomo-info.html
> Please read the FAQ at  http://www.tux.org/lkml/
>

Cheers,
Dick Johnson
Penguin : Linux version 2.6.12 on an i686 machine (5537.79 BogoMips).
Warning : 98.36% of all statistics are fiction.

^ permalink raw reply	[flat|nested] 23+ messages in thread

* Re: [PATCH] speed up on find_first_bit for i386 (let compiler do the work)
  2005-07-28 17:17                     ` Linus Torvalds
@ 2005-07-29 15:09                       ` Maciej W. Rozycki
  0 siblings, 0 replies; 23+ messages in thread
From: Maciej W. Rozycki @ 2005-07-29 15:09 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Steven Rostedt, Nick Piggin, Ingo Molnar, Andrew Morton, LKML,
	Daniel Walker

On Thu, 28 Jul 2005, Linus Torvalds wrote:

> >  Since you're considering GCC-generated code for ffs(), ffz() and friends, 
> > how about trying __builtin_ffs(), __builtin_clz() and __builtin_ctz() as 
> > appropriate?
> 
> Please don't. Try again in three years when everybody has them.

 Well, __builtin_ffs() has been there since at least gcc 2.95.  The two 
others are quite recent, indeed -- apparently only since GCC 3.4.  They 
may still be considered to be used conditionally if there is justified 
benefit.

  Maciej

^ permalink raw reply	[flat|nested] 23+ messages in thread

* Re: [PATCH] speed up on find_first_bit for i386 (let compiler do the work)
  2005-07-29 10:03                         ` David Woodhouse
  2005-07-29 14:41                           ` Maciej W. Rozycki
@ 2005-07-29 16:23                           ` Linus Torvalds
  1 sibling, 0 replies; 23+ messages in thread
From: Linus Torvalds @ 2005-07-29 16:23 UTC (permalink / raw)
  To: David Woodhouse
  Cc: Steven Rostedt, Maciej W. Rozycki, Nick Piggin, Ingo Molnar,
	Andrew Morton, LKML, Daniel Walker



On Fri, 29 Jul 2005, David Woodhouse wrote:
>
> On Thu, 2005-07-28 at 10:25 -0700, Linus Torvalds wrote:
> > Basic rule: inline assembly is _better_ than random compiler extensions. 
> > It's better to have _one_ well-documented extension that is very generic 
> > than it is to have a thousand specialized extensions.
> 
> Counterexample: FR-V and its __builtin_read8() et al.

There are arguably always counter-examples, but your arguments really are 
pretty theoretical.

Very seldom do compiler extensions end up being (a) timely enough and 
(b) semantically close enough to be really useful.

> Builtins can also allow the compiler more visibility into what's going
> on and more opportunity to optimise.

Absolutely. In theory. In practice, not so much. All the opportunity to 
optimize often ends up being lost in semantic clashes, or just because 
people can't use the extension because it hasn't been there since day one.

The fact is, inline asms are pretty rare even when we are talking about
every single possible assembly combination. They are even less common when
we're talking about just _one_ specific case of them (like something like
__builtin_ffs()).

What does this mean? It has two results: (a) instruction-level scheduling 
and register allocation just isn't _that_ important, and the generic "asm" 
register scheduling is really plenty good enough. The fact that in theory 
you might get better results if the compiler knew exactly what was going 
on is just not relevant: in practice it's simply not _true_. The other 
result is: (b) the compiler people don't end up seeing something like the 
esoteric builtins as a primary thing, so it's not like they'd be tweaking 
and regression-testing everything _anyway_.

So I argue very strongly that __builtin_xxx() is _wrong_, unless you have 
very very strong reasons for it:

 - truly generic and _very_ important stuff: __builtin_memcpy() is
   actually very much worth it, since it's all over, and it's so generic 
   that the compiler has a lot of choice in how to do it.

 - stuff where the architecture (or the compiler) -really- sucks with
   inline asms, and has serious problems, and the thing is really 
   important. Your FR-V example _might_ fall into this category (or it 
   might not), and ia64 has the problem with instruction packing and
   scheduling and so __builtin's have a bigger advantage.

Basically, on most normal architectures, there's seldom any reason at
_all_ to use builtins except for things like memcpy. On x86, I think the
counter-example might be if you want to schedule MMX code from C - which
is a special case because it doesn't follow my "rule (a)" above. But we 
don't do that in the kernel, really, or we just schedule it out-of-line.

			Linus

^ permalink raw reply	[flat|nested] 23+ messages in thread

* Re: [PATCH] speed up on find_first_bit for i386 (let compiler do the work)
  2005-07-29 14:39                         ` Maciej W. Rozycki
@ 2005-07-29 16:29                           ` Linus Torvalds
  2005-07-29 17:14                             ` Maciej W. Rozycki
  0 siblings, 1 reply; 23+ messages in thread
From: Linus Torvalds @ 2005-07-29 16:29 UTC (permalink / raw)
  To: Maciej W. Rozycki
  Cc: Steven Rostedt, Nick Piggin, Ingo Molnar, Andrew Morton, LKML,
	Daniel Walker



On Fri, 29 Jul 2005, Maciej W. Rozycki wrote:
> 
>  Hmm, that's what's in the GCC info pages for the relevant functions 
> (I've omitted the "l" and "ll" variants):
> 
> "-- Built-in Function: int __builtin_ffs (unsigned int x)
>      Returns one plus the index of the least significant 1-bit of X, or
>      if X is zero, returns zero.

This, for example, clashes with the x86 semantics.

If X is zero, the bsfl instruction will set the ZF flag, and the result is 
undefined (on many, but not all, CPU's it will either be zero _or_ 
unmodified).

We don't care, since we actually test the input for being zero separately
_anyway_, but my point is that if the builtin is badly done (and I
wouldn't be in the least surprised if it was), then it's going to do a
totally unnecessary conditional jump or cmov.

See? __builtin's can generate _worse_ code, exactly because they try to 
have portable semantics that may not even matter.

In contrast, just doing it by hand allows us to avoid all that crap.

Doing it by hand as inline assembly also allows us to do dynamic 
optimizations like instruction rewriting, so inline assembly is a _lot_ 
more powerful than builtins can reasonably ever be.

> If that's not enough, then what would be?  I'm serious -- if you find it 
> inadequate, then perhaps it could be improved.

It's inadequate because IT IS POINTLESS.

The builtin buys you absolutely _nothing_, and the inline asm is simpler, 
potentially faster, and works with every single version of gcc. 

USING THE BUILTIN IS A PESSIMISATION!

It has absolutely _zero_ upsides, and I've named three _major_ downsides.

It has another downside too: it's extra complexity and potential for bugs 
in the compiler. And if you tell me gcc people never have bugs, I will 
laugh in your general direction.

		Linus

^ permalink raw reply	[flat|nested] 23+ messages in thread

* Re: [PATCH] speed up on find_first_bit for i386 (let compiler do the work)
  2005-07-29 16:29                           ` Linus Torvalds
@ 2005-07-29 17:14                             ` Maciej W. Rozycki
  0 siblings, 0 replies; 23+ messages in thread
From: Maciej W. Rozycki @ 2005-07-29 17:14 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Steven Rostedt, Nick Piggin, Ingo Molnar, Andrew Morton, LKML,
	Daniel Walker

On Fri, 29 Jul 2005, Linus Torvalds wrote:

> It has another downside too: it's extra complexity and potential for bugs 
> in the compiler. And if you tell me gcc people never have bugs, I will 
> laugh in your general direction.

 You mean these that have been sitting in their Bugzilla for some three 
years with no resolution and only occasional scratching of heads? ;-)

  Maciej

^ permalink raw reply	[flat|nested] 23+ messages in thread

* Re: [PATCH] speed up on find_first_bit for i386 (let compiler do the work)
@ 2005-07-31 16:33 Richard Kennedy
  0 siblings, 0 replies; 23+ messages in thread
From: Richard Kennedy @ 2005-07-31 16:33 UTC (permalink / raw)
  To: linux-kernel

Hi,
FWIW the following routine is consistently slightly faster using
Steven's test harness , with a big win when no bit set.

static inline int new_find_first_bit(const unsigned long *b, unsigned
size)
{
	int x = 0;
	do {
		unsigned long v = *b++;
	  	if (v)
			return __ffs(v) + x;
		if (x >= size)
			break;
		x += 32;
	} while (1);
	return x;
}

Tested on P III M 933MHz / gcc 4.0.1

clock speed = 00000000:17c56980 398813568 ticks per second

no bit set
ffb=320  my=320 new=320
generic ffb: 00000000:02fd6660
time: 0.125776182us
my ffb: 00000000:03c314e9
time: 0.158260714us
new ffb : 00000000:02d9190b
time: 0.119810758us

last bit set
ffb=319  my=319 new=319 
generic ffb: 00000000:04e5900c
time: 0.205994717us
my ffb: 00000000:0327475d
time: 0.132658024us
new ffb: 00000000:02c86938
time: 0.117068655us

middle bit set
ffb=159  my=159 new=159
generic ffb: 00000000:03c2bc56
time: 0.158203865us
my ffb: 00000000:01356b8b
time: 0.050846204us
new ffb: 00000000:0115f133
time: 0.045673521us

first bit set
ffb=0  my=0 new=0 
generic ffb: 00000000:02d07460
time: 0.118390436us
my ffb: 00000000:005d3079
time: 0.015313564us
new ffb: 00000000:005cca07
time: 0.015247804us

Cheers
Richard
Not subscribed please CC -- thanks.



^ permalink raw reply	[flat|nested] 23+ messages in thread

* Re: [PATCH] speed up on find_first_bit for i386 (let compiler do the work)
@ 2005-08-01  2:00 linux
  0 siblings, 0 replies; 23+ messages in thread
From: linux @ 2005-08-01  2:00 UTC (permalink / raw)
  To: richard; +Cc: linux-kernel

> static inline int new_find_first_bit(const unsigned long *b, unsigned size)
> {
> 	int x = 0;
> 	do {
> 		unsigned long v = *b++;
> 		if (v)
> 			return __ffs(v) + x;
> 		if (x >= size)
> 			break;
> 		x += 32;
> 	} while (1);
> 	return x;
> }

Wait a minute... suppose that size == 32 and the bitmap is one word of all
zeros.  Dynamic execution will overflow the buffer:

 	int x = 0;
 		unsigned long v = *b++;	/* Zero */

 		if (v)			/* False, v == 0 */
 		if (x >= size)		/* False, 0 < 32 */
 		x += 32;
 	} while (1);
 		unsigned long v = *b++;	/* Buffer overflow */
 		if (v)			/* Random value, suppose non-zero */
			return __ffs(v) + x;	/* >= 32 */

That should be:
static inline int new_find_first_bit(const unsigned long *b, unsigned size)
{
	int x = 0;
 	do {
 		unsigned long v = *b++;
 		if (v)
 			return __ffs(v) + x;
	} while ((x += 32) < size);
	return size;
}

Note that we assume that the trailing long is padded with zeros.

In truth, it should probably be either

static inline unsigned new_find_first_bit(u32 const *b, unsigned size)
{
	int x = 0;
 	do {
 		u32 v = *b++;
 		if (v)
 			return __ffs(v) + x;
	} while ((x += 32) < size);
	return size;
}

or

static inline unsigned
new_find_first_bit(unsigned long const *b, unsigned size)
{
	unsigned x = 0;
 	do {
 		unsigned long v = *b++;
 		if (v)
 			return __ffs(v) + x;
	} while ((x += CHAR_BIT * sizeof *b) < size);
	return size;
}

Do we actually store bitmaps on 64-bit machines with 32 significant bits
per ulong?

^ permalink raw reply	[flat|nested] 23+ messages in thread

end of thread, other threads:[~2005-08-01  2:02 UTC | newest]

Thread overview: 23+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2005-08-01  2:00 [PATCH] speed up on find_first_bit for i386 (let compiler do the work) linux
  -- strict thread matches above, loose matches on Subject: below --
2005-07-31 16:33 Richard Kennedy
2005-07-29 14:37 linux
2005-07-29 15:08 ` linux-os (Dick Johnson)
2005-07-27 14:13 [RFC][PATCH] Make MAX_RT_PRIO and MAX_USER_RT_PRIO configurable Steven Rostedt
2005-07-28  1:00 ` Daniel Walker
2005-07-28  1:25   ` Steven Rostedt
2005-07-28  3:06     ` Steven Rostedt
2005-07-28  3:32       ` Steven Rostedt
2005-07-28  3:45         ` Steven Rostedt
2005-07-28  3:51           ` Nick Piggin
2005-07-28 11:43             ` [PATCH] speed up on find_first_bit for i386 (let compiler do the work) Steven Rostedt
2005-07-28 12:45               ` Steven Rostedt
2005-07-28 15:31                 ` Linus Torvalds
2005-07-28 15:30               ` Linus Torvalds
2005-07-28 15:47                 ` Steven Rostedt
2005-07-28 16:34                   ` Maciej W. Rozycki
2005-07-28 16:57                     ` Steven Rostedt
2005-07-28 17:25                       ` Linus Torvalds
2005-07-29 10:03                         ` David Woodhouse
2005-07-29 14:41                           ` Maciej W. Rozycki
2005-07-29 16:23                           ` Linus Torvalds
2005-07-29 14:39                         ` Maciej W. Rozycki
2005-07-29 16:29                           ` Linus Torvalds
2005-07-29 17:14                             ` Maciej W. Rozycki
2005-07-28 17:17                     ` Linus Torvalds
2005-07-29 15:09                       ` Maciej W. Rozycki
2005-07-28 18:25                     ` Steven Rostedt
2005-07-28 18:56                       ` Linus Torvalds
2005-07-28 17:52               ` Mitchell Blank Jr

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox