public inbox for linux-kernel@vger.kernel.org
 help / color / mirror / Atom feed
From: Daniel Phillips <dphillips@sistina.com>
To: Linus Torvalds <torvalds@transmeta.com>
Cc: <linux-kernel@vger.kernel.org>
Subject: [RFC][PATCH] Faster generic_fls
Date: Wed, 30 Apr 2003 04:46:23 +0200	[thread overview]
Message-ID: <200304300446.24330.dphillips@sistina.com> (raw)

Hi Linus,

Here's a faster implementation of generic_fls, that I discovered accidently,
by not noticing 2.5 already had a generic_fls, and so I rolled my own.  Like
the incumbent, it's O log2(bits), but there's not a lot of resemblance beyond
that.  I think the new algorithm is inherently more parallelizable than the
traditional approach.  A processor that can speculatively evaluate both sides 
of a conditional would benefit even more than the PIII I tested on.

The algorithm works roughly as follows: to find the highest bit in a value of
a given size, test the higher half for any one bits, then recursively apply
the algorithm to one of the two halves, depending on the test.  Once down to 8
bits, enumerate all the cases:

   return n & 0xf0? (n & 0xc0? (n & 0x80? 8: 7): (n & 0x20? 6: 5)):
          n & 0x0f? (n & 0x0c? (n & 0x08? 4: 3): (n & 0x02? 2: 1)): 0;

The above expression can be considerably optimized by noting that once we get
down to just two bits, at least one of which is known to be a one bit, it's
faster to shift out the higher of the two bits and add it directly to the
result than to evaluate a conditional.

A sneaky optimization is possible for the lowest two bits: the four values
{0, 1, 2, 3} map directly onto three of the four wanted results {0, 1, 2, 2},
so a little bit bashing takes care of both the conditional mentioned above
and the test that would otherwise be needed for the zero case.  The resulting
optimized code is sufficiently obfuscated for an IOCC entry, but it's fast:

   return n & 0xf0?
       n & 0xc0? (n >> 7) + 7: (n >> 5) + 5:
       n & 0x0c? (n >> 3) + 3: n - ((n + 1) >> 2);

In short, to find the high bit of a 32 bit value, the algorithm enumerates
all 32 possibilities using a binary search.  Inlines clarify the recursion,
and as a fringe benefit, give us a handy set of 8, 16, 32 and 64 bit
function flavors, the shorter versions being a little faster if you can use
them.

The thing that makes it fast (I think) is that the expressions at the leaves
can be evaluated in parallel with the conditional tests - that is, it's
possible to compute the results before we know exactly which one is needed. 
Another thing that may contribute to the speed is that the algorithm is doing
relatively more reading than writing, compared to the current version.
Though I did not test it, I believe the speedup will carry over to assembly
implementations as well.

There are still some optimization possibilities remaining.  For example, in
some of the evaluation cases the zero case doesn't have to be evaluated, so
a little arithmetic can be saved.  But then the helper functions wouldn't be
useable as sub-functions in their own right any more, so I don't think the
small speedup is worth it for the decreased orthogonality.

The improvement on a PIII varies from about 1.43x with gcc -O2 to 2.08x at
-O3.  The benchmark runs 2**32 iterations, evaluating all 32 bit cases.
Roughly speaking, at O3 it takes about 10 cycles to do the job:

       Old       New   Empty loop  Actual old  Actual new   Speedup
O2:  111.267   94.129    54.014      57.253      40.115       1.43
O3:   95.875   53.018    13.200      82.675      39.818       2.08

The test program is here:

   http://people.nl.linux.org/~phillips/test_fls.c

The patch is against 2.5.68-bk9.

Regards,

Daniel

--- 2.5.68.clean/include/linux/bitops.h	2003-04-20 04:51:12.000000000 +0200
+++ 2.5.68/include/linux/bitops.h	2003-04-30 01:29:27.000000000 +0200
@@ -41,33 +41,35 @@
  * fls: find last bit set.
  */
 
-extern __inline__ int generic_fls(int x)
+/* Faster generic_fls */
+/* (c) 2002, D.Phillips and Sistina Software */
+/* Licensed under Version 2 of the GPL */
+
+static inline unsigned fls8(unsigned n)
+{
+	return n & 0xf0?
+	    n & 0xc0? (n >> 7) + 7: (n >> 5) + 5:
+	    n & 0x0c? (n >> 3) + 3: n - ((n + 1) >> 2);
+}
+
+static inline unsigned fls16(unsigned n)
+{
+	return n & 0xff00? fls8(n >> 8) + 8: fls8(n);
+}
+
+static inline unsigned fls32(unsigned n)
+{
+	return n & 0xffff0000? fls16(n >> 16) + 16: fls16(n);
+}
+
+static inline unsigned fls64(unsigned long long n) /* should be u64 */
 {
-	int r = 32;
+	return n & 0xffffffff00000000? fls32(n >> 32) + 32: fls32(n);
+}
 
-	if (!x)
-		return 0;
-	if (!(x & 0xffff0000u)) {
-		x <<= 16;
-		r -= 16;
-	}
-	if (!(x & 0xff000000u)) {
-		x <<= 8;
-		r -= 8;
-	}
-	if (!(x & 0xf0000000u)) {
-		x <<= 4;
-		r -= 4;
-	}
-	if (!(x & 0xc0000000u)) {
-		x <<= 2;
-		r -= 2;
-	}
-	if (!(x & 0x80000000u)) {
-		x <<= 1;
-		r -= 1;
-	}
-	return r;
+static inline int generic_fls(int n)
+{
+	return fls32(n);
 }
 
 extern __inline__ int get_bitmask_order(unsigned int count)


             reply	other threads:[~2003-04-30  2:28 UTC|newest]

Thread overview: 65+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2003-04-30  2:46 Daniel Phillips [this message]
2003-04-30  7:19 ` [RFC][PATCH] Faster generic_fls Willy Tarreau
2003-04-30  8:36   ` Aaron Lehmann
2003-04-30  8:43     ` Aaron Lehmann
2003-04-30  8:52     ` Aaron Lehmann
2003-04-30  8:51       ` P
2003-04-30 13:51   ` Daniel Phillips
2003-04-30 11:14 ` Falk Hueffner
2003-04-30 13:03   ` Daniel Phillips
2003-04-30 13:17     ` Falk Hueffner
2003-04-30 14:07       ` Daniel Phillips
2003-04-30 14:11   ` Linus Torvalds
2003-04-30 14:53     ` Falk Hueffner
2003-04-30 15:28       ` Linus Torvalds
2003-04-30 16:03         ` Falk Hueffner
2003-04-30 16:16           ` Linus Torvalds
2003-04-30 16:43             ` Falk Hueffner
2003-04-30 20:25             ` Alan Cox
2003-04-30 21:59               ` Falk Hueffner
2003-04-30 22:22                 ` Alan Cox
2003-04-30 23:41               ` Linus Torvalds
2003-04-30 22:47                 ` Alan Cox
2003-05-01  0:12                 ` Falk Hueffner
2003-05-01  1:07                   ` Linus Torvalds
2003-04-30 19:15     ` Daniel Phillips
2003-04-30 20:59       ` Willy Tarreau
2003-05-01  1:02         ` Daniel Phillips
2003-05-01  9:19           ` Willy Tarreau
2003-05-01 16:02             ` Linus Torvalds
2003-04-30 20:55 ` Andrew Morton
2003-05-01  5:03   ` hugang
2003-05-01  5:11     ` Andrew Morton
2003-05-01  5:33       ` hugang
2003-05-01  7:05         ` hugang
2003-05-01 13:52           ` Willy TARREAU
2003-05-01 14:14             ` Falk Hueffner
2003-05-01 14:26               ` Willy TARREAU
2003-05-01 14:53             ` hugang
2003-05-01 15:54               ` Thomas Schlichter
2003-05-02  0:33                 ` hugang
2003-05-01 17:16               ` Willy TARREAU
2003-05-01 23:27                 ` Thomas Schlichter
2003-05-02  0:10                   ` Willy Tarreau
2003-05-02  0:43                     ` Daniel Phillips
2003-05-02  0:54                       ` Andrew Morton
2003-05-02 12:21                         ` Daniel Phillips
2003-05-02  1:47                       ` Thomas Schlichter
2003-05-02 13:24                         ` Willy Tarreau
2003-05-02  0:13                   ` Daniel Phillips
2003-05-02  0:13             ` Lincoln Dale
2003-05-01  5:16   ` hugang
2003-05-01 10:22     ` Willy TARREAU
2003-05-01 11:17       ` hugang
2003-05-01 11:45         ` Willy TARREAU
     [not found] <87d6j34jad.fsf@student.uni-tuebingen.de.suse.lists.linux.kernel>
     [not found] ` <Pine.LNX.4.44.0304301801210.20283-100000@home.transmeta.com.suse.lists.linux.kernel>
2003-05-01  1:46   ` Andi Kleen
2003-05-01  4:40     ` Linus Torvalds
2003-05-01 12:00       ` David S. Miller
2003-05-02  5:14         ` Linus Torvalds
  -- strict thread matches above, loose matches on Subject: below --
2003-05-01 13:31 linux
2003-05-01 17:15 Chuck Ebbert
2003-05-02  8:50 Chuck Ebbert
2003-05-02  9:37 ` Peter Osterlund
2003-05-02 10:04 Chuck Ebbert
2003-05-02 22:55 ` David S. Miller
2003-05-03  2:21 Chuck Ebbert

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=200304300446.24330.dphillips@sistina.com \
    --to=dphillips@sistina.com \
    --cc=linux-kernel@vger.kernel.org \
    --cc=torvalds@transmeta.com \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line before the message body.
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox