[patch 47/47] hweight() speedup

public inbox for linux-kernel@vger.kernel.org
 help / color / mirror / Atom feed

From: Akinobu Mita <mita@miraclelinux.com>
To: linux-kernel@vger.kernel.org
Cc: akpm@osdl.org, linux@horizon.com,
	Balbir Singh <bsingharora@gmail.com>,
	Grant Grundler <iod00d@hp.com>, Gabriel Paubert <paubert@iram.es>
Subject: [patch 47/47] hweight() speedup
Date: Tue, 14 Feb 2006 14:04:38 +0900	[thread overview]
Message-ID: <20060214050450.536209000@localhost.localdomain> (raw)
In-Reply-To: 20060214050351.252615000@localhost.localdomain

[-- Attachment #1: hweight-speedup.patch --]
[-- Type: text/plain, Size: 5731 bytes --]

linux@horizon.com wrote:
> This is an extremely well-known technique.  You can see a similar version
> that uses a multiply for the last few steps at
> http://graphics.stanford.edu/~seander/bithacks.html#CountBitsSetParallel
> whch refers to 
> "Software Optimization Guide for AMD Athlon 64 and Opteron Processors"
> http://www.amd.com/us-en/assets/content_type/white_papers_and_tech_docs/25112.PDF
> 
> It's section 8.6, "Efficient Implementation of Population-Count Function
> in 32-bit Mode", pages 179-180.
> 
> It uses the name that I am more familiar with, "popcunt" (population count),
> although "Hamming weight" also makes sense.
> 
> Anyway, the proof of correctness proceeds as follows:
> 
> 	b = a - ((a >> 1) & 0x55555555);
> 	c = (b & 0x33333333) + ((b >> 2) & 0x33333333);
> 	d = (c + (c >> 4)) & 0x0f0f0f0f;
> #if SLOW_MULTIPLY
> 	e = d + (d >> 8)
> 	f = e + (e >> 16);
> 	return f & 63;
> #else
> 	/* Useful if multiply takes at most 4 cycles */
> 	return (d * 0x01010101) >> 24;
> #endif
> 
> The input value a can be thought of as 32 1-bit fields each holding
> their own hamming weight.  Now look at it as 16 2-bit fields.
> Each 2-bit field a1..a0 has the value 2*a1 + a0.  This can be converted
> into the hamming weight of the 2-bit field a1+a0 by subtracting a1.
> 
> That's what the (a >> 1) & mask subtraction does.  Since there can be no
> borrows, you can just do it all at once.
> 
> Enumerating the 4 possible cases:
> 
> 0b00 = 0  ->  0 - 0 = 0
> 0b01 = 1  ->  1 - 0 = 1
> 0b10 = 2  ->  2 - 1 = 1
> 0b11 = 3  ->  3 - 1 = 2
> 
> 
> The next step consists of breaking up b (made of 16 2-bir fields) into
> even and odd halves and adding them into 4-bit fields.  Since the largest
> possible sum is 2+2 = 4, which will not fit into a 4-bit field, the 2-bit
                           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
                          "which will not fit into a 2-bit field"

> fields have to be masked before they are added.
> 
> 
> After this point, the masking can be delayed.  Each 4-bit field holds
> a population count from 0..4, taking at most 3 bits.  These numbers can
> be added without overflowing a 4-bit field, so we can compute
> c + (c >> 4), and only then mask off the unwanted bits.
> 
> 
> This produces d, a number of 4 8-bit fields, each in the range 0..8.
> >From this point, we can shift and add d multiple times without overflowing
> an 8-bit field, and only do a final mask at the end.
> 
> The number to mask with has to be at least 63 (so that 32 on't be truncated),
> but can also be 128 or 255.  The x86 has a special encoding for signed
> immediate byte values -128..127, so the value of 255 is slower.  On
> other processors, a special "sign extend byte" instruction might be faster.
> 
> 
> On a processor with fast integer multiplies (Athlon but not P4), you can
> reduce the final few serially dependent instructions to a single integer
> multiply.  Consider d to be 3 8-bit values d3, d2, d1 and d0, each in the
> range 0..8.  The multiply forms the partial products:
> 
>            d3 d2 d1 d0
>         d3 d2 d1 d0
>      d3 d2 d1 d0
> + d3 d2 d1 d0
> ----------------------
>            e3 e2 e1 e0
> 
> Where e3 = d3 + d2 + d1 + d0.   e2, e1 and e0 obviously cannot generate
> any carries.

Signed-off-by: Akinobu Mita <mita@miraclelinux.com>


 lib/hweight.c |   29 ++++++++++++++---------------
 1 files changed, 14 insertions(+), 15 deletions(-)

Index: 2.6-rc/lib/hweight.c
===================================================================
--- 2.6-rc.orig/lib/hweight.c
+++ 2.6-rc/lib/hweight.c
@@ -10,28 +10,28 @@
 
 unsigned int hweight32(unsigned int w)
 {
-	unsigned int res = (w & 0x55555555) + ((w >> 1) & 0x55555555);
+	unsigned int res = w - ((w >> 1) & 0x55555555);
 	res = (res & 0x33333333) + ((res >> 2) & 0x33333333);
-	res = (res & 0x0F0F0F0F) + ((res >> 4) & 0x0F0F0F0F);
-	res = (res & 0x00FF00FF) + ((res >> 8) & 0x00FF00FF);
-	return (res & 0x0000FFFF) + ((res >> 16) & 0x0000FFFF);
+	res = (res + (res >> 4)) & 0x0F0F0F0F;
+	res = res + (res >> 8);
+	return (res + (res >> 16)) & 0x000000FF;
 }
 EXPORT_SYMBOL(hweight32);
 
 unsigned int hweight16(unsigned int w)
 {
-	unsigned int res = (w & 0x5555) + ((w >> 1) & 0x5555);
+	unsigned int res = w - ((w >> 1) & 0x5555);
 	res = (res & 0x3333) + ((res >> 2) & 0x3333);
-	res = (res & 0x0F0F) + ((res >> 4) & 0x0F0F);
-	return (res & 0x00FF) + ((res >> 8) & 0x00FF);
+	res = (res + (res >> 4)) & 0x0F0F;
+	return (res + (res >> 8)) & 0x00FF;
 }
 EXPORT_SYMBOL(hweight16);
 
 unsigned int hweight8(unsigned int w)
 {
-	unsigned int res = (w & 0x55) + ((w >> 1) & 0x55);
+	unsigned int res = w - ((w >> 1) & 0x55);
 	res = (res & 0x33) + ((res >> 2) & 0x33);
-	return (res & 0x0F) + ((res >> 4) & 0x0F);
+	return (res + (res >> 4)) & 0x0F;
 }
 EXPORT_SYMBOL(hweight8);
 
@@ -40,13 +40,12 @@ unsigned long hweight64(__u64 w)
 #if BITS_PER_LONG == 32
 	return hweight32((unsigned int)(w >> 32)) + hweight32((unsigned int)w);
 #elif BITS_PER_LONG == 64
-	u64 res;
-	res = (w & 0x5555555555555555ul) + ((w >> 1) & 0x5555555555555555ul);
+	__u64 res = w - ((w >> 1) & 0x5555555555555555ul);
 	res = (res & 0x3333333333333333ul) + ((res >> 2) & 0x3333333333333333ul);
-	res = (res & 0x0F0F0F0F0F0F0F0Ful) + ((res >> 4) & 0x0F0F0F0F0F0F0F0Ful);
-	res = (res & 0x00FF00FF00FF00FFul) + ((res >> 8) & 0x00FF00FF00FF00FFul);
-	res = (res & 0x0000FFFF0000FFFFul) + ((res >> 16) & 0x0000FFFF0000FFFFul);
-	return (res & 0x00000000FFFFFFFFul) + ((res >> 32) & 0x00000000FFFFFFFFul);
+	res = (res + (res >> 4)) & 0x0F0F0F0F0F0F0F0Ful;
+	res = res + (res >> 8);
+	res = res + (res >> 16);
+	return (res + (res >> 32)) & 0x00000000000000FFul;
 #else
 #error BITS_PER_LONG not defined
 #endif

--

     prev parent reply	other threads:[~2006-02-14  5:05 UTC|newest]

Thread overview: 48+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2006-02-14  5:03 [patch 00/47] generic bitops Akinobu Mita
2006-02-14  5:03 ` [patch 01/47] alpha: use config options instead of __alpha_fix__ and __alpha_cix__ Akinobu Mita
2006-02-14  5:03 ` [patch 02/47] ia64: use cpu_set() instead of __set_bit() Akinobu Mita
2006-02-14  5:03 ` [patch 03/47] parisc: add ()-pair in __ffz() macro Akinobu Mita
2006-02-14  5:03 ` [patch 04/47] cris: remove unnecessary local_irq_restore() Akinobu Mita
2006-02-14  5:03 ` [patch 05/47] use non atomic operations for minix_*_bit() and ext2_*_bit() Akinobu Mita
2006-02-14  5:03 ` [patch 06/47] generic {,test_and_}{set,clear,change}_bit() Akinobu Mita
2006-02-14  5:03 ` [patch 07/47] generic __{,test_and_}{set,clear,change}_bit() and test_bit() Akinobu Mita
2006-02-14  5:03 ` [patch 08/47] generic __ffs() Akinobu Mita
2006-02-14  5:04 ` [patch 09/47] generic ffz() Akinobu Mita
2006-02-14  5:04 ` [patch 10/47] generic fls() Akinobu Mita
2006-02-14  5:04 ` [patch 11/47] generic fls64() Akinobu Mita
2006-02-14  5:04 ` [patch 12/47] generic find_{next,first}{,_zero}_bit() Akinobu Mita
2006-02-14  5:04 ` [patch 13/47] generic sched_find_first_bit() Akinobu Mita
2006-02-14  5:04 ` [patch 14/47] generic ffs() Akinobu Mita
2006-02-14  5:04 ` [patch 15/47] generic hweight{64,32,16,8}() Akinobu Mita
2006-02-14  5:04 ` [patch 16/47] generic ext2_{set,clear,test,find_first_zero,find_next_zero}_bit() Akinobu Mita
2006-02-14  5:04 ` [patch 17/47] generic ext2_{set,clear}_bit_atomic() Akinobu Mita
2006-02-14  5:04 ` [patch 18/47] generic minix_{test,set,test_and_clear,test,find_first_zero}_bit() Akinobu Mita
2006-02-14  5:04 ` [patch 19/47] alpha: use generic bitops Akinobu Mita
2006-02-14  5:04 ` [patch 20/47] arm: " Akinobu Mita
2006-02-14  5:04 ` [patch 21/47] arm26: " Akinobu Mita
2006-02-14  5:04 ` [patch 22/47] cris: " Akinobu Mita
2006-02-14  5:04 ` [patch 23/47] frv: " Akinobu Mita
2006-02-14  5:04 ` [patch 24/47] h8300: " Akinobu Mita
2006-02-14  5:04 ` [patch 25/47] i386: " Akinobu Mita
2006-02-14  5:04 ` [patch 26/47] ia64: " Akinobu Mita
2006-02-14  5:04 ` [patch 27/47] m32r: " Akinobu Mita
2006-02-14  5:04 ` [patch 28/47] m68k: " Akinobu Mita
2006-02-14  5:04 ` [patch 29/47] m68knommu: " Akinobu Mita
2006-02-14  5:04 ` [patch 30/47] mips: " Akinobu Mita
2006-02-14  5:04 ` [patch 31/47] parisc: " Akinobu Mita
2006-02-14  5:04 ` [patch 32/47] powerpc: " Akinobu Mita
2006-02-14  5:04 ` [patch 33/47] s390: " Akinobu Mita
2006-02-14  5:04 ` [patch 34/47] sh: " Akinobu Mita
2006-02-14  5:04 ` [patch 35/47] sh64: " Akinobu Mita
2006-02-14  5:04 ` [patch 36/47] sparc: " Akinobu Mita
2006-02-14  5:04 ` [patch 37/47] sparc64: " Akinobu Mita
2006-02-14  5:04 ` [patch 38/47] v850: " Akinobu Mita
2006-02-14  5:04 ` [patch 39/47] x86_64: " Akinobu Mita
2006-02-14  5:04 ` [patch 40/47] xtensa: " Akinobu Mita
2006-02-14  5:04 ` [patch 41/47] update include/asm-generic/bitops.h Akinobu Mita
2006-02-14  5:04 ` [patch 42/47] make thread_info.flags an unsigned long Akinobu Mita
2006-02-14  5:04 ` [patch 43/47] ia64: make partial_page.bitmap " Akinobu Mita
2006-02-14  5:04 ` [patch 44/47] ntfs: remove generic_ffs() Akinobu Mita
2006-02-14  5:04 ` [patch 45/47] remove unused generic bitops in include/linux/bitops.h Akinobu Mita
2006-02-14  5:04 ` [patch 46/47] hweight() related cleanup Akinobu Mita
2006-02-14  5:04 ` Akinobu Mita [this message]

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=20060214050450.536209000@localhost.localdomain \
    --to=mita@miraclelinux.com \
    --cc=akpm@osdl.org \
    --cc=bsingharora@gmail.com \
    --cc=iod00d@hp.com \
    --cc=linux-kernel@vger.kernel.org \
    --cc=linux@horizon.com \
    --cc=paubert@iram.es \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link

Be sure your reply has a Subject: header at the top and a blank line before the message body.

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox