From: Akinobu Mita <mita@miraclelinux.com>
To: linux-kernel@vger.kernel.org
Cc: akpm@osdl.org, linux@horizon.com,
Balbir Singh <bsingharora@gmail.com>,
Grant Grundler <iod00d@hp.com>, Gabriel Paubert <paubert@iram.es>
Subject: [patch 47/47] hweight() speedup
Date: Tue, 14 Feb 2006 14:04:38 +0900 [thread overview]
Message-ID: <20060214050450.536209000@localhost.localdomain> (raw)
In-Reply-To: 20060214050351.252615000@localhost.localdomain
[-- Attachment #1: hweight-speedup.patch --]
[-- Type: text/plain, Size: 5731 bytes --]
linux@horizon.com wrote:
> This is an extremely well-known technique. You can see a similar version
> that uses a multiply for the last few steps at
> http://graphics.stanford.edu/~seander/bithacks.html#CountBitsSetParallel
> whch refers to
> "Software Optimization Guide for AMD Athlon 64 and Opteron Processors"
> http://www.amd.com/us-en/assets/content_type/white_papers_and_tech_docs/25112.PDF
>
> It's section 8.6, "Efficient Implementation of Population-Count Function
> in 32-bit Mode", pages 179-180.
>
> It uses the name that I am more familiar with, "popcunt" (population count),
> although "Hamming weight" also makes sense.
>
> Anyway, the proof of correctness proceeds as follows:
>
> b = a - ((a >> 1) & 0x55555555);
> c = (b & 0x33333333) + ((b >> 2) & 0x33333333);
> d = (c + (c >> 4)) & 0x0f0f0f0f;
> #if SLOW_MULTIPLY
> e = d + (d >> 8)
> f = e + (e >> 16);
> return f & 63;
> #else
> /* Useful if multiply takes at most 4 cycles */
> return (d * 0x01010101) >> 24;
> #endif
>
> The input value a can be thought of as 32 1-bit fields each holding
> their own hamming weight. Now look at it as 16 2-bit fields.
> Each 2-bit field a1..a0 has the value 2*a1 + a0. This can be converted
> into the hamming weight of the 2-bit field a1+a0 by subtracting a1.
>
> That's what the (a >> 1) & mask subtraction does. Since there can be no
> borrows, you can just do it all at once.
>
> Enumerating the 4 possible cases:
>
> 0b00 = 0 -> 0 - 0 = 0
> 0b01 = 1 -> 1 - 0 = 1
> 0b10 = 2 -> 2 - 1 = 1
> 0b11 = 3 -> 3 - 1 = 2
>
>
> The next step consists of breaking up b (made of 16 2-bir fields) into
> even and odd halves and adding them into 4-bit fields. Since the largest
> possible sum is 2+2 = 4, which will not fit into a 4-bit field, the 2-bit
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
"which will not fit into a 2-bit field"
> fields have to be masked before they are added.
>
>
> After this point, the masking can be delayed. Each 4-bit field holds
> a population count from 0..4, taking at most 3 bits. These numbers can
> be added without overflowing a 4-bit field, so we can compute
> c + (c >> 4), and only then mask off the unwanted bits.
>
>
> This produces d, a number of 4 8-bit fields, each in the range 0..8.
> >From this point, we can shift and add d multiple times without overflowing
> an 8-bit field, and only do a final mask at the end.
>
> The number to mask with has to be at least 63 (so that 32 on't be truncated),
> but can also be 128 or 255. The x86 has a special encoding for signed
> immediate byte values -128..127, so the value of 255 is slower. On
> other processors, a special "sign extend byte" instruction might be faster.
>
>
> On a processor with fast integer multiplies (Athlon but not P4), you can
> reduce the final few serially dependent instructions to a single integer
> multiply. Consider d to be 3 8-bit values d3, d2, d1 and d0, each in the
> range 0..8. The multiply forms the partial products:
>
> d3 d2 d1 d0
> d3 d2 d1 d0
> d3 d2 d1 d0
> + d3 d2 d1 d0
> ----------------------
> e3 e2 e1 e0
>
> Where e3 = d3 + d2 + d1 + d0. e2, e1 and e0 obviously cannot generate
> any carries.
Signed-off-by: Akinobu Mita <mita@miraclelinux.com>
lib/hweight.c | 29 ++++++++++++++---------------
1 files changed, 14 insertions(+), 15 deletions(-)
Index: 2.6-rc/lib/hweight.c
===================================================================
--- 2.6-rc.orig/lib/hweight.c
+++ 2.6-rc/lib/hweight.c
@@ -10,28 +10,28 @@
unsigned int hweight32(unsigned int w)
{
- unsigned int res = (w & 0x55555555) + ((w >> 1) & 0x55555555);
+ unsigned int res = w - ((w >> 1) & 0x55555555);
res = (res & 0x33333333) + ((res >> 2) & 0x33333333);
- res = (res & 0x0F0F0F0F) + ((res >> 4) & 0x0F0F0F0F);
- res = (res & 0x00FF00FF) + ((res >> 8) & 0x00FF00FF);
- return (res & 0x0000FFFF) + ((res >> 16) & 0x0000FFFF);
+ res = (res + (res >> 4)) & 0x0F0F0F0F;
+ res = res + (res >> 8);
+ return (res + (res >> 16)) & 0x000000FF;
}
EXPORT_SYMBOL(hweight32);
unsigned int hweight16(unsigned int w)
{
- unsigned int res = (w & 0x5555) + ((w >> 1) & 0x5555);
+ unsigned int res = w - ((w >> 1) & 0x5555);
res = (res & 0x3333) + ((res >> 2) & 0x3333);
- res = (res & 0x0F0F) + ((res >> 4) & 0x0F0F);
- return (res & 0x00FF) + ((res >> 8) & 0x00FF);
+ res = (res + (res >> 4)) & 0x0F0F;
+ return (res + (res >> 8)) & 0x00FF;
}
EXPORT_SYMBOL(hweight16);
unsigned int hweight8(unsigned int w)
{
- unsigned int res = (w & 0x55) + ((w >> 1) & 0x55);
+ unsigned int res = w - ((w >> 1) & 0x55);
res = (res & 0x33) + ((res >> 2) & 0x33);
- return (res & 0x0F) + ((res >> 4) & 0x0F);
+ return (res + (res >> 4)) & 0x0F;
}
EXPORT_SYMBOL(hweight8);
@@ -40,13 +40,12 @@ unsigned long hweight64(__u64 w)
#if BITS_PER_LONG == 32
return hweight32((unsigned int)(w >> 32)) + hweight32((unsigned int)w);
#elif BITS_PER_LONG == 64
- u64 res;
- res = (w & 0x5555555555555555ul) + ((w >> 1) & 0x5555555555555555ul);
+ __u64 res = w - ((w >> 1) & 0x5555555555555555ul);
res = (res & 0x3333333333333333ul) + ((res >> 2) & 0x3333333333333333ul);
- res = (res & 0x0F0F0F0F0F0F0F0Ful) + ((res >> 4) & 0x0F0F0F0F0F0F0F0Ful);
- res = (res & 0x00FF00FF00FF00FFul) + ((res >> 8) & 0x00FF00FF00FF00FFul);
- res = (res & 0x0000FFFF0000FFFFul) + ((res >> 16) & 0x0000FFFF0000FFFFul);
- return (res & 0x00000000FFFFFFFFul) + ((res >> 32) & 0x00000000FFFFFFFFul);
+ res = (res + (res >> 4)) & 0x0F0F0F0F0F0F0F0Ful;
+ res = res + (res >> 8);
+ res = res + (res >> 16);
+ return (res + (res >> 32)) & 0x00000000000000FFul;
#else
#error BITS_PER_LONG not defined
#endif
--
prev parent reply other threads:[~2006-02-14 5:05 UTC|newest]
Thread overview: 48+ messages / expand[flat|nested] mbox.gz Atom feed top
2006-02-14 5:03 [patch 00/47] generic bitops Akinobu Mita
2006-02-14 5:03 ` [patch 01/47] alpha: use config options instead of __alpha_fix__ and __alpha_cix__ Akinobu Mita
2006-02-14 5:03 ` [patch 02/47] ia64: use cpu_set() instead of __set_bit() Akinobu Mita
2006-02-14 5:03 ` [patch 03/47] parisc: add ()-pair in __ffz() macro Akinobu Mita
2006-02-14 5:03 ` [patch 04/47] cris: remove unnecessary local_irq_restore() Akinobu Mita
2006-02-14 5:03 ` [patch 05/47] use non atomic operations for minix_*_bit() and ext2_*_bit() Akinobu Mita
2006-02-14 5:03 ` [patch 06/47] generic {,test_and_}{set,clear,change}_bit() Akinobu Mita
2006-02-14 5:03 ` [patch 07/47] generic __{,test_and_}{set,clear,change}_bit() and test_bit() Akinobu Mita
2006-02-14 5:03 ` [patch 08/47] generic __ffs() Akinobu Mita
2006-02-14 5:04 ` [patch 09/47] generic ffz() Akinobu Mita
2006-02-14 5:04 ` [patch 10/47] generic fls() Akinobu Mita
2006-02-14 5:04 ` [patch 11/47] generic fls64() Akinobu Mita
2006-02-14 5:04 ` [patch 12/47] generic find_{next,first}{,_zero}_bit() Akinobu Mita
2006-02-14 5:04 ` [patch 13/47] generic sched_find_first_bit() Akinobu Mita
2006-02-14 5:04 ` [patch 14/47] generic ffs() Akinobu Mita
2006-02-14 5:04 ` [patch 15/47] generic hweight{64,32,16,8}() Akinobu Mita
2006-02-14 5:04 ` [patch 16/47] generic ext2_{set,clear,test,find_first_zero,find_next_zero}_bit() Akinobu Mita
2006-02-14 5:04 ` [patch 17/47] generic ext2_{set,clear}_bit_atomic() Akinobu Mita
2006-02-14 5:04 ` [patch 18/47] generic minix_{test,set,test_and_clear,test,find_first_zero}_bit() Akinobu Mita
2006-02-14 5:04 ` [patch 19/47] alpha: use generic bitops Akinobu Mita
2006-02-14 5:04 ` [patch 20/47] arm: " Akinobu Mita
2006-02-14 5:04 ` [patch 21/47] arm26: " Akinobu Mita
2006-02-14 5:04 ` [patch 22/47] cris: " Akinobu Mita
2006-02-14 5:04 ` [patch 23/47] frv: " Akinobu Mita
2006-02-14 5:04 ` [patch 24/47] h8300: " Akinobu Mita
2006-02-14 5:04 ` [patch 25/47] i386: " Akinobu Mita
2006-02-14 5:04 ` [patch 26/47] ia64: " Akinobu Mita
2006-02-14 5:04 ` [patch 27/47] m32r: " Akinobu Mita
2006-02-14 5:04 ` [patch 28/47] m68k: " Akinobu Mita
2006-02-14 5:04 ` [patch 29/47] m68knommu: " Akinobu Mita
2006-02-14 5:04 ` [patch 30/47] mips: " Akinobu Mita
2006-02-14 5:04 ` [patch 31/47] parisc: " Akinobu Mita
2006-02-14 5:04 ` [patch 32/47] powerpc: " Akinobu Mita
2006-02-14 5:04 ` [patch 33/47] s390: " Akinobu Mita
2006-02-14 5:04 ` [patch 34/47] sh: " Akinobu Mita
2006-02-14 5:04 ` [patch 35/47] sh64: " Akinobu Mita
2006-02-14 5:04 ` [patch 36/47] sparc: " Akinobu Mita
2006-02-14 5:04 ` [patch 37/47] sparc64: " Akinobu Mita
2006-02-14 5:04 ` [patch 38/47] v850: " Akinobu Mita
2006-02-14 5:04 ` [patch 39/47] x86_64: " Akinobu Mita
2006-02-14 5:04 ` [patch 40/47] xtensa: " Akinobu Mita
2006-02-14 5:04 ` [patch 41/47] update include/asm-generic/bitops.h Akinobu Mita
2006-02-14 5:04 ` [patch 42/47] make thread_info.flags an unsigned long Akinobu Mita
2006-02-14 5:04 ` [patch 43/47] ia64: make partial_page.bitmap " Akinobu Mita
2006-02-14 5:04 ` [patch 44/47] ntfs: remove generic_ffs() Akinobu Mita
2006-02-14 5:04 ` [patch 45/47] remove unused generic bitops in include/linux/bitops.h Akinobu Mita
2006-02-14 5:04 ` [patch 46/47] hweight() related cleanup Akinobu Mita
2006-02-14 5:04 ` Akinobu Mita [this message]
Reply instructions:
You may reply publicly to this message via plain-text email
using any one of the following methods:
* Save the following mbox file, import it into your mail client,
and reply-to-all from there: mbox
Avoid top-posting and favor interleaved quoting:
https://en.wikipedia.org/wiki/Posting_style#Interleaved_style
* Reply using the --to, --cc, and --in-reply-to
switches of git-send-email(1):
git send-email \
--in-reply-to=20060214050450.536209000@localhost.localdomain \
--to=mita@miraclelinux.com \
--cc=akpm@osdl.org \
--cc=bsingharora@gmail.com \
--cc=iod00d@hp.com \
--cc=linux-kernel@vger.kernel.org \
--cc=linux@horizon.com \
--cc=paubert@iram.es \
/path/to/YOUR_REPLY
https://kernel.org/pub/software/scm/git/docs/git-send-email.html
* If your mail client supports setting the In-Reply-To header
via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line
before the message body.
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox