* Re: [PATCH 2/5] bitops: compile time optimization for hweight_long(CONSTANT)
[not found] ` <20100212174751.GD3114@aftab>
@ 2010-02-12 19:05 ` H. Peter Anvin
2010-02-17 13:57 ` Michal Marek
0 siblings, 1 reply; 37+ messages in thread
From: H. Peter Anvin @ 2010-02-12 19:05 UTC
To: Borislav Petkov, Michal Marek, linux-kbuild
Cc: Borislav Petkov, Peter Zijlstra, Andrew Morton, Wu Fengguang,
LKML, Jamie Lokier, Roland Dreier, Al Viro,
linux-fsdevel@vger.kernel.org, Ingo Molnar, Brian Gerst
On 02/12/2010 09:47 AM, Borislav Petkov wrote:
>
> However, this is generic code and for the above to work we have to
> enforce x86-specific CFLAGS for it. What is the preferred way to do
> that?
>
That's a question for Michal and the kbuild list. Michal?
-hpa
* Re: [PATCH 2/5] bitops: compile time optimization for hweight_long(CONSTANT)
2010-02-12 19:05 ` [PATCH 2/5] bitops: compile time optimization for hweight_long(CONSTANT) H. Peter Anvin
@ 2010-02-17 13:57 ` Michal Marek
2010-02-17 17:20 ` Borislav Petkov
2010-02-18 10:51 ` [PATCH 2/5] bitops: compile time optimization for hweight_long(CONSTANT) Peter Zijlstra
0 siblings, 2 replies; 37+ messages in thread
From: Michal Marek @ 2010-02-17 13:57 UTC
To: H. Peter Anvin, Borislav Petkov
Cc: linux-kbuild, Borislav Petkov, Peter Zijlstra, Andrew Morton,
Wu Fengguang, LKML, Jamie Lokier, Roland Dreier, Al Viro,
linux-fsdevel@vger.kernel.org, Ingo Molnar, Brian Gerst
On 12.2.2010 20:05, H. Peter Anvin wrote:
> On 02/12/2010 09:47 AM, Borislav Petkov wrote:
>>
>> However, this is generic code and for the above to work we have to
>> enforce x86-specific CFLAGS for it. What is the preferred way to do
>> that?
>>
>
> That's a question for Michal and the kbuild list. Michal?
(I was offline last week).
The _preferred_ way probably is not to do it :), but otherwise you can
set CFLAGS_hweight.o depending on CONFIG_X86(_32|_64), just like you do
in arch/x86/lib/Makefile already.
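A minimal sketch of that pattern (the flag values are just placeholders
here):

	# lib/Makefile: per-object CFLAGS, conditional on the arch config
	ifeq ($(CONFIG_X86_32),y)
	CFLAGS_hweight.o := -fcall-saved-ecx -fcall-saved-edx
	endif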
Michal
* Re: [PATCH 2/5] bitops: compile time optimization for hweight_long(CONSTANT)
2010-02-17 13:57 ` Michal Marek
@ 2010-02-17 17:20 ` Borislav Petkov
2010-02-17 17:31 ` Michal Marek
2010-02-18 10:51 ` [PATCH 2/5] bitops: compile time optimization for hweight_long(CONSTANT) Peter Zijlstra
1 sibling, 1 reply; 37+ messages in thread
From: Borislav Petkov @ 2010-02-17 17:20 UTC
To: Michal Marek
Cc: H. Peter Anvin, linux-kbuild, Borislav Petkov, Peter Zijlstra,
Andrew Morton, Wu Fengguang, LKML, Jamie Lokier, Roland Dreier,
Al Viro, linux-fsdevel@vger.kernel.org, Ingo Molnar, Brian Gerst
On Wed, Feb 17, 2010 at 02:57:42PM +0100, Michal Marek wrote:
> On 12.2.2010 20:05, H. Peter Anvin wrote:
> > On 02/12/2010 09:47 AM, Borislav Petkov wrote:
> >>
> >> However, this is generic code and for the above to work we have to
> >> enforce x86-specific CFLAGS for it. What is the preferred way to do
> >> that?
> >>
> >
> > That's a question for Michal and the kbuild list. Michal?
>
> (I was offline last week).
>
> The _preferred_ way probably is not to do it :), but otherwise you can
> set CFLAGS_hweight.o depending on CONFIG_X86(_32|_64), just like you do
> in arch/x86/lib/Makefile already.
Wouldn't it be better if we had something like ARCH_CFLAGS_hweight.o
which gets set in the arch Makefile instead?
--
Regards/Gruss,
Boris.
--
Advanced Micro Devices, Inc.
Operating Systems Research Center
* Re: [PATCH 2/5] bitops: compile time optimization for hweight_long(CONSTANT)
2010-02-17 17:20 ` Borislav Petkov
@ 2010-02-17 17:31 ` Michal Marek
2010-02-17 17:34 ` Borislav Petkov
2010-02-17 17:39 ` Michal Marek
0 siblings, 2 replies; 37+ messages in thread
From: Michal Marek @ 2010-02-17 17:31 UTC
To: Borislav Petkov
Cc: H. Peter Anvin, linux-kbuild, Borislav Petkov, Peter Zijlstra,
Andrew Morton, Wu Fengguang, LKML, Jamie Lokier, Roland Dreier,
Al Viro, linux-fsdevel@vger.kernel.org, Ingo Molnar, Brian Gerst
On 17.2.2010 18:20, Borislav Petkov wrote:
> On Wed, Feb 17, 2010 at 02:57:42PM +0100, Michal Marek wrote:
>> On 12.2.2010 20:05, H. Peter Anvin wrote:
>>> On 02/12/2010 09:47 AM, Borislav Petkov wrote:
>>>>
>>>> However, this is generic code and for the above to work we have to
>>>> enforce x86-specific CFLAGS for it. What is the preferred way to do
>>>> that?
>>>>
>>>
>>> That's a question for Michal and the kbuild list. Michal?
>>
>> (I was offline last week).
>>
>> The _preferred_ way probably is not to do it :), but otherwise you can
>> set CFLAGS_hweight.o depending on CONFIG_X86(_32|_64), just like you do
>> in arch/x86/lib/Makefile already.
>
> Wouldn't it be better if we had something like ARCH_CFLAGS_hweight.o
> which gets set in the arch Makefile instead?
We could, but is it worth it if there is only one potential user so far?
IMO just put the condition in lib/Makefile now and if there turn out to
be more cases like this, we can add support for ARCH_CFLAGS_foo.o then.
Michal
* Re: [PATCH 2/5] bitops: compile time optimization for hweight_long(CONSTANT)
2010-02-17 17:31 ` Michal Marek
@ 2010-02-17 17:34 ` Borislav Petkov
2010-02-17 17:39 ` Michal Marek
1 sibling, 0 replies; 37+ messages in thread
From: Borislav Petkov @ 2010-02-17 17:34 UTC
To: Michal Marek
Cc: H. Peter Anvin, linux-kbuild, Borislav Petkov, Peter Zijlstra,
Andrew Morton, Wu Fengguang, LKML, Jamie Lokier, Roland Dreier,
Al Viro, linux-fsdevel@vger.kernel.org, Ingo Molnar, Brian Gerst
On Wed, Feb 17, 2010 at 06:31:04PM +0100, Michal Marek wrote:
> On 17.2.2010 18:20, Borislav Petkov wrote:
> > On Wed, Feb 17, 2010 at 02:57:42PM +0100, Michal Marek wrote:
> >> On 12.2.2010 20:05, H. Peter Anvin wrote:
> >>> On 02/12/2010 09:47 AM, Borislav Petkov wrote:
> >>>>
> >>>> However, this is generic code and for the above to work we have to
> >>>> enforce x86-specific CFLAGS for it. What is the preferred way to do
> >>>> that?
> >>>>
> >>>
> >>> That's a question for Michal and the kbuild list. Michal?
> >>
> >> (I was offline last week).
> >>
> >> The _preferred_ way probably is not to do it :), but otherwise you can
> >> set CFLAGS_hweight.o depending on CONFIG_X86(_32|_64), just like you do
> >> in arch/x86/lib/Makefile already.
> >
> > Wouldn't it be better if we had something like ARCH_CFLAGS_hweight.o
> > which gets set in the arch Makefile instead?
>
> We could, but is it worth it if there is only one potential user so far?
> IMO just put the condition in lib/Makefile now and if there turn out to
> be more cases like this, we can add support for ARCH_CFLAGS_foo.o then.
Ok, I'm fine with that too, thanks.
--
Regards/Gruss,
Boris.
--
Advanced Micro Devices, Inc.
Operating Systems Research Center
* Re: [PATCH 2/5] bitops: compile time optimization for hweight_long(CONSTANT)
2010-02-17 17:31 ` Michal Marek
2010-02-17 17:34 ` Borislav Petkov
@ 2010-02-17 17:39 ` Michal Marek
2010-02-18 6:19 ` Borislav Petkov
1 sibling, 1 reply; 37+ messages in thread
From: Michal Marek @ 2010-02-17 17:39 UTC
To: Borislav Petkov
Cc: H. Peter Anvin, linux-kbuild, Borislav Petkov, Peter Zijlstra,
Andrew Morton, Wu Fengguang, LKML, Jamie Lokier, Roland Dreier,
Al Viro, linux-fsdevel@vger.kernel.org, Ingo Molnar, Brian Gerst
On 17.2.2010 18:31, Michal Marek wrote:
> On 17.2.2010 18:20, Borislav Petkov wrote:
>> On Wed, Feb 17, 2010 at 02:57:42PM +0100, Michal Marek wrote:
>>> On 12.2.2010 20:05, H. Peter Anvin wrote:
>>>> On 02/12/2010 09:47 AM, Borislav Petkov wrote:
>>>>>
>>>>> However, this is generic code and for the above to work we have to
>>>>> enforce x86-specific CFLAGS for it. What is the preferred way to do
>>>>> that?
>>>>>
>>>>
>>>> That's a question for Michal and the kbuild list. Michal?
>>>
>>> (I was offline last week).
>>>
>>> The _preferred_ way probably is not to do it :), but otherwise you can
>>> set CFLAGS_hweight.o depending on CONFIG_X86(_32|_64), just like you do
>>> in arch/x86/lib/Makefile already.
>>
>> Wouldn't it be better if we had something like ARCH_CFLAGS_hweight.o
>> which gets set in the arch Makefile instead?
>
> We could, but is it worth it if there is only one potential user so far?
> IMO just put the condition in lib/Makefile now and if there turn out to
> be more cases like this, we can add support for ARCH_CFLAGS_foo.o then.
It wouldn't work actually, because such a variable would then apply to all
hweight.o targets in the tree. But another way would be:
arch/x86/Kconfig:

	config ARCH_HWEIGHT_CFLAGS
		string
		default "..." if X86_32
		default "..." if X86_64

lib/Makefile:

	CFLAGS_hweight.o = $(CONFIG_ARCH_HWEIGHT_CFLAGS)
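(Kconfig writes string symbols out with the quotes included, both to
.config and to include/config/auto.conf which the Makefiles include,
i.e. you would get something like:

	CONFIG_ARCH_HWEIGHT_CFLAGS="-fcall-saved-ecx -fcall-saved-edx"

so the value may need its quotes stripped before it can be handed to
$(CC); see below.)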
Michal
* Re: [PATCH 2/5] bitops: compile time optimization for hweight_long(CONSTANT)
2010-02-17 17:39 ` Michal Marek
@ 2010-02-18 6:19 ` Borislav Petkov
2010-02-19 14:22 ` [PATCH] x86: Add optimized popcnt variants Borislav Petkov
0 siblings, 1 reply; 37+ messages in thread
From: Borislav Petkov @ 2010-02-18 6:19 UTC
To: Michal Marek
Cc: Borislav Petkov, H. Peter Anvin, linux-kbuild, Peter Zijlstra,
Andrew Morton, Wu Fengguang, LKML, Jamie Lokier, Roland Dreier,
Al Viro, linux-fsdevel@vger.kernel.org, Ingo Molnar, Brian Gerst
On Wed, Feb 17, 2010 at 06:39:13PM +0100, Michal Marek wrote:
> It wouldn't work actually, because such a variable would then apply to all
> hweight.o targets in the tree. But another way would be:
>
> arch/x86/Kconfig
> config ARCH_HWEIGHT_CFLAGS
> string
> default "..." if X86_32
> default "..." if X86_64
>
> lib/Makefile
> CFLAGS_hweight.o = $(CONFIG_ARCH_HWEIGHT_CFLAGS)
Yep, this works, albeit with a small adjustment since
CONFIG_ARCH_HWEIGHT_CFLAGS is quoted in the Kconfig and the quotes
appear in the $(CC) call like this:
gcc -Wp,-MD,lib/.hweight.o.d ... "-fcall-saved-ecx..."
which I fixed like this (idea reused from the make manual):
---
diff --git a/arch/x86/Kconfig b/arch/x86/Kconfig
index eb40925..176950e 100644
--- a/arch/x86/Kconfig
+++ b/arch/x86/Kconfig
@@ -230,6 +230,11 @@ config X86_32_LAZY_GS
 	def_bool y
 	depends on X86_32 && !CC_STACKPROTECTOR

+config ARCH_HWEIGHT_CFLAGS
+	string
+	default "-fcall-saved-ecx -fcall-saved-edx" if X86_32
+	default "-fcall-saved-rsi -fcall-saved-rdx -fcall-saved-rcx -fcall-saved-r8 -fcall-saved-r9 -fcall-saved-r10 -fcall-saved-r11" if X86_64
+
 config KTIME_SCALAR
 	def_bool X86_32
 source "init/Kconfig"
diff --git a/lib/Makefile b/lib/Makefile
index 3b0b4a6..e2ad17c 100644
--- a/lib/Makefile
+++ b/lib/Makefile
@@ -39,7 +39,10 @@ lib-$(CONFIG_RWSEM_XCHGADD_ALGORITHM) += rwsem.o
lib-$(CONFIG_GENERIC_FIND_FIRST_BIT) += find_next_bit.o
lib-$(CONFIG_GENERIC_FIND_NEXT_BIT) += find_next_bit.o
obj-$(CONFIG_GENERIC_FIND_LAST_BIT) += find_last_bit.o
+
+CFLAGS_hweight.o = $(subst $(quote),,$(CONFIG_ARCH_HWEIGHT_CFLAGS))
obj-$(CONFIG_GENERIC_HWEIGHT) += hweight.o
+
obj-$(CONFIG_LOCK_KERNEL) += kernel_lock.o
obj-$(CONFIG_DEBUG_PREEMPT) += smp_processor_id.o
obj-$(CONFIG_DEBUG_LIST) += list_debug.o
diff --git a/scripts/Makefile.lib b/scripts/Makefile.lib
index f9bdf26..cbcd654 100644
--- a/scripts/Makefile.lib
+++ b/scripts/Makefile.lib
@@ -245,3 +245,7 @@ quiet_cmd_lzo = LZO $@
cmd_lzo = (cat $(filter-out FORCE,$^) | \
lzop -9 && $(call size_append, $(filter-out FORCE,$^))) > $@ || \
(rm -f $@ ; false)
+
+# misc stuff
+# ---------------------------------------------------------------------------
+quote:="
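In isolation, the trick is just this (a sketch, with the value
shortened):

	quote := "
	# the Kconfig value arrives as "-fcall-saved-ecx ..." including
	# the literal double quotes; $(subst) strips them out
	CFLAGS_hweight.o = $(subst $(quote),,$(CONFIG_ARCH_HWEIGHT_CFLAGS))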
I'm open to better suggestions, though.
--
Regards/Gruss,
Boris.
* Re: [PATCH 2/5] bitops: compile time optimization for hweight_long(CONSTANT)
2010-02-17 13:57 ` Michal Marek
2010-02-17 17:20 ` Borislav Petkov
@ 2010-02-18 10:51 ` Peter Zijlstra
2010-02-18 11:51 ` Borislav Petkov
1 sibling, 1 reply; 37+ messages in thread
From: Peter Zijlstra @ 2010-02-18 10:51 UTC
To: Michal Marek
Cc: H. Peter Anvin, Borislav Petkov, linux-kbuild, Borislav Petkov,
Andrew Morton, Wu Fengguang, LKML, Jamie Lokier, Roland Dreier,
Al Viro, linux-fsdevel@vger.kernel.org, Ingo Molnar, Brian Gerst
On Wed, 2010-02-17 at 14:57 +0100, Michal Marek wrote:
> On 12.2.2010 20:05, H. Peter Anvin wrote:
> > On 02/12/2010 09:47 AM, Borislav Petkov wrote:
> >>
> >> However, this is generic code and for the above to work we have to
> >> enforce x86-specific CFLAGS for it. What is the preferred way to do
> >> that?
> >>
> >
> > That's a question for Michal and the kbuild list. Michal?
>
> (I was offline last week).
>
> The _preferred_ way probably is not to do it :), but otherwise you can
> set CFLAGS_hweight.o depending on CONFIG_X86(_32|_64), just like you do
> in arch/x86/lib/Makefile already.
I guess one way to achieve that is to create an arch/x86/lib/hweight.c
that includes lib/hweight.c, give the x86 one special compile flags,
and not build the lib one.
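I.e., something like this sketch (untested):

	/* arch/x86/lib/hweight.c: wrap the generic code in x86 CFLAGS */
	#include "../../../lib/hweight.c"

plus a CFLAGS_hweight.o line in arch/x86/lib/Makefile, and lib/hweight.o
not built when the arch provides its own copy.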
* Re: [PATCH 2/5] bitops: compile time optimization for hweight_long(CONSTANT)
2010-02-18 10:51 ` [PATCH 2/5] bitops: compile time optimization for hweight_long(CONSTANT) Peter Zijlstra
@ 2010-02-18 11:51 ` Borislav Petkov
0 siblings, 0 replies; 37+ messages in thread
From: Borislav Petkov @ 2010-02-18 11:51 UTC
To: Peter Zijlstra
Cc: Michal Marek, H. Peter Anvin, linux-kbuild, Borislav Petkov,
Andrew Morton, Wu Fengguang, LKML, Jamie Lokier, Roland Dreier,
Al Viro, linux-fsdevel@vger.kernel.org, Ingo Molnar, Brian Gerst
On Thu, Feb 18, 2010 at 11:51:50AM +0100, Peter Zijlstra wrote:
> I guess one way to achieve that is to create an arch/x86/lib/hweight.c
> that includes lib/hweight.c, give the x86 one special compile flags,
> and not build the lib one.
That's what I thought initially too, but that won't fly because the
lib/hweight.c helpers have to be inlined into arch/x86/lib/hweight.c so
that gcc can take care of the clobbered registers. Otherwise, it's just
a "call __sw_hweightXX" that gets emitted into the asm.
--
Regards/Gruss,
Boris.
-
Advanced Micro Devices, Inc.
Operating Systems Research Center
* [PATCH] x86: Add optimized popcnt variants
2010-02-18 6:19 ` Borislav Petkov
@ 2010-02-19 14:22 ` Borislav Petkov
2010-02-19 16:06 ` H. Peter Anvin
0 siblings, 1 reply; 37+ messages in thread
From: Borislav Petkov @ 2010-02-19 14:22 UTC
To: H. Peter Anvin
Cc: Michal Marek, Borislav Petkov, linux-kbuild, Peter Zijlstra,
Andrew Morton, Wu Fengguang, LKML, Jamie Lokier, Roland Dreier,
Al Viro, linux-fsdevel@vger.kernel.org, Ingo Molnar, Brian Gerst
From: Borislav Petkov <borislav.petkov@amd.com>
Add support for the hardware version of the Hamming weight function,
popcnt, present in CPUs which advertise it under CPUID, Function
0x0000_0001_ECX[23]. On CPUs which don't support it, we fall back to the
default lib/hweight.c sw versions.
A synthetic benchmark comparing popcnt with __sw_hweight64 showed almost
a 3x speedup on an F10h machine.
Signed-off-by: Borislav Petkov <borislav.petkov@amd.com>
---
arch/x86/Kconfig | 5 ++
arch/x86/include/asm/bitops.h | 7 +++-
arch/x86/lib/Makefile | 2 +-
arch/x86/lib/hweight.c | 62 +++++++++++++++++++++++++++++
include/asm-generic/bitops/arch_hweight.h | 22 ++++++++--
lib/Makefile | 3 +
lib/hweight.c | 20 +++++-----
scripts/Makefile.lib | 4 ++
8 files changed, 109 insertions(+), 16 deletions(-)
create mode 100644 arch/x86/lib/hweight.c
diff --git a/arch/x86/Kconfig b/arch/x86/Kconfig
index eb40925..176950e 100644
--- a/arch/x86/Kconfig
+++ b/arch/x86/Kconfig
@@ -230,6 +230,11 @@ config X86_32_LAZY_GS
def_bool y
depends on X86_32 && !CC_STACKPROTECTOR
+config ARCH_HWEIGHT_CFLAGS
+ string
+ default "-fcall-saved-ecx -fcall-saved-edx" if X86_32
+ default "-fcall-saved-rsi -fcall-saved-rdx -fcall-saved-rcx -fcall-saved-r8 -fcall-saved-r9 -fcall-saved-r10 -fcall-saved-r11" if X86_64
+
config KTIME_SCALAR
def_bool X86_32
source "init/Kconfig"
diff --git a/arch/x86/include/asm/bitops.h b/arch/x86/include/asm/bitops.h
index 02b47a6..5ec3bd8 100644
--- a/arch/x86/include/asm/bitops.h
+++ b/arch/x86/include/asm/bitops.h
@@ -444,7 +444,12 @@ static inline int fls(int x)
#define ARCH_HAS_FAST_MULTIPLIER 1
-#include <asm-generic/bitops/hweight.h>
+extern unsigned int __arch_hweight32(unsigned int w);
+extern unsigned int __arch_hweight16(unsigned int w);
+extern unsigned int __arch_hweight8(unsigned int w);
+extern unsigned long __arch_hweight64(__u64 w);
+
+#include <asm-generic/bitops/const_hweight.h>
#endif /* __KERNEL__ */
diff --git a/arch/x86/lib/Makefile b/arch/x86/lib/Makefile
index cffd754..e811bbd 100644
--- a/arch/x86/lib/Makefile
+++ b/arch/x86/lib/Makefile
@@ -22,7 +22,7 @@ lib-y += usercopy_$(BITS).o getuser.o putuser.o
lib-y += memcpy_$(BITS).o
lib-$(CONFIG_KPROBES) += insn.o inat.o
-obj-y += msr.o msr-reg.o msr-reg-export.o
+obj-y += msr.o msr-reg.o msr-reg-export.o hweight.o
ifeq ($(CONFIG_X86_32),y)
obj-y += atomic64_32.o
diff --git a/arch/x86/lib/hweight.c b/arch/x86/lib/hweight.c
new file mode 100644
index 0000000..54d3cb0
--- /dev/null
+++ b/arch/x86/lib/hweight.c
@@ -0,0 +1,62 @@
+#include <linux/kernel.h>
+#include <linux/module.h>
+#include <linux/bitops.h>
+
+#ifdef CONFIG_64BIT
+/* popcnt %rdi, %rax */
+#define POPCNT ".byte 0xf3\n\t.byte 0x48\n\t.byte 0x0f\n\t.byte 0xb8\n\t.byte 0xc7"
+#define REG_IN "D"
+#define REG_OUT "a"
+#else
+/* popcnt %eax, %eax */
+#define POPCNT ".byte 0xf3\n\t.byte 0x0f\n\t.byte 0xb8\n\t.byte 0xc0"
+#define REG_IN "a"
+#define REG_OUT "a"
+#endif
+
+/*
+ * __sw_hweightXX are called from within the alternatives below
+ * and callee-clobbered registers need to be taken care of. See
+ * ARCH_HWEIGHT_CFLAGS in <arch/x86/Kconfig> for the respective
+ * compiler switches.
+ */
+unsigned int __arch_hweight32(unsigned int w)
+{
+ unsigned int res = 0;
+
+ asm (ALTERNATIVE("call __sw_hweight32", POPCNT, X86_FEATURE_POPCNT)
+ : "="REG_OUT (res)
+ : REG_IN (w));
+
+ return res;
+}
+EXPORT_SYMBOL(__arch_hweight32);
+
+unsigned int __arch_hweight16(unsigned int w)
+{
+ return __arch_hweight32(w & 0xffff);
+}
+EXPORT_SYMBOL(__arch_hweight16);
+
+unsigned int __arch_hweight8(unsigned int w)
+{
+ return __arch_hweight32(w & 0xff);
+}
+EXPORT_SYMBOL(__arch_hweight8);
+
+unsigned long __arch_hweight64(__u64 w)
+{
+ unsigned long res = 0;
+
+#ifdef CONFIG_X86_32
+ return __arch_hweight32((u32)w) +
+ __arch_hweight32((u32)(w >> 32));
+#else
+ asm (ALTERNATIVE("call __sw_hweight64", POPCNT, X86_FEATURE_POPCNT)
+ : "="REG_OUT (res)
+ : REG_IN (w));
+#endif /* CONFIG_X86_32 */
+
+ return res;
+}
+EXPORT_SYMBOL(__arch_hweight64);
diff --git a/include/asm-generic/bitops/arch_hweight.h b/include/asm-generic/bitops/arch_hweight.h
index 3a7be84..1c82306 100644
--- a/include/asm-generic/bitops/arch_hweight.h
+++ b/include/asm-generic/bitops/arch_hweight.h
@@ -3,9 +3,23 @@
#include <asm/types.h>
-extern unsigned int __arch_hweight32(unsigned int w);
-extern unsigned int __arch_hweight16(unsigned int w);
-extern unsigned int __arch_hweight8(unsigned int w);
-extern unsigned long __arch_hweight64(__u64 w);
+unsigned int __arch_hweight32(unsigned int w)
+{
+ return __sw_hweight32(w);
+}
+unsigned int __arch_hweight16(unsigned int w)
+{
+ return __sw_hweight16(w);
+}
+
+unsigned int __arch_hweight8(unsigned int w)
+{
+ return __sw_hweight8(w);
+}
+
+unsigned long __arch_hweight64(__u64 w)
+{
+ return __sw_hweight64(w);
+}
#endif /* _ASM_GENERIC_BITOPS_HWEIGHT_H_ */
diff --git a/lib/Makefile b/lib/Makefile
index 3b0b4a6..e2ad17c 100644
--- a/lib/Makefile
+++ b/lib/Makefile
@@ -39,7 +39,10 @@ lib-$(CONFIG_RWSEM_XCHGADD_ALGORITHM) += rwsem.o
lib-$(CONFIG_GENERIC_FIND_FIRST_BIT) += find_next_bit.o
lib-$(CONFIG_GENERIC_FIND_NEXT_BIT) += find_next_bit.o
obj-$(CONFIG_GENERIC_FIND_LAST_BIT) += find_last_bit.o
+
+CFLAGS_hweight.o = $(subst $(quote),,$(CONFIG_ARCH_HWEIGHT_CFLAGS))
obj-$(CONFIG_GENERIC_HWEIGHT) += hweight.o
+
obj-$(CONFIG_LOCK_KERNEL) += kernel_lock.o
obj-$(CONFIG_DEBUG_PREEMPT) += smp_processor_id.o
obj-$(CONFIG_DEBUG_LIST) += list_debug.o
diff --git a/lib/hweight.c b/lib/hweight.c
index 9ff86df..f9ce440 100644
--- a/lib/hweight.c
+++ b/lib/hweight.c
@@ -9,7 +9,7 @@
* The Hamming Weight of a number is the total number of bits set in it.
*/
-unsigned int __arch_hweight32(unsigned int w)
+unsigned int __sw_hweight32(unsigned int w)
{
unsigned int res = w - ((w >> 1) & 0x55555555);
res = (res & 0x33333333) + ((res >> 2) & 0x33333333);
@@ -17,30 +17,30 @@ unsigned int __arch_hweight32(unsigned int w)
res = res + (res >> 8);
return (res + (res >> 16)) & 0x000000FF;
}
-EXPORT_SYMBOL(__arch_hweight32);
+EXPORT_SYMBOL(__sw_hweight32);
-unsigned int __arch_hweight16(unsigned int w)
+unsigned int __sw_hweight16(unsigned int w)
{
unsigned int res = w - ((w >> 1) & 0x5555);
res = (res & 0x3333) + ((res >> 2) & 0x3333);
res = (res + (res >> 4)) & 0x0F0F;
return (res + (res >> 8)) & 0x00FF;
}
-EXPORT_SYMBOL(__arch_hweight16);
+EXPORT_SYMBOL(__sw_hweight16);
-unsigned int __arch_hweight8(unsigned int w)
+unsigned int __sw_hweight8(unsigned int w)
{
unsigned int res = w - ((w >> 1) & 0x55);
res = (res & 0x33) + ((res >> 2) & 0x33);
return (res + (res >> 4)) & 0x0F;
}
-EXPORT_SYMBOL(__arch_hweight8);
+EXPORT_SYMBOL(__sw_hweight8);
-unsigned long __arch_hweight64(__u64 w)
+unsigned long __sw_hweight64(__u64 w)
{
#if BITS_PER_LONG == 32
- return __arch_hweight32((unsigned int)(w >> 32)) +
- __arch_hweight32((unsigned int)w);
+ return __sw_hweight32((unsigned int)(w >> 32)) +
+ __sw_hweight32((unsigned int)w);
#elif BITS_PER_LONG == 64
#ifdef ARCH_HAS_FAST_MULTIPLIER
w -= (w >> 1) & 0x5555555555555555ul;
@@ -57,4 +57,4 @@ unsigned long __arch_hweight64(__u64 w)
#endif
#endif
}
-EXPORT_SYMBOL(__arch_hweight64);
+EXPORT_SYMBOL(__sw_hweight64);
diff --git a/scripts/Makefile.lib b/scripts/Makefile.lib
index f9bdf26..cbcd654 100644
--- a/scripts/Makefile.lib
+++ b/scripts/Makefile.lib
@@ -245,3 +245,7 @@ quiet_cmd_lzo = LZO $@
cmd_lzo = (cat $(filter-out FORCE,$^) | \
lzop -9 && $(call size_append, $(filter-out FORCE,$^))) > $@ || \
(rm -f $@ ; false)
+
+# misc stuff
+# ---------------------------------------------------------------------------
+quote:="
--
1.6.5.4
--
Regards/Gruss,
Boris.
-
Advanced Micro Devices, Inc.
Operating Systems Research Center
* Re: [PATCH] x86: Add optimized popcnt variants
2010-02-19 14:22 ` [PATCH] x86: Add optimized popcnt variants Borislav Petkov
@ 2010-02-19 16:06 ` H. Peter Anvin
2010-02-19 16:45 ` Borislav Petkov
0 siblings, 1 reply; 37+ messages in thread
From: H. Peter Anvin @ 2010-02-19 16:06 UTC
To: Borislav Petkov
Cc: Michal Marek, linux-kbuild, Peter Zijlstra, Andrew Morton,
Wu Fengguang, LKML, Jamie Lokier, Roland Dreier, Al Viro,
linux-fsdevel@vger.kernel.org, Ingo Molnar, Brian Gerst
On 02/19/2010 06:22 AM, Borislav Petkov wrote:
> --- /dev/null
> +++ b/arch/x86/lib/hweight.c
> @@ -0,0 +1,62 @@
> +#include <linux/kernel.h>
> +#include <linux/module.h>
> +#include <linux/bitops.h>
> +
> +#ifdef CONFIG_64BIT
> +/* popcnt %rdi, %rax */
> +#define POPCNT ".byte 0xf3\n\t.byte 0x48\n\t.byte 0x0f\n\t.byte 0xb8\n\t.byte 0xc7"
> +#define REG_IN "D"
> +#define REG_OUT "a"
> +#else
> +/* popcnt %eax, %eax */
> +#define POPCNT ".byte 0xf3\n\t.byte 0x0f\n\t.byte 0xb8\n\t.byte 0xc0"
> +#define REG_IN "a"
> +#define REG_OUT "a"
> +#endif
> +
> +/*
> + * __sw_hweightXX are called from within the alternatives below
> + * and callee-clobbered registers need to be taken care of. See
> + * ARCH_HWEIGHT_CFLAGS in <arch/x86/Kconfig> for the respective
> + * compiler switches.
> + */
> +unsigned int __arch_hweight32(unsigned int w)
> +{
> + unsigned int res = 0;
> +
> + asm (ALTERNATIVE("call __sw_hweight32", POPCNT, X86_FEATURE_POPCNT)
> + : "="REG_OUT (res)
> + : REG_IN (w));
> +
> + return res;
> +}
> +EXPORT_SYMBOL(__arch_hweight32);
> +
> +unsigned int __arch_hweight16(unsigned int w)
> +{
> + return __arch_hweight32(w & 0xffff);
> +}
> +EXPORT_SYMBOL(__arch_hweight16);
> +
> +unsigned int __arch_hweight8(unsigned int w)
> +{
> + return __arch_hweight32(w & 0xff);
> +}
> +EXPORT_SYMBOL(__arch_hweight8);
> +
> +unsigned long __arch_hweight64(__u64 w)
> +{
> + unsigned long res = 0;
> +
> +#ifdef CONFIG_X86_32
> + return __arch_hweight32((u32)w) +
> + __arch_hweight32((u32)(w >> 32));
> +#else
> + asm (ALTERNATIVE("call __sw_hweight64", POPCNT, X86_FEATURE_POPCNT)
> + : "="REG_OUT (res)
> + : REG_IN (w));
> +#endif /* CONFIG_X86_32 */
> +
> + return res;
> +}
You're still not inlining these. They should be: there is absolutely no
code-size reason not to inline them anymore.
> diff --git a/include/asm-generic/bitops/arch_hweight.h b/include/asm-generic/bitops/arch_hweight.h
> index 3a7be84..1c82306 100644
> --- a/include/asm-generic/bitops/arch_hweight.h
> +++ b/include/asm-generic/bitops/arch_hweight.h
> @@ -3,9 +3,23 @@
>
> #include <asm/types.h>
>
> -extern unsigned int __arch_hweight32(unsigned int w);
> -extern unsigned int __arch_hweight16(unsigned int w);
> -extern unsigned int __arch_hweight8(unsigned int w);
> -extern unsigned long __arch_hweight64(__u64 w);
> +unsigned int __arch_hweight32(unsigned int w)
> +{
> + return __sw_hweight32(w);
> +}
>
> +unsigned int __arch_hweight16(unsigned int w)
> +{
> + return __sw_hweight16(w);
> +}
> +
> +unsigned int __arch_hweight8(unsigned int w)
> +{
> + return __sw_hweight8(w);
> +}
> +
> +unsigned long __arch_hweight64(__u64 w)
> +{
> + return __sw_hweight64(w);
> +}
> #endif /* _ASM_GENERIC_BITOPS_HWEIGHT_H_ */
and these are in a header file and *definitely* should be inlines.
-hpa
--
H. Peter Anvin, Intel Open Source Technology Center
I work for Intel. I don't speak on their behalf.
* Re: [PATCH] x86: Add optimized popcnt variants
2010-02-19 16:06 ` H. Peter Anvin
@ 2010-02-19 16:45 ` Borislav Petkov
2010-02-19 16:53 ` H. Peter Anvin
0 siblings, 1 reply; 37+ messages in thread
From: Borislav Petkov @ 2010-02-19 16:45 UTC
To: H. Peter Anvin
Cc: Michal Marek, linux-kbuild, Peter Zijlstra, Andrew Morton,
Wu Fengguang, LKML, Jamie Lokier, Roland Dreier, Al Viro,
linux-fsdevel@vger.kernel.org, Ingo Molnar, Brian Gerst
From: "H. Peter Anvin" <hpa@zytor.com>
Date: Fri, Feb 19, 2010 at 08:06:07AM -0800
<snip>
> > +unsigned long __arch_hweight64(__u64 w)
> > +{
> > + unsigned long res = 0;
> > +
> > +#ifdef CONFIG_X86_32
> > + return __arch_hweight32((u32)w) +
> > + __arch_hweight32((u32)(w >> 32));
> > +#else
> > + asm (ALTERNATIVE("call __sw_hweight64", POPCNT, X86_FEATURE_POPCNT)
> > + : "="REG_OUT (res)
> > + : REG_IN (w));
> > +#endif /* CONFIG_X86_32 */
> > +
> > + return res;
> > +}
>
> You're still not inlining these. They should be: there is absolutely no
> code-size reason not to inline them anymore.
Isn't it better to have only those 4 locations for apply_alternatives
to patch wrt popcnt, instead of sprinkling alternatives sections around
the kernel at every call site of hweight and its users? Or is the aim to
optimize even that "call __arch_hweightXX" away?
> > +unsigned long __arch_hweight64(__u64 w)
> > +{
> > + return __sw_hweight64(w);
> > +}
> > #endif /* _ASM_GENERIC_BITOPS_HWEIGHT_H_ */
>
> and these are in a header file and *definitely* should be inlines.
Yep, done.
--
Regards/Gruss,
Boris.
-
Advanced Micro Devices, Inc.
Operating Systems Research Center
* Re: [PATCH] x86: Add optimized popcnt variants
2010-02-19 16:45 ` Borislav Petkov
@ 2010-02-19 16:53 ` H. Peter Anvin
2010-02-22 14:17 ` Borislav Petkov
0 siblings, 1 reply; 37+ messages in thread
From: H. Peter Anvin @ 2010-02-19 16:53 UTC
To: Borislav Petkov
Cc: Michal Marek, linux-kbuild, Peter Zijlstra, Andrew Morton,
Wu Fengguang, LKML, Jamie Lokier, Roland Dreier, Al Viro,
linux-fsdevel@vger.kernel.org, Ingo Molnar, Brian Gerst
On 02/19/2010 08:45 AM, Borislav Petkov wrote:
>>
>> You're still not inlining these. They should be: there is absolutely no
>> code-size reason not to inline them anymore.
>
> Isn't better to have only those 4 locations for apply_alternatives to
> patch wrt to popcnt instead of sprinkling alternatives sections around
> the kernel in every callsite of hweight and its users? Or is the aim to
> optimize even that "call __arch_hweightXX" away?
>
That's the idea, yes. We use inline alternatives in quite a few other
places.
-hpa
--
H. Peter Anvin, Intel Open Source Technology Center
I work for Intel. I don't speak on their behalf.
* Re: [PATCH] x86: Add optimized popcnt variants
2010-02-19 16:53 ` H. Peter Anvin
@ 2010-02-22 14:17 ` Borislav Petkov
2010-02-22 17:21 ` H. Peter Anvin
0 siblings, 1 reply; 37+ messages in thread
From: Borislav Petkov @ 2010-02-22 14:17 UTC
To: H. Peter Anvin
Cc: Borislav Petkov, Michal Marek, linux-kbuild, Peter Zijlstra,
Andrew Morton, Wu Fengguang, LKML, Jamie Lokier, Roland Dreier,
Al Viro, linux-fsdevel@vger.kernel.org, Ingo Molnar, Brian Gerst
From: "H. Peter Anvin" <hpa@zytor.com>
Date: Fri, Feb 19, 2010 at 08:53:32AM -0800
> That's the idea, yes. We use inline alternatives in quite a few other
> places.
Ok, inlining results in over 100 replacements here, on both 32-bit and
64-bit. Here we go:
--
From: Borislav Petkov <borislav.petkov@amd.com>
Date: Thu, 11 Feb 2010 00:48:31 +0100
Subject: [PATCH] x86: Add optimized popcnt variants
Add support for the hardware version of the Hamming weight function,
popcnt, present in CPUs which advertise it under CPUID, Function
0x0000_0001_ECX[23]. On CPUs which don't support it, we fall back to the
default lib/hweight.c sw versions.
A synthetic benchmark comparing popcnt with __sw_hweight64 showed almost
a 3x speedup on an F10h machine.
Signed-off-by: Borislav Petkov <borislav.petkov@amd.com>
---
arch/x86/Kconfig | 5 ++
arch/x86/include/asm/alternative.h | 9 +++-
arch/x86/include/asm/arch_hweight.h | 59 +++++++++++++++++++++++++++++
arch/x86/include/asm/bitops.h | 4 +-
include/asm-generic/bitops/arch_hweight.h | 22 +++++++++--
lib/Makefile | 3 +
lib/hweight.c | 20 +++++-----
scripts/Makefile.lib | 4 ++
8 files changed, 108 insertions(+), 18 deletions(-)
create mode 100644 arch/x86/include/asm/arch_hweight.h
diff --git a/arch/x86/Kconfig b/arch/x86/Kconfig
index eb40925..176950e 100644
--- a/arch/x86/Kconfig
+++ b/arch/x86/Kconfig
@@ -230,6 +230,11 @@ config X86_32_LAZY_GS
def_bool y
depends on X86_32 && !CC_STACKPROTECTOR
+config ARCH_HWEIGHT_CFLAGS
+ string
+ default "-fcall-saved-ecx -fcall-saved-edx" if X86_32
+ default "-fcall-saved-rsi -fcall-saved-rdx -fcall-saved-rcx -fcall-saved-r8 -fcall-saved-r9 -fcall-saved-r10 -fcall-saved-r11" if X86_64
+
config KTIME_SCALAR
def_bool X86_32
source "init/Kconfig"
diff --git a/arch/x86/include/asm/alternative.h b/arch/x86/include/asm/alternative.h
index 69b74a7..0720c96 100644
--- a/arch/x86/include/asm/alternative.h
+++ b/arch/x86/include/asm/alternative.h
@@ -39,9 +39,6 @@
#define LOCK_PREFIX ""
#endif
-/* This must be included *after* the definition of LOCK_PREFIX */
-#include <asm/cpufeature.h>
-
struct alt_instr {
u8 *instr; /* original instruction */
u8 *replacement;
@@ -91,6 +88,12 @@ static inline void alternatives_smp_switch(int smp) {}
".previous"
/*
+ * This must be included *after* the definition of ALTERNATIVE due to
+ * <asm/arch_hweight.h>
+ */
+#include <asm/cpufeature.h>
+
+/*
* Alternative instructions for different CPU types or capabilities.
*
* This allows to use optimized instructions even on generic binary
diff --git a/arch/x86/include/asm/arch_hweight.h b/arch/x86/include/asm/arch_hweight.h
new file mode 100644
index 0000000..f79b733
--- /dev/null
+++ b/arch/x86/include/asm/arch_hweight.h
@@ -0,0 +1,59 @@
+#ifndef _ASM_X86_HWEIGHT_H
+#define _ASM_X86_HWEIGHT_H
+
+#ifdef CONFIG_64BIT
+/* popcnt %rdi, %rax */
+#define POPCNT ".byte 0xf3\n\t.byte 0x48\n\t.byte 0x0f\n\t.byte 0xb8\n\t.byte 0xc7"
+#define REG_IN "D"
+#define REG_OUT "a"
+#else
+/* popcnt %eax, %eax */
+#define POPCNT ".byte 0xf3\n\t.byte 0x0f\n\t.byte 0xb8\n\t.byte 0xc0"
+#define REG_IN "a"
+#define REG_OUT "a"
+#endif
+
+/*
+ * __sw_hweightXX are called from within the alternatives below
+ * and callee-clobbered registers need to be taken care of. See
+ * ARCH_HWEIGHT_CFLAGS in <arch/x86/Kconfig> for the respective
+ * compiler switches.
+ */
+static inline unsigned int __arch_hweight32(unsigned int w)
+{
+ unsigned int res = 0;
+
+ asm (ALTERNATIVE("call __sw_hweight32", POPCNT, X86_FEATURE_POPCNT)
+ : "="REG_OUT (res)
+ : REG_IN (w));
+
+ return res;
+}
+
+static inline unsigned int __arch_hweight16(unsigned int w)
+{
+ return __arch_hweight32(w & 0xffff);
+}
+
+static inline unsigned int __arch_hweight8(unsigned int w)
+{
+ return __arch_hweight32(w & 0xff);
+}
+
+static inline unsigned long __arch_hweight64(__u64 w)
+{
+ unsigned long res = 0;
+
+#ifdef CONFIG_X86_32
+ return __arch_hweight32((u32)w) +
+ __arch_hweight32((u32)(w >> 32));
+#else
+ asm (ALTERNATIVE("call __sw_hweight64", POPCNT, X86_FEATURE_POPCNT)
+ : "="REG_OUT (res)
+ : REG_IN (w));
+#endif /* CONFIG_X86_32 */
+
+ return res;
+}
+
+#endif
diff --git a/arch/x86/include/asm/bitops.h b/arch/x86/include/asm/bitops.h
index 02b47a6..545776e 100644
--- a/arch/x86/include/asm/bitops.h
+++ b/arch/x86/include/asm/bitops.h
@@ -444,7 +444,9 @@ static inline int fls(int x)
#define ARCH_HAS_FAST_MULTIPLIER 1
-#include <asm-generic/bitops/hweight.h>
+#include <asm/arch_hweight.h>
+
+#include <asm-generic/bitops/const_hweight.h>
#endif /* __KERNEL__ */
diff --git a/include/asm-generic/bitops/arch_hweight.h b/include/asm-generic/bitops/arch_hweight.h
index 3a7be84..9a81c1e 100644
--- a/include/asm-generic/bitops/arch_hweight.h
+++ b/include/asm-generic/bitops/arch_hweight.h
@@ -3,9 +3,23 @@
#include <asm/types.h>
-extern unsigned int __arch_hweight32(unsigned int w);
-extern unsigned int __arch_hweight16(unsigned int w);
-extern unsigned int __arch_hweight8(unsigned int w);
-extern unsigned long __arch_hweight64(__u64 w);
+inline unsigned int __arch_hweight32(unsigned int w)
+{
+ return __sw_hweight32(w);
+}
+inline unsigned int __arch_hweight16(unsigned int w)
+{
+ return __sw_hweight16(w);
+}
+
+inline unsigned int __arch_hweight8(unsigned int w)
+{
+ return __sw_hweight8(w);
+}
+
+inline unsigned long __arch_hweight64(__u64 w)
+{
+ return __sw_hweight64(w);
+}
#endif /* _ASM_GENERIC_BITOPS_HWEIGHT_H_ */
diff --git a/lib/Makefile b/lib/Makefile
index 3b0b4a6..e2ad17c 100644
--- a/lib/Makefile
+++ b/lib/Makefile
@@ -39,7 +39,10 @@ lib-$(CONFIG_RWSEM_XCHGADD_ALGORITHM) += rwsem.o
lib-$(CONFIG_GENERIC_FIND_FIRST_BIT) += find_next_bit.o
lib-$(CONFIG_GENERIC_FIND_NEXT_BIT) += find_next_bit.o
obj-$(CONFIG_GENERIC_FIND_LAST_BIT) += find_last_bit.o
+
+CFLAGS_hweight.o = $(subst $(quote),,$(CONFIG_ARCH_HWEIGHT_CFLAGS))
obj-$(CONFIG_GENERIC_HWEIGHT) += hweight.o
+
obj-$(CONFIG_LOCK_KERNEL) += kernel_lock.o
obj-$(CONFIG_DEBUG_PREEMPT) += smp_processor_id.o
obj-$(CONFIG_DEBUG_LIST) += list_debug.o
diff --git a/lib/hweight.c b/lib/hweight.c
index 9ff86df..f9ce440 100644
--- a/lib/hweight.c
+++ b/lib/hweight.c
@@ -9,7 +9,7 @@
* The Hamming Weight of a number is the total number of bits set in it.
*/
-unsigned int __arch_hweight32(unsigned int w)
+unsigned int __sw_hweight32(unsigned int w)
{
unsigned int res = w - ((w >> 1) & 0x55555555);
res = (res & 0x33333333) + ((res >> 2) & 0x33333333);
@@ -17,30 +17,30 @@ unsigned int __arch_hweight32(unsigned int w)
res = res + (res >> 8);
return (res + (res >> 16)) & 0x000000FF;
}
-EXPORT_SYMBOL(__arch_hweight32);
+EXPORT_SYMBOL(__sw_hweight32);
-unsigned int __arch_hweight16(unsigned int w)
+unsigned int __sw_hweight16(unsigned int w)
{
unsigned int res = w - ((w >> 1) & 0x5555);
res = (res & 0x3333) + ((res >> 2) & 0x3333);
res = (res + (res >> 4)) & 0x0F0F;
return (res + (res >> 8)) & 0x00FF;
}
-EXPORT_SYMBOL(__arch_hweight16);
+EXPORT_SYMBOL(__sw_hweight16);
-unsigned int __arch_hweight8(unsigned int w)
+unsigned int __sw_hweight8(unsigned int w)
{
unsigned int res = w - ((w >> 1) & 0x55);
res = (res & 0x33) + ((res >> 2) & 0x33);
return (res + (res >> 4)) & 0x0F;
}
-EXPORT_SYMBOL(__arch_hweight8);
+EXPORT_SYMBOL(__sw_hweight8);
-unsigned long __arch_hweight64(__u64 w)
+unsigned long __sw_hweight64(__u64 w)
{
#if BITS_PER_LONG == 32
- return __arch_hweight32((unsigned int)(w >> 32)) +
- __arch_hweight32((unsigned int)w);
+ return __sw_hweight32((unsigned int)(w >> 32)) +
+ __sw_hweight32((unsigned int)w);
#elif BITS_PER_LONG == 64
#ifdef ARCH_HAS_FAST_MULTIPLIER
w -= (w >> 1) & 0x5555555555555555ul;
@@ -57,4 +57,4 @@ unsigned long __arch_hweight64(__u64 w)
#endif
#endif
}
-EXPORT_SYMBOL(__arch_hweight64);
+EXPORT_SYMBOL(__sw_hweight64);
diff --git a/scripts/Makefile.lib b/scripts/Makefile.lib
index f9bdf26..cbcd654 100644
--- a/scripts/Makefile.lib
+++ b/scripts/Makefile.lib
@@ -245,3 +245,7 @@ quiet_cmd_lzo = LZO $@
cmd_lzo = (cat $(filter-out FORCE,$^) | \
lzop -9 && $(call size_append, $(filter-out FORCE,$^))) > $@ || \
(rm -f $@ ; false)
+
+# misc stuff
+# ---------------------------------------------------------------------------
+quote:="
--
1.6.4.2
--
Regards/Gruss,
Boris.
-
Advanced Micro Devices, Inc.
Operating Systems Research Center
* Re: [PATCH] x86: Add optimized popcnt variants
2010-02-22 14:17 ` Borislav Petkov
@ 2010-02-22 17:21 ` H. Peter Anvin
2010-02-22 18:49 ` Borislav Petkov
0 siblings, 1 reply; 37+ messages in thread
From: H. Peter Anvin @ 2010-02-22 17:21 UTC
To: Borislav Petkov
Cc: Michal Marek, linux-kbuild, Peter Zijlstra, Andrew Morton,
Wu Fengguang, LKML, Jamie Lokier, Roland Dreier, Al Viro,
linux-fsdevel@vger.kernel.org, Ingo Molnar, Brian Gerst
On 02/22/2010 06:17 AM, Borislav Petkov wrote:
>
> +config ARCH_HWEIGHT_CFLAGS
> + string
> + default "-fcall-saved-rsi -fcall-saved-rdx -fcall-saved-rcx -fcall-saved-r8 -fcall-saved-r9 -fcall-saved-r10 -fcall-saved-r11" if X86_64
> +
[...]
> +
> +#ifdef CONFIG_64BIT
> +/* popcnt %rdi, %rax */
> +#define POPCNT ".byte 0xf3\n\t.byte 0x48\n\t.byte 0x0f\n\t.byte 0xb8\n\t.byte 0xc7"
> +#define REG_IN "D"
> +#define REG_OUT "a"
> +#else
Just a note: this still means rdi is clobbered on x86-64, which is
probably fine, but needs to be recorded as such. Since gcc doesn't
support clobbers for registers used as operands (sigh), you have to
create a dummy output and assign it a "=D" constraint.
I don't know if gcc would handle -fcall-saved-rdi here... and if so, how
reliably.
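I.e., the dummy-output idiom would look something like this sketch (the
helper name is made up):

	static inline unsigned long foo(unsigned long w)
	{
		unsigned long res, dummy;

		/* the "=D" dummy output tells gcc that %rdi, which
		 * also carries the input, is clobbered by the call */
		asm ("call __some_helper"
		     : "=a" (res), "=D" (dummy)
		     : "D" (w));
		return res;
	}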
-hpa
--
H. Peter Anvin, Intel Open Source Technology Center
I work for Intel. I don't speak on their behalf.
* Re: [PATCH] x86: Add optimized popcnt variants
2010-02-22 17:21 ` H. Peter Anvin
@ 2010-02-22 18:49 ` Borislav Petkov
2010-02-22 19:55 ` H. Peter Anvin
0 siblings, 1 reply; 37+ messages in thread
From: Borislav Petkov @ 2010-02-22 18:49 UTC
To: H. Peter Anvin
Cc: Borislav Petkov, Michal Marek, linux-kbuild, Peter Zijlstra,
Andrew Morton, Wu Fengguang, LKML, Jamie Lokier, Roland Dreier,
Al Viro, linux-fsdevel@vger.kernel.org, Ingo Molnar, Brian Gerst
From: "H. Peter Anvin" <hpa@zytor.com>
Date: Mon, Feb 22, 2010 at 09:21:05AM -0800
> On 02/22/2010 06:17 AM, Borislav Petkov wrote:
> >
> > +config ARCH_HWEIGHT_CFLAGS
> > + string
> > + default "-fcall-saved-rsi -fcall-saved-rdx -fcall-saved-rcx -fcall-saved-r8 -fcall-saved-r9 -fcall-saved-r10 -fcall-saved-r11" if X86_64
> > +
>
> [...]
>
> > +
> > +#ifdef CONFIG_64BIT
> > +/* popcnt %rdi, %rax */
> > +#define POPCNT ".byte 0xf3\n\t.byte 0x48\n\t.byte 0x0f\n\t.byte 0xb8\n\t.byte 0xc7"
> > +#define REG_IN "D"
> > +#define REG_OUT "a"
> > +#else
>
> Just a note: this still means rdi is clobbered on x86-64, which is
> probably fine, but needs to be recorded as such. Since gcc doesn't
> support clobbers for registers used as operands (sigh), you have to
> create a dummy output and assign it a "=D" constraint.
>
> I don't know if gcc would handle -fcall-saved-rdi here... and if so, how
> reliably.
Ok, from looking at kernel/sched.s output it looks like it saves rdi
content over the alternative where needed. I'll do some more testing
just to make sure.
--
From: Borislav Petkov <borislav.petkov@amd.com>
Date: Thu, 11 Feb 2010 00:48:31 +0100
Subject: [PATCH] x86: Add optimized popcnt variants
Add support for the hardware version of the Hamming weight function,
popcnt, present in CPUs which advertise it under CPUID, Function
0x0000_0001_ECX[23]. On CPUs which don't support it, we fall back to the
default lib/hweight.c sw versions.
A synthetic benchmark comparing popcnt with __sw_hweight64 showed almost
a 3x speedup on an F10h machine.
Signed-off-by: Borislav Petkov <borislav.petkov@amd.com>
---
arch/x86/Kconfig | 5 ++
arch/x86/include/asm/alternative.h | 9 +++-
arch/x86/include/asm/arch_hweight.h | 59 +++++++++++++++++++++++++++++
arch/x86/include/asm/bitops.h | 4 +-
include/asm-generic/bitops/arch_hweight.h | 22 +++++++++--
lib/Makefile | 3 +
lib/hweight.c | 20 +++++-----
scripts/Makefile.lib | 4 ++
8 files changed, 108 insertions(+), 18 deletions(-)
create mode 100644 arch/x86/include/asm/arch_hweight.h
diff --git a/arch/x86/Kconfig b/arch/x86/Kconfig
index eb40925..176950e 100644
--- a/arch/x86/Kconfig
+++ b/arch/x86/Kconfig
@@ -230,6 +230,11 @@ config X86_32_LAZY_GS
def_bool y
depends on X86_32 && !CC_STACKPROTECTOR
+config ARCH_HWEIGHT_CFLAGS
+ string
+ default "-fcall-saved-ecx -fcall-saved-edx" if X86_32
+ default "-fcall-saved-rsi -fcall-saved-rdx -fcall-saved-rcx -fcall-saved-r8 -fcall-saved-r9 -fcall-saved-r10 -fcall-saved-r11" if X86_64
+
config KTIME_SCALAR
def_bool X86_32
source "init/Kconfig"
diff --git a/arch/x86/include/asm/alternative.h b/arch/x86/include/asm/alternative.h
index 69b74a7..0720c96 100644
--- a/arch/x86/include/asm/alternative.h
+++ b/arch/x86/include/asm/alternative.h
@@ -39,9 +39,6 @@
#define LOCK_PREFIX ""
#endif
-/* This must be included *after* the definition of LOCK_PREFIX */
-#include <asm/cpufeature.h>
-
struct alt_instr {
u8 *instr; /* original instruction */
u8 *replacement;
@@ -91,6 +88,12 @@ static inline void alternatives_smp_switch(int smp) {}
".previous"
/*
+ * This must be included *after* the definition of ALTERNATIVE due to
+ * <asm/arch_hweight.h>
+ */
+#include <asm/cpufeature.h>
+
+/*
* Alternative instructions for different CPU types or capabilities.
*
* This allows to use optimized instructions even on generic binary
diff --git a/arch/x86/include/asm/arch_hweight.h b/arch/x86/include/asm/arch_hweight.h
new file mode 100644
index 0000000..cc3a188
--- /dev/null
+++ b/arch/x86/include/asm/arch_hweight.h
@@ -0,0 +1,59 @@
+#ifndef _ASM_X86_HWEIGHT_H
+#define _ASM_X86_HWEIGHT_H
+
+#ifdef CONFIG_64BIT
+/* popcnt %rdi, %rax */
+#define POPCNT ".byte 0xf3\n\t.byte 0x48\n\t.byte 0x0f\n\t.byte 0xb8\n\t.byte 0xc7"
+#define REG_IN "D"
+#define REG_OUT "a"
+#else
+/* popcnt %eax, %eax */
+#define POPCNT ".byte 0xf3\n\t.byte 0x0f\n\t.byte 0xb8\n\t.byte 0xc0"
+#define REG_IN "a"
+#define REG_OUT "a"
+#endif
+
+/*
+ * __sw_hweightXX are called from within the alternatives below
+ * and callee-clobbered registers need to be taken care of. See
+ * ARCH_HWEIGHT_CFLAGS in <arch/x86/Kconfig> for the respective
+ * compiler switches.
+ */
+static inline unsigned int __arch_hweight32(unsigned int w)
+{
+ unsigned int res = 0;
+
+ asm (ALTERNATIVE("call __sw_hweight32", POPCNT, X86_FEATURE_POPCNT)
+ : "="REG_OUT (res)
+ : REG_IN (w));
+
+ return res;
+}
+
+static inline unsigned int __arch_hweight16(unsigned int w)
+{
+ return __arch_hweight32(w & 0xffff);
+}
+
+static inline unsigned int __arch_hweight8(unsigned int w)
+{
+ return __arch_hweight32(w & 0xff);
+}
+
+static inline unsigned long __arch_hweight64(__u64 w)
+{
+ unsigned long res = 0, dummy;
+
+#ifdef CONFIG_X86_32
+ return __arch_hweight32((u32)w) +
+ __arch_hweight32((u32)(w >> 32));
+#else
+ asm (ALTERNATIVE("call __sw_hweight64", POPCNT, X86_FEATURE_POPCNT)
+ : "="REG_OUT (res), "="REG_IN (dummy)
+ : REG_IN (w));
+#endif /* CONFIG_X86_32 */
+
+ return res;
+}
+
+#endif
diff --git a/arch/x86/include/asm/bitops.h b/arch/x86/include/asm/bitops.h
index 02b47a6..545776e 100644
--- a/arch/x86/include/asm/bitops.h
+++ b/arch/x86/include/asm/bitops.h
@@ -444,7 +444,9 @@ static inline int fls(int x)
#define ARCH_HAS_FAST_MULTIPLIER 1
-#include <asm-generic/bitops/hweight.h>
+#include <asm/arch_hweight.h>
+
+#include <asm-generic/bitops/const_hweight.h>
#endif /* __KERNEL__ */
diff --git a/include/asm-generic/bitops/arch_hweight.h b/include/asm-generic/bitops/arch_hweight.h
index 3a7be84..9a81c1e 100644
--- a/include/asm-generic/bitops/arch_hweight.h
+++ b/include/asm-generic/bitops/arch_hweight.h
@@ -3,9 +3,23 @@
#include <asm/types.h>
-extern unsigned int __arch_hweight32(unsigned int w);
-extern unsigned int __arch_hweight16(unsigned int w);
-extern unsigned int __arch_hweight8(unsigned int w);
-extern unsigned long __arch_hweight64(__u64 w);
+inline unsigned int __arch_hweight32(unsigned int w)
+{
+ return __sw_hweight32(w);
+}
+inline unsigned int __arch_hweight16(unsigned int w)
+{
+ return __sw_hweight16(w);
+}
+
+inline unsigned int __arch_hweight8(unsigned int w)
+{
+ return __sw_hweight8(w);
+}
+
+inline unsigned long __arch_hweight64(__u64 w)
+{
+ return __sw_hweight64(w);
+}
#endif /* _ASM_GENERIC_BITOPS_HWEIGHT_H_ */
diff --git a/lib/Makefile b/lib/Makefile
index 3b0b4a6..e2ad17c 100644
--- a/lib/Makefile
+++ b/lib/Makefile
@@ -39,7 +39,10 @@ lib-$(CONFIG_RWSEM_XCHGADD_ALGORITHM) += rwsem.o
lib-$(CONFIG_GENERIC_FIND_FIRST_BIT) += find_next_bit.o
lib-$(CONFIG_GENERIC_FIND_NEXT_BIT) += find_next_bit.o
obj-$(CONFIG_GENERIC_FIND_LAST_BIT) += find_last_bit.o
+
+CFLAGS_hweight.o = $(subst $(quote),,$(CONFIG_ARCH_HWEIGHT_CFLAGS))
obj-$(CONFIG_GENERIC_HWEIGHT) += hweight.o
+
obj-$(CONFIG_LOCK_KERNEL) += kernel_lock.o
obj-$(CONFIG_DEBUG_PREEMPT) += smp_processor_id.o
obj-$(CONFIG_DEBUG_LIST) += list_debug.o
diff --git a/lib/hweight.c b/lib/hweight.c
index 9ff86df..f9ce440 100644
--- a/lib/hweight.c
+++ b/lib/hweight.c
@@ -9,7 +9,7 @@
* The Hamming Weight of a number is the total number of bits set in it.
*/
-unsigned int __arch_hweight32(unsigned int w)
+unsigned int __sw_hweight32(unsigned int w)
{
unsigned int res = w - ((w >> 1) & 0x55555555);
res = (res & 0x33333333) + ((res >> 2) & 0x33333333);
@@ -17,30 +17,30 @@ unsigned int __arch_hweight32(unsigned int w)
res = res + (res >> 8);
return (res + (res >> 16)) & 0x000000FF;
}
-EXPORT_SYMBOL(__arch_hweight32);
+EXPORT_SYMBOL(__sw_hweight32);
-unsigned int __arch_hweight16(unsigned int w)
+unsigned int __sw_hweight16(unsigned int w)
{
unsigned int res = w - ((w >> 1) & 0x5555);
res = (res & 0x3333) + ((res >> 2) & 0x3333);
res = (res + (res >> 4)) & 0x0F0F;
return (res + (res >> 8)) & 0x00FF;
}
-EXPORT_SYMBOL(__arch_hweight16);
+EXPORT_SYMBOL(__sw_hweight16);
-unsigned int __arch_hweight8(unsigned int w)
+unsigned int __sw_hweight8(unsigned int w)
{
unsigned int res = w - ((w >> 1) & 0x55);
res = (res & 0x33) + ((res >> 2) & 0x33);
return (res + (res >> 4)) & 0x0F;
}
-EXPORT_SYMBOL(__arch_hweight8);
+EXPORT_SYMBOL(__sw_hweight8);
-unsigned long __arch_hweight64(__u64 w)
+unsigned long __sw_hweight64(__u64 w)
{
#if BITS_PER_LONG == 32
- return __arch_hweight32((unsigned int)(w >> 32)) +
- __arch_hweight32((unsigned int)w);
+ return __sw_hweight32((unsigned int)(w >> 32)) +
+ __sw_hweight32((unsigned int)w);
#elif BITS_PER_LONG == 64
#ifdef ARCH_HAS_FAST_MULTIPLIER
w -= (w >> 1) & 0x5555555555555555ul;
@@ -57,4 +57,4 @@ unsigned long __arch_hweight64(__u64 w)
#endif
#endif
}
-EXPORT_SYMBOL(__arch_hweight64);
+EXPORT_SYMBOL(__sw_hweight64);
diff --git a/scripts/Makefile.lib b/scripts/Makefile.lib
index f9bdf26..cbcd654 100644
--- a/scripts/Makefile.lib
+++ b/scripts/Makefile.lib
@@ -245,3 +245,7 @@ quiet_cmd_lzo = LZO $@
cmd_lzo = (cat $(filter-out FORCE,$^) | \
lzop -9 && $(call size_append, $(filter-out FORCE,$^))) > $@ || \
(rm -f $@ ; false)
+
+# misc stuff
+# ---------------------------------------------------------------------------
+quote:="
--
1.6.4.2
--
Regards/Gruss,
Boris.
--
Advanced Micro Devices, Inc.
Operating Systems Research Center
* Re: [PATCH] x86: Add optimized popcnt variants
2010-02-22 18:49 ` Borislav Petkov
@ 2010-02-22 19:55 ` H. Peter Anvin
2010-02-23 6:37 ` Borislav Petkov
2010-02-23 15:58 ` Borislav Petkov
0 siblings, 2 replies; 37+ messages in thread
From: H. Peter Anvin @ 2010-02-22 19:55 UTC
To: Borislav Petkov
Cc: Michal Marek, linux-kbuild, Peter Zijlstra, Andrew Morton,
Wu Fengguang, LKML, Jamie Lokier, Roland Dreier, Al Viro,
linux-fsdevel@vger.kernel.org, Ingo Molnar, Brian Gerst
On 02/22/2010 10:49 AM, Borislav Petkov wrote:
>>
>> Just a note: this still means rdi is clobbered on x86-64, which is
>> probably fine, but needs to be recorded as such. Since gcc doesn't
>> support clobbers for registers used as operands (sigh), you have to
>> create a dummy output and assign it a "=D" constraint.
>>
>> I don't know if gcc would handle -fcall-saved-rdi here... and if so, how
>> reliably.
>
> Ok, from looking at kernel/sched.s output it looks like it saves rdi
> content over the alternative where needed. I'll do some more testing
> just to make sure.
>
No, you can't rely on behavioral observation. A different version of
gcc could behave differently. We need to make sure we tell gcc what the
requirements actually are, as opposed to thinking we can just fix them.
+#define POPCNT ".byte 0xf3\n\t.byte 0x48\n\t.byte 0x0f\n\t.byte 0xb8\n\t.byte 0xc7"
BTW, this can be written:
#define POPCNT ".byte 0xf3,0x48,0x0f,0xb8,0xc7"
-hpa
* Re: [PATCH] x86: Add optimized popcnt variants
2010-02-22 19:55 ` H. Peter Anvin
@ 2010-02-23 6:37 ` Borislav Petkov
2010-02-23 15:58 ` Borislav Petkov
1 sibling, 0 replies; 37+ messages in thread
From: Borislav Petkov @ 2010-02-23 6:37 UTC
To: H. Peter Anvin
Cc: Borislav Petkov, Michal Marek, linux-kbuild, Peter Zijlstra,
Andrew Morton, Wu Fengguang, LKML, Jamie Lokier, Roland Dreier,
Al Viro, linux-fsdevel@vger.kernel.org, Ingo Molnar, Brian Gerst
From: "H. Peter Anvin" <hpa@zytor.com>
Date: Mon, Feb 22, 2010 at 11:55:48AM -0800
> On 02/22/2010 10:49 AM, Borislav Petkov wrote:
> >>
> >> Just a note: this still means rdi is clobbered on x86-64, which is
> >> probably fine, but needs to be recorded as such. Since gcc doesn't
> >> support clobbers for registers used as operands (sigh), you have to
> >> create a dummy output and assign it a "=D" constraint.
> >>
> >> I don't know if gcc would handle -fcall-saved-rdi here... and if so, how
> >> reliably.
> >
> > Ok, from looking at kernel/sched.s output it looks like it saves rdi
> > content over the alternative where needed. I'll do some more testing
> > just to make sure.
> >
>
> No, you can't rely on behavioral observation. A different version of
> gcc could behave differently. We need to make sure we tell gcc what the
> requirements actually are, as opposed to thinking we can just fix them.
Ok, I've added the dummy "=D" constraint since it sounded like the more
robust option WRT gcc versions. BTW, I left the machine overnight
and it is still alive cheerfully building randconfigs.
I'll try the -fcall-saved-rdi thing also later today.
> +#define POPCNT ".byte 0xf3\n\t.byte 0x48\n\t.byte 0x0f\n\t.byte
> 0xb8\n\t.byte 0xc7"
>
> BTW, this can be written:
>
> #define POPCNT ".byte 0xf3,0x48,0x0f,0xb8,0xc7"
done, updated version coming up.
--
Regards/Gruss,
Boris.
* Re: [PATCH] x86: Add optimized popcnt variants
2010-02-22 19:55 ` H. Peter Anvin
2010-02-23 6:37 ` Borislav Petkov
@ 2010-02-23 15:58 ` Borislav Petkov
2010-02-23 17:34 ` H. Peter Anvin
1 sibling, 1 reply; 37+ messages in thread
From: Borislav Petkov @ 2010-02-23 15:58 UTC
To: H. Peter Anvin
Cc: Borislav Petkov, Michal Marek, linux-kbuild, Peter Zijlstra,
Andrew Morton, Wu Fengguang, LKML, Jamie Lokier, Roland Dreier,
Al Viro, linux-fsdevel@vger.kernel.org, Ingo Molnar, Brian Gerst
From: "H. Peter Anvin" <hpa@zytor.com>
Date: Mon, Feb 22, 2010 at 11:55:48AM -0800
> On 02/22/2010 10:49 AM, Borislav Petkov wrote:
> >>
> >> Just a note: this still means rdi is clobbered on x86-64, which is
> >> probably fine, but needs to be recorded as such. Since gcc doesn't
> >> support clobbers for registers used as operands (sigh), you have to
> >> create a dummy output and assign it a "=D" constraint.
> >>
> >> I don't know if gcc would handle -fcall-saved-rdi here... and if so, how
> >> reliably.
Hmm, we cannot do that with the current design since __arch_hweight64
is being inlined into every call site, and AFAICT we would have to build
every call site with "-fcall-saved-rdi", which is clearly too much. The
explicit "=D" dummy constraint is the straightforward option instead.
> > Ok, from looking at kernel/sched.s output it looks like it saves rdi
> > content over the alternative where needed. I'll do some more testing
> > just to make sure.
> >
>
> No, you can't rely on behavioral observation. A different version of
> gcc could behave differently. We need to make sure we tell gcc what the
> requirements actually are, as opposed to thinking we can just fix them.
Just to be clear: those observations referred to the version _with_
the dummy "=D" constraint - I was simply verifying the asm output, not
relying on behavioral observation.
> +#define POPCNT ".byte 0xf3\n\t.byte 0x48\n\t.byte 0x0f\n\t.byte
> 0xb8\n\t.byte 0xc7"
>
> BTW, this can be written:
>
> #define POPCNT ".byte 0xf3,0x48,0x0f,0xb8,0xc7"
Done. Here's the latest version; it boots fine and testing is ongoing:
--
From: Borislav Petkov <borislav.petkov@amd.com>
Date: Thu, 11 Feb 2010 00:48:31 +0100
Subject: [PATCH] x86: Add optimized popcnt variants
Add support for the hardware version of the Hamming weight function,
popcnt, present in CPUs which advertise it under CPUID, Function
0x0000_0001_ECX[23]. On CPUs which don't support it, we fall back to the
default lib/hweight.c sw versions.
A synthetic benchmark comparing popcnt with __sw_hweight64 showed almost
a 3x speedup on an F10h machine.
Signed-off-by: Borislav Petkov <borislav.petkov@amd.com>
---
arch/x86/Kconfig | 5 ++
arch/x86/include/asm/alternative.h | 9 +++-
arch/x86/include/asm/arch_hweight.h | 61 +++++++++++++++++++++++++++++
arch/x86/include/asm/bitops.h | 4 +-
include/asm-generic/bitops/arch_hweight.h | 22 ++++++++--
lib/Makefile | 3 +
lib/hweight.c | 20 +++++-----
scripts/Makefile.lib | 4 ++
8 files changed, 110 insertions(+), 18 deletions(-)
create mode 100644 arch/x86/include/asm/arch_hweight.h
diff --git a/arch/x86/Kconfig b/arch/x86/Kconfig
index eb40925..176950e 100644
--- a/arch/x86/Kconfig
+++ b/arch/x86/Kconfig
@@ -230,6 +230,11 @@ config X86_32_LAZY_GS
def_bool y
depends on X86_32 && !CC_STACKPROTECTOR
+config ARCH_HWEIGHT_CFLAGS
+ string
+ default "-fcall-saved-ecx -fcall-saved-edx" if X86_32
+ default "-fcall-saved-rsi -fcall-saved-rdx -fcall-saved-rcx -fcall-saved-r8 -fcall-saved-r9 -fcall-saved-r10 -fcall-saved-r11" if X86_64
+
config KTIME_SCALAR
def_bool X86_32
source "init/Kconfig"
diff --git a/arch/x86/include/asm/alternative.h b/arch/x86/include/asm/alternative.h
index 69b74a7..0720c96 100644
--- a/arch/x86/include/asm/alternative.h
+++ b/arch/x86/include/asm/alternative.h
@@ -39,9 +39,6 @@
#define LOCK_PREFIX ""
#endif
-/* This must be included *after* the definition of LOCK_PREFIX */
-#include <asm/cpufeature.h>
-
struct alt_instr {
u8 *instr; /* original instruction */
u8 *replacement;
@@ -91,6 +88,12 @@ static inline void alternatives_smp_switch(int smp) {}
".previous"
/*
+ * This must be included *after* the definition of ALTERNATIVE due to
+ * <asm/arch_hweight.h>
+ */
+#include <asm/cpufeature.h>
+
+/*
* Alternative instructions for different CPU types or capabilities.
*
* This allows to use optimized instructions even on generic binary
diff --git a/arch/x86/include/asm/arch_hweight.h b/arch/x86/include/asm/arch_hweight.h
new file mode 100644
index 0000000..eaefadc
--- /dev/null
+++ b/arch/x86/include/asm/arch_hweight.h
@@ -0,0 +1,61 @@
+#ifndef _ASM_X86_HWEIGHT_H
+#define _ASM_X86_HWEIGHT_H
+
+#ifdef CONFIG_64BIT
+/* popcnt %rdi, %rax */
+#define POPCNT ".byte 0xf3,0x48,0x0f,0xb8,0xc7"
+#define REG_IN "D"
+#define REG_OUT "a"
+#else
+/* popcnt %eax, %eax */
+#define POPCNT ".byte 0xf3,0x0f,0xb8,0xc0"
+#define REG_IN "a"
+#define REG_OUT "a"
+#endif
+
+/*
+ * __sw_hweightXX are called from within the alternatives below
+ * and callee-clobbered registers need to be taken care of. See
+ * ARCH_HWEIGHT_CFLAGS in <arch/x86/Kconfig> for the respective
+ * compiler switches.
+ */
+static inline unsigned int __arch_hweight32(unsigned int w)
+{
+ unsigned int res = 0;
+
+ asm (ALTERNATIVE("call __sw_hweight32", POPCNT, X86_FEATURE_POPCNT)
+ : "="REG_OUT (res)
+ : REG_IN (w));
+
+ return res;
+}
+
+static inline unsigned int __arch_hweight16(unsigned int w)
+{
+ return __arch_hweight32(w & 0xffff);
+}
+
+static inline unsigned int __arch_hweight8(unsigned int w)
+{
+ return __arch_hweight32(w & 0xff);
+}
+
+static inline unsigned long __arch_hweight64(__u64 w)
+{
+ unsigned long res = 0;
+ /* dummy output: tell gcc that the input register %rdi gets clobbered */
+ unsigned long dummy;
+
+#ifdef CONFIG_X86_32
+ return __arch_hweight32((u32)w) +
+ __arch_hweight32((u32)(w >> 32));
+#else
+ asm (ALTERNATIVE("call __sw_hweight64", POPCNT, X86_FEATURE_POPCNT)
+ : "="REG_OUT (res), "="REG_IN (dummy)
+ : REG_IN (w));
+#endif /* CONFIG_X86_32 */
+
+ return res;
+}
+
+#endif
diff --git a/arch/x86/include/asm/bitops.h b/arch/x86/include/asm/bitops.h
index 02b47a6..545776e 100644
--- a/arch/x86/include/asm/bitops.h
+++ b/arch/x86/include/asm/bitops.h
@@ -444,7 +444,9 @@ static inline int fls(int x)
#define ARCH_HAS_FAST_MULTIPLIER 1
-#include <asm-generic/bitops/hweight.h>
+#include <asm/arch_hweight.h>
+
+#include <asm-generic/bitops/const_hweight.h>
#endif /* __KERNEL__ */
diff --git a/include/asm-generic/bitops/arch_hweight.h b/include/asm-generic/bitops/arch_hweight.h
index 3a7be84..9a81c1e 100644
--- a/include/asm-generic/bitops/arch_hweight.h
+++ b/include/asm-generic/bitops/arch_hweight.h
@@ -3,9 +3,23 @@
#include <asm/types.h>
-extern unsigned int __arch_hweight32(unsigned int w);
-extern unsigned int __arch_hweight16(unsigned int w);
-extern unsigned int __arch_hweight8(unsigned int w);
-extern unsigned long __arch_hweight64(__u64 w);
+inline unsigned int __arch_hweight32(unsigned int w)
+{
+ return __sw_hweight32(w);
+}
+inline unsigned int __arch_hweight16(unsigned int w)
+{
+ return __sw_hweight16(w);
+}
+
+inline unsigned int __arch_hweight8(unsigned int w)
+{
+ return __sw_hweight8(w);
+}
+
+inline unsigned long __arch_hweight64(__u64 w)
+{
+ return __sw_hweight64(w);
+}
#endif /* _ASM_GENERIC_BITOPS_HWEIGHT_H_ */
diff --git a/lib/Makefile b/lib/Makefile
index 3b0b4a6..e2ad17c 100644
--- a/lib/Makefile
+++ b/lib/Makefile
@@ -39,7 +39,10 @@ lib-$(CONFIG_RWSEM_XCHGADD_ALGORITHM) += rwsem.o
lib-$(CONFIG_GENERIC_FIND_FIRST_BIT) += find_next_bit.o
lib-$(CONFIG_GENERIC_FIND_NEXT_BIT) += find_next_bit.o
obj-$(CONFIG_GENERIC_FIND_LAST_BIT) += find_last_bit.o
+
+CFLAGS_hweight.o = $(subst $(quote),,$(CONFIG_ARCH_HWEIGHT_CFLAGS))
obj-$(CONFIG_GENERIC_HWEIGHT) += hweight.o
+
obj-$(CONFIG_LOCK_KERNEL) += kernel_lock.o
obj-$(CONFIG_DEBUG_PREEMPT) += smp_processor_id.o
obj-$(CONFIG_DEBUG_LIST) += list_debug.o
diff --git a/lib/hweight.c b/lib/hweight.c
index 9ff86df..f9ce440 100644
--- a/lib/hweight.c
+++ b/lib/hweight.c
@@ -9,7 +9,7 @@
* The Hamming Weight of a number is the total number of bits set in it.
*/
-unsigned int __arch_hweight32(unsigned int w)
+unsigned int __sw_hweight32(unsigned int w)
{
unsigned int res = w - ((w >> 1) & 0x55555555);
res = (res & 0x33333333) + ((res >> 2) & 0x33333333);
@@ -17,30 +17,30 @@ unsigned int __arch_hweight32(unsigned int w)
res = res + (res >> 8);
return (res + (res >> 16)) & 0x000000FF;
}
-EXPORT_SYMBOL(__arch_hweight32);
+EXPORT_SYMBOL(__sw_hweight32);
-unsigned int __arch_hweight16(unsigned int w)
+unsigned int __sw_hweight16(unsigned int w)
{
unsigned int res = w - ((w >> 1) & 0x5555);
res = (res & 0x3333) + ((res >> 2) & 0x3333);
res = (res + (res >> 4)) & 0x0F0F;
return (res + (res >> 8)) & 0x00FF;
}
-EXPORT_SYMBOL(__arch_hweight16);
+EXPORT_SYMBOL(__sw_hweight16);
-unsigned int __arch_hweight8(unsigned int w)
+unsigned int __sw_hweight8(unsigned int w)
{
unsigned int res = w - ((w >> 1) & 0x55);
res = (res & 0x33) + ((res >> 2) & 0x33);
return (res + (res >> 4)) & 0x0F;
}
-EXPORT_SYMBOL(__arch_hweight8);
+EXPORT_SYMBOL(__sw_hweight8);
-unsigned long __arch_hweight64(__u64 w)
+unsigned long __sw_hweight64(__u64 w)
{
#if BITS_PER_LONG == 32
- return __arch_hweight32((unsigned int)(w >> 32)) +
- __arch_hweight32((unsigned int)w);
+ return __sw_hweight32((unsigned int)(w >> 32)) +
+ __sw_hweight32((unsigned int)w);
#elif BITS_PER_LONG == 64
#ifdef ARCH_HAS_FAST_MULTIPLIER
w -= (w >> 1) & 0x5555555555555555ul;
@@ -57,4 +57,4 @@ unsigned long __arch_hweight64(__u64 w)
#endif
#endif
}
-EXPORT_SYMBOL(__arch_hweight64);
+EXPORT_SYMBOL(__sw_hweight64);
diff --git a/scripts/Makefile.lib b/scripts/Makefile.lib
index f9bdf26..cbcd654 100644
--- a/scripts/Makefile.lib
+++ b/scripts/Makefile.lib
@@ -245,3 +245,7 @@ quiet_cmd_lzo = LZO $@
cmd_lzo = (cat $(filter-out FORCE,$^) | \
lzop -9 && $(call size_append, $(filter-out FORCE,$^))) > $@ || \
(rm -f $@ ; false)
+
+# misc stuff
+# ---------------------------------------------------------------------------
+quote:="
--
1.6.4.2
--
Regards/Gruss,
Boris.
--
Advanced Micro Devices, Inc.
Operating Systems Research Center
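As a quick aside, the CPUID bit the commit message refers to can be
checked from userspace on x86 with GCC's <cpuid.h>; a minimal probe
might look like this (a sketch only, not part of the patch):

	#include <stdio.h>
	#include <cpuid.h>

	int main(void)
	{
		unsigned int eax, ebx, ecx, edx;

		/* CPUID Function 0x0000_0001, POPCNT is ECX bit 23 */
		if (__get_cpuid(1, &eax, &ebx, &ecx, &edx))
			printf("popcnt: %s\n",
			       (ecx & (1u << 23)) ? "yes" : "no");

		return 0;
	}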
^ permalink raw reply related [flat|nested] 37+ messages in thread
* Re: [PATCH] x86: Add optimized popcnt variants
2010-02-23 15:58 ` Borislav Petkov
@ 2010-02-23 17:34 ` H. Peter Anvin
2010-02-23 17:54 ` Borislav Petkov
0 siblings, 1 reply; 37+ messages in thread
From: H. Peter Anvin @ 2010-02-23 17:34 UTC (permalink / raw)
To: Borislav Petkov
Cc: Michal Marek, linux-kbuild, Peter Zijlstra, Andrew Morton,
Wu Fengguang, LKML, Jamie Lokier, Roland Dreier, Al Viro,
linux-fsdevel@vger.kernel.org, Ingo Molnar, Brian Gerst
On 02/23/2010 07:58 AM, Borislav Petkov wrote:
>
> Hmm, we cannot do that with the current design since __arch_hweight64
> is being inlined into every callsite and AFAICT we would have to build
> every callsite with "-fcall-saved-rdi" which is clearly too much. The
> explicit "=D" dummy constraint is straightforward, instead.
>
Uh... the -fcall-saved-rdi would go with all the other ones. Assuming
it can actually work and that gcc doesn't choke on an inbound argument
being saved.
-hpa
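Spelled out, the dummy-constraint variant being weighed here has the
following shape; a sketch only, mirroring the operands from the patch
above, which assumes __sw_hweight64 takes its argument in %rdi and
returns in %rax (it won't link outside the kernel):

	static inline unsigned long hw64_sketch(unsigned long w)
	{
		unsigned long res, dummy;

		/* "=D" marks %rdi as written, so gcc won't assume
		 * that w still lives there after the asm */
		asm ("call __sw_hweight64"
		     : "=a" (res), "=D" (dummy)
		     : "D" (w));

		return res;
	}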
^ permalink raw reply [flat|nested] 37+ messages in thread
* Re: [PATCH] x86: Add optimized popcnt variants
2010-02-23 17:34 ` H. Peter Anvin
@ 2010-02-23 17:54 ` Borislav Petkov
2010-02-23 18:17 ` H. Peter Anvin
0 siblings, 1 reply; 37+ messages in thread
From: Borislav Petkov @ 2010-02-23 17:54 UTC (permalink / raw)
To: H. Peter Anvin
Cc: Borislav Petkov, Michal Marek, linux-kbuild, Peter Zijlstra,
Andrew Morton, Wu Fengguang, LKML, Jamie Lokier, Roland Dreier,
Al Viro, linux-fsdevel@vger.kernel.org, Ingo Molnar, Brian Gerst
From: "H. Peter Anvin" <hpa@zytor.com>
Date: Tue, Feb 23, 2010 at 09:34:04AM -0800
> On 02/23/2010 07:58 AM, Borislav Petkov wrote:
> >
> >Hmm, we cannot do that with the current design since __arch_hweight64
> >is being inlined into every callsite and AFAICT we would have to build
> >every callsite with "-fcall-saved-rdi" which is clearly too much. The
> >explicit "=D" dummy constraint is straightforward, instead.
> >
>
> Uh... the -fcall-saved-rdi would go with all the other ones.
> Assuming it can actually work and that gcc doesn't choke on an
> inbound argument being saved.
Right, doh. Ok, just added it and it builds fine with a gcc (Gentoo
4.4.1 p1.0) 4.4.1. If you have suspicion that some older gcc versions
might choke on it, I could leave the "=D" dummy constraint in?
BTW, the current version screams
/usr/src/linux-2.6/arch/x86/include/asm/arch_hweight.h: In function ‘__arch_hweight64’:
/usr/src/linux-2.6/arch/x86/include/asm/arch_hweight.h:47: warning: unused variable ‘dummy’
on x86-32. I'll send a fixed version in a second.
--
Regards/Gruss,
Boris.
-
Advanced Micro Devices, Inc.
Operating Systems Research Center
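The warning comes from the 32-bit path in the patch: the early return
happens before 'dummy' is ever referenced. A minimal shape of the
problem, with a hypothetical THIRTY_TWO_BIT standing in for
CONFIG_X86_32:

	static inline unsigned long f(unsigned long long w)
	{
		unsigned long res = 0;
		unsigned long dummy;		/* only used in the #else branch */

	#ifdef THIRTY_TWO_BIT
		return (unsigned long)w;	/* 'dummy' unreferenced: gcc warns */
	#else
		asm ("" : "=r" (res), "=D" (dummy) : "D" (w));
	#endif
		return res;
	}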
^ permalink raw reply [flat|nested] 37+ messages in thread
* Re: [PATCH] x86: Add optimized popcnt variants
2010-02-23 17:54 ` Borislav Petkov
@ 2010-02-23 18:17 ` H. Peter Anvin
2010-02-23 19:06 ` Borislav Petkov
0 siblings, 1 reply; 37+ messages in thread
From: H. Peter Anvin @ 2010-02-23 18:17 UTC (permalink / raw)
To: Borislav Petkov
Cc: Michal Marek, linux-kbuild, Peter Zijlstra, Andrew Morton,
Wu Fengguang, LKML, Jamie Lokier, Roland Dreier, Al Viro,
linux-fsdevel@vger.kernel.org, Ingo Molnar, Brian Gerst
On 02/23/2010 09:54 AM, Borislav Petkov wrote:
>
> Right, doh. Ok, just added it and it builds fine with a gcc (Gentoo
> 4.4.1 p1.0) 4.4.1. If you have suspicion that some older gcc versions
> might choke on it, I could leave the "=D" dummy constraint in?
>
I can try it with gcc 3.4 here. -fcall-saved-rdi is cleaner, if it works.
-hpa
--
H. Peter Anvin, Intel Open Source Technology Center
I work for Intel. I don't speak on their behalf.
^ permalink raw reply [flat|nested] 37+ messages in thread
* Re: [PATCH] x86: Add optimized popcnt variants
2010-02-23 18:17 ` H. Peter Anvin
@ 2010-02-23 19:06 ` Borislav Petkov
2010-02-26 5:27 ` H. Peter Anvin
0 siblings, 1 reply; 37+ messages in thread
From: Borislav Petkov @ 2010-02-23 19:06 UTC (permalink / raw)
To: H. Peter Anvin
Cc: Borislav Petkov, Michal Marek, linux-kbuild, Peter Zijlstra,
Andrew Morton, Wu Fengguang, LKML, Jamie Lokier, Roland Dreier,
Al Viro, linux-fsdevel@vger.kernel.org, Ingo Molnar, Brian Gerst
From: "H. Peter Anvin" <hpa@zytor.com>
Date: Tue, Feb 23, 2010 at 10:17:39AM -0800
> On 02/23/2010 09:54 AM, Borislav Petkov wrote:
> >
> > Right, doh. Ok, just added it and it builds fine with a gcc (Gentoo
> > 4.4.1 p1.0) 4.4.1. If you have suspicion that some older gcc versions
> > might choke on it, I could leave the "=D" dummy constraint in?
> >
>
> I can try it with gcc 3.4 here. -fcall-saved-rdi is cleaner, if it works.
Ok, here you go.
--
From: Borislav Petkov <borislav.petkov@amd.com>
Date: Thu, 11 Feb 2010 00:48:31 +0100
Subject: [PATCH] x86: Add optimized popcnt variants
Add support for the hardware version of the Hamming weight function,
popcnt, present in CPUs which advertise it under CPUID, Function
0x0000_0001_ECX[23]. On CPUs which don't support it, we fall back to the
default software versions in lib/hweight.c.
A synthetic benchmark comparing popcnt with __sw_hweight64 showed almost
a 3x speedup on an F10h machine.
Signed-off-by: Borislav Petkov <borislav.petkov@amd.com>
---
arch/x86/Kconfig | 5 ++
arch/x86/include/asm/alternative.h | 9 +++-
arch/x86/include/asm/arch_hweight.h | 59 +++++++++++++++++++++++++++++
arch/x86/include/asm/bitops.h | 4 +-
include/asm-generic/bitops/arch_hweight.h | 22 +++++++++--
lib/Makefile | 3 +
lib/hweight.c | 20 +++++-----
scripts/Makefile.lib | 4 ++
8 files changed, 108 insertions(+), 18 deletions(-)
create mode 100644 arch/x86/include/asm/arch_hweight.h
diff --git a/arch/x86/Kconfig b/arch/x86/Kconfig
index eb40925..4673dc5 100644
--- a/arch/x86/Kconfig
+++ b/arch/x86/Kconfig
@@ -230,6 +230,11 @@ config X86_32_LAZY_GS
def_bool y
depends on X86_32 && !CC_STACKPROTECTOR
+config ARCH_HWEIGHT_CFLAGS
+ string
+ default "-fcall-saved-ecx -fcall-saved-edx" if X86_32
+ default "-fcall-saved-rdi -fcall-saved-rsi -fcall-saved-rdx -fcall-saved-rcx -fcall-saved-r8 -fcall-saved-r9 -fcall-saved-r10 -fcall-saved-r11" if X86_64
+
config KTIME_SCALAR
def_bool X86_32
source "init/Kconfig"
diff --git a/arch/x86/include/asm/alternative.h b/arch/x86/include/asm/alternative.h
index 69b74a7..0720c96 100644
--- a/arch/x86/include/asm/alternative.h
+++ b/arch/x86/include/asm/alternative.h
@@ -39,9 +39,6 @@
#define LOCK_PREFIX ""
#endif
-/* This must be included *after* the definition of LOCK_PREFIX */
-#include <asm/cpufeature.h>
-
struct alt_instr {
u8 *instr; /* original instruction */
u8 *replacement;
@@ -91,6 +88,12 @@ static inline void alternatives_smp_switch(int smp) {}
".previous"
/*
+ * This must be included *after* the definition of ALTERNATIVE due to
+ * <asm/arch_hweight.h>
+ */
+#include <asm/cpufeature.h>
+
+/*
* Alternative instructions for different CPU types or capabilities.
*
* This allows to use optimized instructions even on generic binary
diff --git a/arch/x86/include/asm/arch_hweight.h b/arch/x86/include/asm/arch_hweight.h
new file mode 100644
index 0000000..d1fc3c2
--- /dev/null
+++ b/arch/x86/include/asm/arch_hweight.h
@@ -0,0 +1,59 @@
+#ifndef _ASM_X86_HWEIGHT_H
+#define _ASM_X86_HWEIGHT_H
+
+#ifdef CONFIG_64BIT
+/* popcnt %rdi, %rax */
+#define POPCNT ".byte 0xf3,0x48,0x0f,0xb8,0xc7"
+#define REG_IN "D"
+#define REG_OUT "a"
+#else
+/* popcnt %eax, %eax */
+#define POPCNT ".byte 0xf3,0x0f,0xb8,0xc0"
+#define REG_IN "a"
+#define REG_OUT "a"
+#endif
+
+/*
+ * __sw_hweightXX are called from within the alternatives below
+ * and callee-clobbered registers need to be taken care of. See
+ * ARCH_HWEIGHT_CFLAGS in <arch/x86/Kconfig> for the respective
+ * compiler switches.
+ */
+static inline unsigned int __arch_hweight32(unsigned int w)
+{
+ unsigned int res = 0;
+
+ asm (ALTERNATIVE("call __sw_hweight32", POPCNT, X86_FEATURE_POPCNT)
+ : "="REG_OUT (res)
+ : REG_IN (w));
+
+ return res;
+}
+
+static inline unsigned int __arch_hweight16(unsigned int w)
+{
+ return __arch_hweight32(w & 0xffff);
+}
+
+static inline unsigned int __arch_hweight8(unsigned int w)
+{
+ return __arch_hweight32(w & 0xff);
+}
+
+static inline unsigned long __arch_hweight64(__u64 w)
+{
+ unsigned long res = 0;
+
+#ifdef CONFIG_X86_32
+ return __arch_hweight32((u32)w) +
+ __arch_hweight32((u32)(w >> 32));
+#else
+ asm (ALTERNATIVE("call __sw_hweight64", POPCNT, X86_FEATURE_POPCNT)
+ : "="REG_OUT (res)
+ : REG_IN (w));
+#endif /* CONFIG_X86_32 */
+
+ return res;
+}
+
+#endif
diff --git a/arch/x86/include/asm/bitops.h b/arch/x86/include/asm/bitops.h
index 02b47a6..545776e 100644
--- a/arch/x86/include/asm/bitops.h
+++ b/arch/x86/include/asm/bitops.h
@@ -444,7 +444,9 @@ static inline int fls(int x)
#define ARCH_HAS_FAST_MULTIPLIER 1
-#include <asm-generic/bitops/hweight.h>
+#include <asm/arch_hweight.h>
+
+#include <asm-generic/bitops/const_hweight.h>
#endif /* __KERNEL__ */
diff --git a/include/asm-generic/bitops/arch_hweight.h b/include/asm-generic/bitops/arch_hweight.h
index 3a7be84..9a81c1e 100644
--- a/include/asm-generic/bitops/arch_hweight.h
+++ b/include/asm-generic/bitops/arch_hweight.h
@@ -3,9 +3,23 @@
#include <asm/types.h>
-extern unsigned int __arch_hweight32(unsigned int w);
-extern unsigned int __arch_hweight16(unsigned int w);
-extern unsigned int __arch_hweight8(unsigned int w);
-extern unsigned long __arch_hweight64(__u64 w);
+inline unsigned int __arch_hweight32(unsigned int w)
+{
+ return __sw_hweight32(w);
+}
+inline unsigned int __arch_hweight16(unsigned int w)
+{
+ return __sw_hweight16(w);
+}
+
+inline unsigned int __arch_hweight8(unsigned int w)
+{
+ return __sw_hweight8(w);
+}
+
+inline unsigned long __arch_hweight64(__u64 w)
+{
+ return __sw_hweight64(w);
+}
#endif /* _ASM_GENERIC_BITOPS_HWEIGHT_H_ */
diff --git a/lib/Makefile b/lib/Makefile
index 3b0b4a6..e2ad17c 100644
--- a/lib/Makefile
+++ b/lib/Makefile
@@ -39,7 +39,10 @@ lib-$(CONFIG_RWSEM_XCHGADD_ALGORITHM) += rwsem.o
lib-$(CONFIG_GENERIC_FIND_FIRST_BIT) += find_next_bit.o
lib-$(CONFIG_GENERIC_FIND_NEXT_BIT) += find_next_bit.o
obj-$(CONFIG_GENERIC_FIND_LAST_BIT) += find_last_bit.o
+
+CFLAGS_hweight.o = $(subst $(quote),,$(CONFIG_ARCH_HWEIGHT_CFLAGS))
obj-$(CONFIG_GENERIC_HWEIGHT) += hweight.o
+
obj-$(CONFIG_LOCK_KERNEL) += kernel_lock.o
obj-$(CONFIG_DEBUG_PREEMPT) += smp_processor_id.o
obj-$(CONFIG_DEBUG_LIST) += list_debug.o
diff --git a/lib/hweight.c b/lib/hweight.c
index 9ff86df..f9ce440 100644
--- a/lib/hweight.c
+++ b/lib/hweight.c
@@ -9,7 +9,7 @@
* The Hamming Weight of a number is the total number of bits set in it.
*/
-unsigned int __arch_hweight32(unsigned int w)
+unsigned int __sw_hweight32(unsigned int w)
{
unsigned int res = w - ((w >> 1) & 0x55555555);
res = (res & 0x33333333) + ((res >> 2) & 0x33333333);
@@ -17,30 +17,30 @@ unsigned int __arch_hweight32(unsigned int w)
res = res + (res >> 8);
return (res + (res >> 16)) & 0x000000FF;
}
-EXPORT_SYMBOL(__arch_hweight32);
+EXPORT_SYMBOL(__sw_hweight32);
-unsigned int __arch_hweight16(unsigned int w)
+unsigned int __sw_hweight16(unsigned int w)
{
unsigned int res = w - ((w >> 1) & 0x5555);
res = (res & 0x3333) + ((res >> 2) & 0x3333);
res = (res + (res >> 4)) & 0x0F0F;
return (res + (res >> 8)) & 0x00FF;
}
-EXPORT_SYMBOL(__arch_hweight16);
+EXPORT_SYMBOL(__sw_hweight16);
-unsigned int __arch_hweight8(unsigned int w)
+unsigned int __sw_hweight8(unsigned int w)
{
unsigned int res = w - ((w >> 1) & 0x55);
res = (res & 0x33) + ((res >> 2) & 0x33);
return (res + (res >> 4)) & 0x0F;
}
-EXPORT_SYMBOL(__arch_hweight8);
+EXPORT_SYMBOL(__sw_hweight8);
-unsigned long __arch_hweight64(__u64 w)
+unsigned long __sw_hweight64(__u64 w)
{
#if BITS_PER_LONG == 32
- return __arch_hweight32((unsigned int)(w >> 32)) +
- __arch_hweight32((unsigned int)w);
+ return __sw_hweight32((unsigned int)(w >> 32)) +
+ __sw_hweight32((unsigned int)w);
#elif BITS_PER_LONG == 64
#ifdef ARCH_HAS_FAST_MULTIPLIER
w -= (w >> 1) & 0x5555555555555555ul;
@@ -57,4 +57,4 @@ unsigned long __arch_hweight64(__u64 w)
#endif
#endif
}
-EXPORT_SYMBOL(__arch_hweight64);
+EXPORT_SYMBOL(__sw_hweight64);
diff --git a/scripts/Makefile.lib b/scripts/Makefile.lib
index f9bdf26..cbcd654 100644
--- a/scripts/Makefile.lib
+++ b/scripts/Makefile.lib
@@ -245,3 +245,7 @@ quiet_cmd_lzo = LZO $@
cmd_lzo = (cat $(filter-out FORCE,$^) | \
lzop -9 && $(call size_append, $(filter-out FORCE,$^))) > $@ || \
(rm -f $@ ; false)
+
+# misc stuff
+# ---------------------------------------------------------------------------
+quote:="
--
1.6.4.2
--
Regards/Gruss,
Boris.
-
Advanced Micro Devices, Inc.
Operating Systems Research Center
^ permalink raw reply related [flat|nested] 37+ messages in thread
* Re: [PATCH] x86: Add optimized popcnt variants
2010-02-23 19:06 ` Borislav Petkov
@ 2010-02-26 5:27 ` H. Peter Anvin
2010-02-26 7:47 ` Borislav Petkov
0 siblings, 1 reply; 37+ messages in thread
From: H. Peter Anvin @ 2010-02-26 5:27 UTC (permalink / raw)
To: Borislav Petkov
Cc: Michal Marek, linux-kbuild, Peter Zijlstra, Andrew Morton,
Wu Fengguang, LKML, Jamie Lokier, Roland Dreier, Al Viro,
linux-fsdevel@vger.kernel.org, Ingo Molnar, Brian Gerst
On 02/23/2010 11:06 AM, Borislav Petkov wrote:
>
> Ok, here you go.
>
> --
> From: Borislav Petkov <borislav.petkov@amd.com>
> Date: Thu, 11 Feb 2010 00:48:31 +0100
> Subject: [PATCH] x86: Add optimized popcnt variants
>
> Add support for the hardware version of the Hamming weight function,
> popcnt, present in CPUs which advertise it under CPUID, Function
> 0x0000_0001_ECX[23]. On CPUs which don't support it, we fall back to the
> default software versions in lib/hweight.c.
>
> A synthetic benchmark comparing popcnt with __sw_hweight64 showed almost
> a 3x speedup on an F10h machine.
>
> Signed-off-by: Borislav Petkov <borislav.petkov@amd.com>
OK, this patch looks pretty good now, but I'm completely lost as to what
the baseline of this patch is supposed to be.
-hpa
--
H. Peter Anvin, Intel Open Source Technology Center
I work for Intel. I don't speak on their behalf.
^ permalink raw reply [flat|nested] 37+ messages in thread
* Re: [PATCH] x86: Add optimized popcnt variants
2010-02-26 5:27 ` H. Peter Anvin
@ 2010-02-26 7:47 ` Borislav Petkov
2010-02-26 17:48 ` H. Peter Anvin
0 siblings, 1 reply; 37+ messages in thread
From: Borislav Petkov @ 2010-02-26 7:47 UTC (permalink / raw)
To: H. Peter Anvin
Cc: Borislav Petkov, Michal Marek, linux-kbuild, Peter Zijlstra,
Andrew Morton, Wu Fengguang, LKML, Jamie Lokier, Roland Dreier,
Al Viro, linux-fsdevel@vger.kernel.org, Ingo Molnar, Brian Gerst
From: "H. Peter Anvin" <hpa@zytor.com>
Date: Thu, Feb 25, 2010 at 09:27:25PM -0800
> OK, this patch looks pretty good now, but I'm completely lost as to
> what the baseline of this patch is supposed to be.
Yeah, this is based on PeterZ's http://lkml.org/lkml/2010/2/4/119
But I'm not sure which tree has it...
--
Regards/Gruss,
Boris.
-
Advanced Micro Devices, Inc.
Operating Systems Research Center
^ permalink raw reply [flat|nested] 37+ messages in thread
* Re: [PATCH] x86: Add optimized popcnt variants
2010-02-26 7:47 ` Borislav Petkov
@ 2010-02-26 17:48 ` H. Peter Anvin
2010-02-27 8:28 ` Borislav Petkov
0 siblings, 1 reply; 37+ messages in thread
From: H. Peter Anvin @ 2010-02-26 17:48 UTC (permalink / raw)
To: Borislav Petkov
Cc: Michal Marek, linux-kbuild, Peter Zijlstra, Andrew Morton,
Wu Fengguang, LKML, Jamie Lokier, Roland Dreier, Al Viro,
linux-fsdevel@vger.kernel.org, Ingo Molnar, Brian Gerst
On 02/25/2010 11:47 PM, Borislav Petkov wrote:
> From: "H. Peter Anvin" <hpa@zytor.com>
> Date: Thu, Feb 25, 2010 at 09:27:25PM -0800
>
>> OK, this patch looks pretty good now, but I'm completely lost as to
>> what the baseline of this patch is supposed to be.
>
> Yeah, this is based on PeterZ's http://lkml.org/lkml/2010/2/4/119
>
> But I'm not sure which tree has it...
>
Looks like -mm, which really means that either Andrew has to take your
patch, too, or we have to wait until that is upstream until we can merge
your patch.
I'm a little nervous about just acking the patch and telling Andrew to
test it, because I don't know what the fallout would look like. I'm
particularly concerned about gcc version dependencies.
I guess, on the other hand, if it ends up not getting merged until .35
it's not a huge deal either.
-hpa
--
H. Peter Anvin, Intel Open Source Technology Center
I work for Intel. I don't speak on their behalf.
^ permalink raw reply [flat|nested] 37+ messages in thread
* Re: [PATCH] x86: Add optimized popcnt variants
2010-02-26 17:48 ` H. Peter Anvin
@ 2010-02-27 8:28 ` Borislav Petkov
2010-02-27 20:00 ` H. Peter Anvin
0 siblings, 1 reply; 37+ messages in thread
From: Borislav Petkov @ 2010-02-27 8:28 UTC (permalink / raw)
To: H. Peter Anvin
Cc: Borislav Petkov, Michal Marek, linux-kbuild, Peter Zijlstra,
Andrew Morton, Wu Fengguang, LKML, Jamie Lokier, Roland Dreier,
Al Viro, linux-fsdevel@vger.kernel.org, Ingo Molnar, Brian Gerst
From: "H. Peter Anvin" <hpa@zytor.com>
Date: Fri, Feb 26, 2010 at 09:48:46AM -0800
> Looks like -mm, which really means that either Andrew has to take your
> patch, too, or we have to wait until that is upstream until we can merge
> your patch.
>
> I'm a little nervous about just acking the patch and telling Andrew to
> test it, because I don't know what the fallout would look like. I'm
> particularly concerned about gcc version dependencies.
I have the same concern. Actually, I'd be much more at ease if it saw a
bit of wider testing without hitting mainline just yet. I'll try to give
it some more testing with the machines and toolchains I can get my hands
on next week.
> I guess, on the other hand, if it ends up not getting merged until .35
> it's not a huge deal either.
Yeah, let's give it another round of testing and queue it for .35 -
AFAIR Ingo runs also a wide testing effort so it spending another cycle
in -tip and being hammered on by us could give us a bit more certainty.
Thanks.
--
Regards/Gruss,
Boris.
-
Advanced Micro Devices, Inc.
Operating Systems Research Center
^ permalink raw reply [flat|nested] 37+ messages in thread
* Re: [PATCH] x86: Add optimized popcnt variants
2010-02-27 8:28 ` Borislav Petkov
@ 2010-02-27 20:00 ` H. Peter Anvin
2010-03-09 15:36 ` Borislav Petkov
` (3 more replies)
0 siblings, 4 replies; 37+ messages in thread
From: H. Peter Anvin @ 2010-02-27 20:00 UTC (permalink / raw)
To: Borislav Petkov
Cc: Michal Marek, linux-kbuild, Peter Zijlstra, Andrew Morton,
Wu Fengguang, LKML, Jamie Lokier, Roland Dreier, Al Viro,
linux-fsdevel@vger.kernel.org, Ingo Molnar, Brian Gerst
On 02/27/2010 12:28 AM, Borislav Petkov wrote:
>
>> I guess, on the other hand, if it ends up not getting merged until .35
>> it's not a huge deal either.
>
> Yeah, let's give it another round of testing and queue it for .35 -
> AFAIR Ingo runs also a wide testing effort so it spending another cycle
> in -tip and being hammered on by us could give us a bit more certainty.
>
Yes, if we can get into -tip then we'll get more test coverage, so I'll
queue it up for .35 as soon as the merge window closes. Please remind
me if I forget.
-hpa
--
H. Peter Anvin, Intel Open Source Technology Center
I work for Intel. I don't speak on their behalf.
^ permalink raw reply [flat|nested] 37+ messages in thread
* Re: [PATCH] x86: Add optimized popcnt variants
2010-02-27 20:00 ` H. Peter Anvin
@ 2010-03-09 15:36 ` Borislav Petkov
2010-03-09 15:50 ` Peter Zijlstra
2010-03-18 11:17 ` Borislav Petkov
` (2 subsequent siblings)
3 siblings, 1 reply; 37+ messages in thread
From: Borislav Petkov @ 2010-03-09 15:36 UTC (permalink / raw)
To: Peter Zijlstra
Cc: H. Peter Anvin, Michal Marek, linux-kbuild, Andrew Morton,
Wu Fengguang, LKML, Jamie Lokier, Roland Dreier, Al Viro,
linux-fsdevel@vger.kernel.org, Ingo Molnar, Brian Gerst
> On 02/27/2010 12:28 AM, Borislav Petkov wrote:
> > Yeah, let's give it another round of testing and queue it for .35 -
> > AFAIR Ingo runs also a wide testing effort so it spending another cycle
> > in -tip and being hammered on by us could give us a bit more certainty.
> >
>
> Yes, if we can get into -tip then we'll get more test coverage, so I'll
> queue it up for .35 as soon as the merge window closes. Please remind
> me if I forget.
Hi Peter,
I see that you've added the HWEIGHT-capitalized interfaces for
compile-time constants with fce877e3. Which means, the bits in
<include/asm-generic/bitops/const_hweight.h> from your original patch at
http://lkml.org/lkml/2010/2/4/119 need changing (or have changed already
but I've missed them).
IOW, where can I get the current version of that last patch so that I can
base my __arch_hweightXX stuff on top of it for testing?
Thanks.
--
Regards/Gruss,
Boris.
-
Advanced Micro Devices, Inc.
Operating Systems Research Center
^ permalink raw reply [flat|nested] 37+ messages in thread
* Re: [PATCH] x86: Add optimized popcnt variants
2010-03-09 15:36 ` Borislav Petkov
@ 2010-03-09 15:50 ` Peter Zijlstra
2010-03-09 16:23 ` Borislav Petkov
0 siblings, 1 reply; 37+ messages in thread
From: Peter Zijlstra @ 2010-03-09 15:50 UTC (permalink / raw)
To: Borislav Petkov
Cc: H. Peter Anvin, Michal Marek, linux-kbuild, Andrew Morton,
Wu Fengguang, LKML, Jamie Lokier, Roland Dreier, Al Viro,
linux-fsdevel@vger.kernel.org, Ingo Molnar, Brian Gerst
On Tue, 2010-03-09 at 16:36 +0100, Borislav Petkov wrote:
> > On 02/27/2010 12:28 AM, Borislav Petkov wrote:
> > > Yeah, let's give it another round of testing and queue it for .35 -
> > > AFAIR Ingo runs also a wide testing effort so it spending another cycle
> > > in -tip and being hammered on by us could give us a bit more certainty.
> > >
> >
> > Yes, if we can get into -tip then we'll get more test coverage, so I'll
> > queue it up for .35 as soon as the merge window closes. Please remind
> > me if I forget.
>
> Hi Peter,
>
> I see that you've added the HWEIGHT-capitalized interfaces for
> compile-time constants with fce877e3. Which means, the bits in
> <include/asm-generic/bitops/const_hweight.h> from your original patch at
> http://lkml.org/lkml/2010/2/4/119 need changing (or have changed already
> but I've missed them).
>
> IOW, where can I get the current version of that last patch so that I can
> base my __arch_hweightXX stuff ontop of it for testing?
Should all be fine as it is, that patch
( http://lkml.org/lkml/2010/2/4/119 ) is against a kernel with fce877e3
in, I've just checked and it still applies to tip/master as of this
writing (although it grew a single 2 line offset for 1 hunk).
^ permalink raw reply [flat|nested] 37+ messages in thread
* Re: [PATCH] x86: Add optimized popcnt variants
2010-03-09 15:50 ` Peter Zijlstra
@ 2010-03-09 16:23 ` Borislav Petkov
2010-03-09 16:32 ` Peter Zijlstra
0 siblings, 1 reply; 37+ messages in thread
From: Borislav Petkov @ 2010-03-09 16:23 UTC (permalink / raw)
To: Peter Zijlstra
Cc: H. Peter Anvin, Michal Marek, linux-kbuild, Andrew Morton,
Wu Fengguang, LKML, Jamie Lokier, Roland Dreier, Al Viro,
linux-fsdevel@vger.kernel.org, Ingo Molnar, Brian Gerst
From: Peter Zijlstra <peterz@infradead.org>
Date: Tue, Mar 09, 2010 at 04:50:40PM +0100
> Should all be fine as it is, that patch
> ( http://lkml.org/lkml/2010/2/4/119 ) is against a kernel with fce877e3
> in, I've just checked and it still applies to tip/master as of this
> writing (although it grew a single 2 line offset for 1 hunk).
Well, this way, I'm getting
...
In file included from include/linux/kernel.h:15,
from /home/linux-2.6/arch/x86/include/asm/percpu.h:45,
from /home/linux-2.6/arch/x86/include/asm/current.h:5,
from /home/linux-2.6/arch/x86/include/asm/processor.h:15,
from /home/linux-2.6/arch/x86/include/asm/atomic.h:6,
from include/linux/crypto.h:20,
from arch/x86/kernel/asm-offsets_64.c:8,
from arch/x86/kernel/asm-offsets.c:4:
include/linux/bitops.h:52:1: warning: "HWEIGHT8" redefined
...
because we have multiple definitions of HWEIGHT*:
The one batch is in <include/linux/bitops.h>, introduced by fce877e3.
The other is in <include/asm-generic/bitops/const_hweight.h>, which
is pulled into <include/linux/bitops.h> through "#include
<asm/bitops.h>", which, in turn, includes <asm/arch_hweight.h> and
<include/asm-generic/bitops/const_hweight.h>.
The obvious resolution is to remove the HWEIGHT* batch from
<include/asm-generic/bitops/const_hweight.h> since they're functionally
identical to the ones in <include/linux/bitops.h>, no?
--
Regards/Gruss,
Boris.
-
Advanced Micro Devices, Inc.
Operating Systems Research Center
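The warning class itself is trivial to reproduce in isolation; two
differing definitions suffice, since an identical redefinition would be
legal:

	#define HWEIGHT8(w) (0)
	#define HWEIGHT8(w) (1)		/* warning: "HWEIGHT8" redefined */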
^ permalink raw reply [flat|nested] 37+ messages in thread
* Re: [PATCH] x86: Add optimized popcnt variants
2010-03-09 16:23 ` Borislav Petkov
@ 2010-03-09 16:32 ` Peter Zijlstra
2010-03-09 17:32 ` Borislav Petkov
0 siblings, 1 reply; 37+ messages in thread
From: Peter Zijlstra @ 2010-03-09 16:32 UTC (permalink / raw)
To: Borislav Petkov
Cc: H. Peter Anvin, Michal Marek, linux-kbuild, Andrew Morton,
Wu Fengguang, LKML, Jamie Lokier, Roland Dreier, Al Viro,
linux-fsdevel@vger.kernel.org, Ingo Molnar, Brian Gerst
On Tue, 2010-03-09 at 17:23 +0100, Borislav Petkov wrote:
> From: Peter Zijlstra <peterz@infradead.org>
> Date: Tue, Mar 09, 2010 at 04:50:40PM +0100
>
> > Should all be fine as it is, that patch
> > ( http://lkml.org/lkml/2010/2/4/119 ) is against a kernel with fce877e3
> > in, I've just checked and it still applies to tip/master as of this
> > writing (although it grew a single 2 line offset for 1 hunk).
>
> Well, this way, I'm getting
>
> ...
> In file included from include/linux/kernel.h:15,
> from /home/linux-2.6/arch/x86/include/asm/percpu.h:45,
> from /home/linux-2.6/arch/x86/include/asm/current.h:5,
> from /home/linux-2.6/arch/x86/include/asm/processor.h:15,
> from /home/linux-2.6/arch/x86/include/asm/atomic.h:6,
> from include/linux/crypto.h:20,
> from arch/x86/kernel/asm-offsets_64.c:8,
> from arch/x86/kernel/asm-offsets.c:4:
> include/linux/bitops.h:52:1: warning: "HWEIGHT8" redefined
> ...
>
> due to the fact that we have multiple definitions of HWEIGHT*:
>
> The one batch is in <include/linux/bitops.h> introduced by fce877e3.
>
> The other is in <include/asm-generic/bitops/const_hweight.h> which
> is pulled in into <include/linux/bitops.h> through "#include
> <asm/bitops.h>", which, in turn, <includes asm/arch_hweight.h> and
> <include/asm-generic/bitops/const_hweight.h>.
>
> The obvious resolution is to remove the HWEIGHT* batch from
> <include/asm-generic/bitops/const_hweight.h> since they're functionally
> identical with the ones in <include/linux/bitops.h>, no?
I thought the patch did that, see this hunk (straight from
http://lkml.org/lkml/2010/2/4/119 ):
---
Index: linux-2.6/include/linux/bitops.h
===================================================================
--- linux-2.6.orig/include/linux/bitops.h
+++ linux-2.6/include/linux/bitops.h
@@ -45,31 +45,6 @@ static inline unsigned long hweight_long
return sizeof(w) == 4 ? hweight32(w) : hweight64(w);
}
-/*
- * Clearly slow versions of the hweightN() functions, their benefit is
- * of course compile time evaluation of constant arguments.
- */
-#define HWEIGHT8(w) \
- ( BUILD_BUG_ON_ZERO(!__builtin_constant_p(w)) + \
- (!!((w) & (1ULL << 0))) + \
- (!!((w) & (1ULL << 1))) + \
- (!!((w) & (1ULL << 2))) + \
- (!!((w) & (1ULL << 3))) + \
- (!!((w) & (1ULL << 4))) + \
- (!!((w) & (1ULL << 5))) + \
- (!!((w) & (1ULL << 6))) + \
- (!!((w) & (1ULL << 7))) )
-
-#define HWEIGHT16(w) (HWEIGHT8(w) + HWEIGHT8((w) >> 8))
-#define HWEIGHT32(w) (HWEIGHT16(w) + HWEIGHT16((w) >> 16))
-#define HWEIGHT64(w) (HWEIGHT32(w) + HWEIGHT32((w) >> 32))
-
-/*
- * Type invariant version that simply casts things to the
- * largest type.
- */
-#define HWEIGHT(w) HWEIGHT64((u64)(w))
-
/**
* rol32 - rotate a 32-bit value left
* @word: value to rotate
^ permalink raw reply [flat|nested] 37+ messages in thread
* Re: [PATCH] x86: Add optimized popcnt variants
2010-03-09 16:32 ` Peter Zijlstra
@ 2010-03-09 17:32 ` Borislav Petkov
2010-03-09 17:37 ` Peter Zijlstra
0 siblings, 1 reply; 37+ messages in thread
From: Borislav Petkov @ 2010-03-09 17:32 UTC (permalink / raw)
To: Peter Zijlstra
Cc: H. Peter Anvin, Michal Marek, linux-kbuild, Andrew Morton,
Wu Fengguang, LKML, Jamie Lokier, Roland Dreier, Al Viro,
linux-fsdevel@vger.kernel.org, Ingo Molnar, Brian Gerst
From: Peter Zijlstra <peterz@infradead.org>
Date: Tue, Mar 09, 2010 at 05:32:49PM +0100
> I thought the patch did that, see this hunk (straight from
> http://lkml.org/lkml/2010/2/4/119 ):
Bollocks, I seem to have lost that hunk while applying the patch by
hand, sorry for the noise.
By the way, I can't seem to find your patch in Andrew's tree, is it
still going through his tree?
--
Regards/Gruss,
Boris.
-
Advanced Micro Devices, Inc.
Operating Systems Research Center
^ permalink raw reply [flat|nested] 37+ messages in thread
* Re: [PATCH] x86: Add optimized popcnt variants
2010-03-09 17:32 ` Borislav Petkov
@ 2010-03-09 17:37 ` Peter Zijlstra
0 siblings, 0 replies; 37+ messages in thread
From: Peter Zijlstra @ 2010-03-09 17:37 UTC (permalink / raw)
To: Borislav Petkov
Cc: H. Peter Anvin, Michal Marek, linux-kbuild, Andrew Morton,
Wu Fengguang, LKML, Jamie Lokier, Roland Dreier, Al Viro,
linux-fsdevel@vger.kernel.org, Ingo Molnar, Brian Gerst
On Tue, 2010-03-09 at 18:32 +0100, Borislav Petkov wrote:
> From: Peter Zijlstra <peterz@infradead.org>
> Date: Tue, Mar 09, 2010 at 05:32:49PM +0100
>
> > I thought the patch did that, see this hunk (straight from
> > http://lkml.org/lkml/2010/2/4/119 ):
>
> Bollocks, I seem to have lost that hunk while applying the patch by
> foot, sorry for the noise.
>
> By the way, I can't seem to find your patch in Andrew's tree, is it
> still going through his tree?
I hope so, Andrew, need me to resend or do you still have a copy?
^ permalink raw reply [flat|nested] 37+ messages in thread
* Re: [PATCH] x86: Add optimized popcnt variants
2010-02-27 20:00 ` H. Peter Anvin
2010-03-09 15:36 ` Borislav Petkov
@ 2010-03-18 11:17 ` Borislav Petkov
2010-03-18 11:19 ` [PATCH 1/2] bitops: Optimize hweight() by making use of compile-time evaluation Borislav Petkov
2010-03-18 11:20 ` [PATCH 2/2] x86: Add optimized popcnt variants Borislav Petkov
3 siblings, 0 replies; 37+ messages in thread
From: Borislav Petkov @ 2010-03-18 11:17 UTC (permalink / raw)
To: H. Peter Anvin
Cc: Michal Marek, linux-kbuild, Peter Zijlstra, Andrew Morton,
Wu Fengguang, LKML, Jamie Lokier, Roland Dreier, Al Viro,
linux-fsdevel@vger.kernel.org, Ingo Molnar, Brian Gerst
From: "H. Peter Anvin" <hpa@zytor.com>
Date: Sat, Feb 27, 2010 at 12:00:26PM -0800
> On 02/27/2010 12:28 AM, Borislav Petkov wrote:
> >
> >> I guess, on the other hand, if it ends up not getting merged until .35
> >> it's not a huge deal either.
> >
> > Yeah, let's give it another round of testing and queue it for .35 -
> > AFAIR Ingo runs also a wide testing effort so it spending another cycle
> > in -tip and being hammered on by us could give us a bit more certainty.
> >
>
> Yes, if we can get into -tip then we'll get more test coverage, so I'll
> queue it up for .35 as soon as the merge window closes. Please remind
> me if I forget.
Ok, I've been pretty busy lately and this got pushed back on the todo
list. I finally got around to doing some build-testing with a bunch of
compilers, and the -fcall-saved* switches seem to get accepted. I haven't
stared at their asm output yet, though:
command:
make CC=gcc-<version> HOSTCC=gcc-<version> -j4
compile stats (64bit only):
not ok:
- gcc-3.3 (GCC) 3.3.5 (Debian 1:3.3.5-13): OOM KILLER goes off, gcc-3.3 leak maybe
ok:
- gcc-3.4 (GCC) 3.4.4 20050314 (prerelease) (Debian 3.4.3-13sarge1)
- gcc-4.1 (GCC) 4.1.3 20080704 (prerelease) (Debian 4.1.2-27)
- gcc-4.3 (Debian 4.3.4-6) 4.3.4
- gcc (Debian 4.4.2-6) 4.4.2
- gcc (Debian 4.4.3-3) 4.4.3
- gcc-3.4.6 (GCC) 3.4.6 (Gentoo 3.4.6-r2 p1.6, ssp-3.4.6-1.0, pie-8.7.10)
- gcc-4.1.2 (GCC) 4.1.2 (Gentoo 4.1.2 p1.3)
I'm attaching the versions of the patches I'm using. The first one by
PeterZ touches a bunch of arches and Andrew hasn't picked it up yet so
the question of getting the second (popcnt) patch to see wider testing
in some tree is still unresolved. Suggestions, ideas?
Thanks.
--
Regards/Gruss,
Boris.
--
Advanced Micro Devices, Inc.
Operating Systems Research Center
^ permalink raw reply [flat|nested] 37+ messages in thread
* [PATCH 1/2] bitops: Optimize hweight() by making use of compile-time evaluation
2010-02-27 20:00 ` H. Peter Anvin
2010-03-09 15:36 ` Borislav Petkov
2010-03-18 11:17 ` Borislav Petkov
@ 2010-03-18 11:19 ` Borislav Petkov
2010-03-18 11:20 ` [PATCH 2/2] x86: Add optimized popcnt variants Borislav Petkov
3 siblings, 0 replies; 37+ messages in thread
From: Borislav Petkov @ 2010-03-18 11:19 UTC (permalink / raw)
To: H. Peter Anvin
Cc: Michal Marek, linux-kbuild, Peter Zijlstra, Andrew Morton,
Wu Fengguang, LKML, Jamie Lokier, Roland Dreier, Al Viro,
linux-fsdevel@vger.kernel.org, Ingo Molnar, Brian Gerst
From: Peter Zijlstra <a.p.zijlstra@chello.nl>
Date: Mon, 1 Feb 2010 15:03:07 +0100
Subject: [PATCH 1/2] bitops: Optimize hweight() by making use of compile-time evaluation
Rename the existing runtime hweight() implementations to
__arch_hweight(), rename the compile-time versions to __const_hweight()
and then have hweight() pick between them.
Suggested-by: H. Peter Anvin <hpa@zytor.com>
Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
Acked-by: H. Peter Anvin <hpa@zytor.com>
LKML-Reference: <1265028224.24455.154.camel@laptop>
---
arch/alpha/include/asm/bitops.h | 18 ++++++-----
arch/ia64/include/asm/bitops.h | 11 ++++---
arch/sparc/include/asm/bitops_64.h | 11 ++++---
include/asm-generic/bitops/arch_hweight.h | 11 +++++++
include/asm-generic/bitops/const_hweight.h | 42 ++++++++++++++++++++++++++++
include/asm-generic/bitops/hweight.h | 8 +----
include/linux/bitops.h | 25 ----------------
lib/hweight.c | 19 ++++++------
8 files changed, 87 insertions(+), 58 deletions(-)
create mode 100644 include/asm-generic/bitops/arch_hweight.h
create mode 100644 include/asm-generic/bitops/const_hweight.h
diff --git a/arch/alpha/include/asm/bitops.h b/arch/alpha/include/asm/bitops.h
index 15f3ae2..296da1d 100644
--- a/arch/alpha/include/asm/bitops.h
+++ b/arch/alpha/include/asm/bitops.h
@@ -405,29 +405,31 @@ static inline int fls(int x)
#if defined(CONFIG_ALPHA_EV6) && defined(CONFIG_ALPHA_EV67)
/* Whee. EV67 can calculate it directly. */
-static inline unsigned long hweight64(unsigned long w)
+static inline unsigned long __arch_hweight64(unsigned long w)
{
return __kernel_ctpop(w);
}
-static inline unsigned int hweight32(unsigned int w)
+static inline unsigned int __arch_hweight32(unsigned int w)
{
- return hweight64(w);
+ return __arch_hweight64(w);
}
-static inline unsigned int hweight16(unsigned int w)
+static inline unsigned int __arch_hweight16(unsigned int w)
{
- return hweight64(w & 0xffff);
+ return __arch_hweight64(w & 0xffff);
}
-static inline unsigned int hweight8(unsigned int w)
+static inline unsigned int __arch_hweight8(unsigned int w)
{
- return hweight64(w & 0xff);
+ return __arch_hweight64(w & 0xff);
}
#else
-#include <asm-generic/bitops/hweight.h>
+#include <asm-generic/bitops/arch_hweight.h>
#endif
+#include <asm-generic/bitops/const_hweight.h>
+
#endif /* __KERNEL__ */
#include <asm-generic/bitops/find.h>
diff --git a/arch/ia64/include/asm/bitops.h b/arch/ia64/include/asm/bitops.h
index 6ebc229..9da3df6 100644
--- a/arch/ia64/include/asm/bitops.h
+++ b/arch/ia64/include/asm/bitops.h
@@ -437,17 +437,18 @@ __fls (unsigned long x)
* hweightN: returns the hamming weight (i.e. the number
* of bits set) of a N-bit word
*/
-static __inline__ unsigned long
-hweight64 (unsigned long x)
+static __inline__ unsigned long __arch_hweight64(unsigned long x)
{
unsigned long result;
result = ia64_popcnt(x);
return result;
}
-#define hweight32(x) (unsigned int) hweight64((x) & 0xfffffffful)
-#define hweight16(x) (unsigned int) hweight64((x) & 0xfffful)
-#define hweight8(x) (unsigned int) hweight64((x) & 0xfful)
+#define __arch_hweight32(x) ((unsigned int) __arch_hweight64((x) & 0xfffffffful))
+#define __arch_hweight16(x) ((unsigned int) __arch_hweight64((x) & 0xfffful))
+#define __arch_hweight8(x) ((unsigned int) __arch_hweight64((x) & 0xfful))
+
+#include <asm-generic/bitops/const_hweight.h>
#endif /* __KERNEL__ */
diff --git a/arch/sparc/include/asm/bitops_64.h b/arch/sparc/include/asm/bitops_64.h
index e72ac9c..766121a 100644
--- a/arch/sparc/include/asm/bitops_64.h
+++ b/arch/sparc/include/asm/bitops_64.h
@@ -44,7 +44,7 @@ extern void change_bit(unsigned long nr, volatile unsigned long *addr);
#ifdef ULTRA_HAS_POPULATION_COUNT
-static inline unsigned int hweight64(unsigned long w)
+static inline unsigned int __arch_hweight64(unsigned long w)
{
unsigned int res;
@@ -52,7 +52,7 @@ static inline unsigned int hweight64(unsigned long w)
return res;
}
-static inline unsigned int hweight32(unsigned int w)
+static inline unsigned int __arch_hweight32(unsigned int w)
{
unsigned int res;
@@ -60,7 +60,7 @@ static inline unsigned int hweight32(unsigned int w)
return res;
}
-static inline unsigned int hweight16(unsigned int w)
+static inline unsigned int __arch_hweight16(unsigned int w)
{
unsigned int res;
@@ -68,7 +68,7 @@ static inline unsigned int hweight16(unsigned int w)
return res;
}
-static inline unsigned int hweight8(unsigned int w)
+static inline unsigned int __arch_hweight8(unsigned int w)
{
unsigned int res;
@@ -78,9 +78,10 @@ static inline unsigned int hweight8(unsigned int w)
#else
-#include <asm-generic/bitops/hweight.h>
+#include <asm-generic/bitops/arch_hweight.h>
#endif
+#include <asm-generic/bitops/const_hweight.h>
#include <asm-generic/bitops/lock.h>
#endif /* __KERNEL__ */
diff --git a/include/asm-generic/bitops/arch_hweight.h b/include/asm-generic/bitops/arch_hweight.h
new file mode 100644
index 0000000..3a7be84
--- /dev/null
+++ b/include/asm-generic/bitops/arch_hweight.h
@@ -0,0 +1,11 @@
+#ifndef _ASM_GENERIC_BITOPS_ARCH_HWEIGHT_H_
+#define _ASM_GENERIC_BITOPS_ARCH_HWEIGHT_H_
+
+#include <asm/types.h>
+
+extern unsigned int __arch_hweight32(unsigned int w);
+extern unsigned int __arch_hweight16(unsigned int w);
+extern unsigned int __arch_hweight8(unsigned int w);
+extern unsigned long __arch_hweight64(__u64 w);
+
+#endif /* _ASM_GENERIC_BITOPS_HWEIGHT_H_ */
diff --git a/include/asm-generic/bitops/const_hweight.h b/include/asm-generic/bitops/const_hweight.h
new file mode 100644
index 0000000..fa2a50b
--- /dev/null
+++ b/include/asm-generic/bitops/const_hweight.h
@@ -0,0 +1,42 @@
+#ifndef _ASM_GENERIC_BITOPS_CONST_HWEIGHT_H_
+#define _ASM_GENERIC_BITOPS_CONST_HWEIGHT_H_
+
+/*
+ * Compile time versions of __arch_hweightN()
+ */
+#define __const_hweight8(w) \
+ ( (!!((w) & (1ULL << 0))) + \
+ (!!((w) & (1ULL << 1))) + \
+ (!!((w) & (1ULL << 2))) + \
+ (!!((w) & (1ULL << 3))) + \
+ (!!((w) & (1ULL << 4))) + \
+ (!!((w) & (1ULL << 5))) + \
+ (!!((w) & (1ULL << 6))) + \
+ (!!((w) & (1ULL << 7))) )
+
+#define __const_hweight16(w) (__const_hweight8(w) + __const_hweight8((w) >> 8 ))
+#define __const_hweight32(w) (__const_hweight16(w) + __const_hweight16((w) >> 16))
+#define __const_hweight64(w) (__const_hweight32(w) + __const_hweight32((w) >> 32))
+
+/*
+ * Generic interface.
+ */
+#define hweight8(w) (__builtin_constant_p(w) ? __const_hweight8(w) : __arch_hweight8(w))
+#define hweight16(w) (__builtin_constant_p(w) ? __const_hweight16(w) : __arch_hweight16(w))
+#define hweight32(w) (__builtin_constant_p(w) ? __const_hweight32(w) : __arch_hweight32(w))
+#define hweight64(w) (__builtin_constant_p(w) ? __const_hweight64(w) : __arch_hweight64(w))
+
+/*
+ * Interface for known constant arguments
+ */
+#define HWEIGHT8(w) (BUILD_BUG_ON_ZERO(!__builtin_constant_p(w)) + __const_hweight8(w))
+#define HWEIGHT16(w) (BUILD_BUG_ON_ZERO(!__builtin_constant_p(w)) + __const_hweight16(w))
+#define HWEIGHT32(w) (BUILD_BUG_ON_ZERO(!__builtin_constant_p(w)) + __const_hweight32(w))
+#define HWEIGHT64(w) (BUILD_BUG_ON_ZERO(!__builtin_constant_p(w)) + __const_hweight64(w))
+
+/*
+ * Type invariant interface to the compile time constant hweight functions.
+ */
+#define HWEIGHT(w) HWEIGHT64((u64)(w))
+
+#endif /* _ASM_GENERIC_BITOPS_CONST_HWEIGHT_H_ */
diff --git a/include/asm-generic/bitops/hweight.h b/include/asm-generic/bitops/hweight.h
index fbbc383..a94d651 100644
--- a/include/asm-generic/bitops/hweight.h
+++ b/include/asm-generic/bitops/hweight.h
@@ -1,11 +1,7 @@
#ifndef _ASM_GENERIC_BITOPS_HWEIGHT_H_
#define _ASM_GENERIC_BITOPS_HWEIGHT_H_
-#include <asm/types.h>
-
-extern unsigned int hweight32(unsigned int w);
-extern unsigned int hweight16(unsigned int w);
-extern unsigned int hweight8(unsigned int w);
-extern unsigned long hweight64(__u64 w);
+#include <asm-generic/bitops/arch_hweight.h>
+#include <asm-generic/bitops/const_hweight.h>
#endif /* _ASM_GENERIC_BITOPS_HWEIGHT_H_ */
diff --git a/include/linux/bitops.h b/include/linux/bitops.h
index b793898..c55d5bc 100644
--- a/include/linux/bitops.h
+++ b/include/linux/bitops.h
@@ -47,31 +47,6 @@ static inline unsigned long hweight_long(unsigned long w)
return sizeof(w) == 4 ? hweight32(w) : hweight64(w);
}
-/*
- * Clearly slow versions of the hweightN() functions, their benefit is
- * of course compile time evaluation of constant arguments.
- */
-#define HWEIGHT8(w) \
- ( BUILD_BUG_ON_ZERO(!__builtin_constant_p(w)) + \
- (!!((w) & (1ULL << 0))) + \
- (!!((w) & (1ULL << 1))) + \
- (!!((w) & (1ULL << 2))) + \
- (!!((w) & (1ULL << 3))) + \
- (!!((w) & (1ULL << 4))) + \
- (!!((w) & (1ULL << 5))) + \
- (!!((w) & (1ULL << 6))) + \
- (!!((w) & (1ULL << 7))) )
-
-#define HWEIGHT16(w) (HWEIGHT8(w) + HWEIGHT8((w) >> 8))
-#define HWEIGHT32(w) (HWEIGHT16(w) + HWEIGHT16((w) >> 16))
-#define HWEIGHT64(w) (HWEIGHT32(w) + HWEIGHT32((w) >> 32))
-
-/*
- * Type invariant version that simply casts things to the
- * largest type.
- */
-#define HWEIGHT(w) HWEIGHT64((u64)(w))
-
/**
* rol32 - rotate a 32-bit value left
* @word: value to rotate
diff --git a/lib/hweight.c b/lib/hweight.c
index 63ee4eb..a6927e7 100644
--- a/lib/hweight.c
+++ b/lib/hweight.c
@@ -9,7 +9,7 @@
* The Hamming Weight of a number is the total number of bits set in it.
*/
-unsigned int hweight32(unsigned int w)
+unsigned int __arch_hweight32(unsigned int w)
{
#ifdef ARCH_HAS_FAST_MULTIPLIER
w -= (w >> 1) & 0x55555555;
@@ -24,29 +24,30 @@ unsigned int hweight32(unsigned int w)
return (res + (res >> 16)) & 0x000000FF;
#endif
}
-EXPORT_SYMBOL(hweight32);
+EXPORT_SYMBOL(__arch_hweight32);
-unsigned int hweight16(unsigned int w)
+unsigned int __arch_hweight16(unsigned int w)
{
unsigned int res = w - ((w >> 1) & 0x5555);
res = (res & 0x3333) + ((res >> 2) & 0x3333);
res = (res + (res >> 4)) & 0x0F0F;
return (res + (res >> 8)) & 0x00FF;
}
-EXPORT_SYMBOL(hweight16);
+EXPORT_SYMBOL(__arch_hweight16);
-unsigned int hweight8(unsigned int w)
+unsigned int __arch_hweight8(unsigned int w)
{
unsigned int res = w - ((w >> 1) & 0x55);
res = (res & 0x33) + ((res >> 2) & 0x33);
return (res + (res >> 4)) & 0x0F;
}
-EXPORT_SYMBOL(hweight8);
+EXPORT_SYMBOL(__arch_hweight8);
-unsigned long hweight64(__u64 w)
+unsigned long __arch_hweight64(__u64 w)
{
#if BITS_PER_LONG == 32
- return hweight32((unsigned int)(w >> 32)) + hweight32((unsigned int)w);
+ return __arch_hweight32((unsigned int)(w >> 32)) +
+ __arch_hweight32((unsigned int)w);
#elif BITS_PER_LONG == 64
#ifdef ARCH_HAS_FAST_MULTIPLIER
w -= (w >> 1) & 0x5555555555555555ul;
@@ -63,4 +64,4 @@ unsigned long hweight64(__u64 w)
#endif
#endif
}
-EXPORT_SYMBOL(hweight64);
+EXPORT_SYMBOL(__arch_hweight64);
--
1.7.0.2
--
Regards/Gruss,
Boris.
--
Advanced Micro Devices, Inc.
Operating Systems Research Center
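To make the compile-time dispatch concrete, here is an illustrative
expansion of the generic interface on a constant argument; all of it
folds in the compiler and nothing survives to runtime:

	hweight8(0x2c)
	-> __builtin_constant_p(0x2c) ? __const_hweight8(0x2c)
	                              : __arch_hweight8(0x2c)
	-> (!!(0x2c & (1ULL << 0))) + (!!(0x2c & (1ULL << 1))) + ...
	-> 0 + 0 + 1 + 1 + 0 + 1 + 0 + 0 = 3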
^ permalink raw reply related [flat|nested] 37+ messages in thread
* [PATCH 2/2] x86: Add optimized popcnt variants
2010-02-27 20:00 ` H. Peter Anvin
` (2 preceding siblings ...)
2010-03-18 11:19 ` [PATCH 1/2] bitops: Optimize hweight() by making use of compile-time evaluation Borislav Petkov
@ 2010-03-18 11:20 ` Borislav Petkov
3 siblings, 0 replies; 37+ messages in thread
From: Borislav Petkov @ 2010-03-18 11:20 UTC (permalink / raw)
To: H. Peter Anvin
Cc: Michal Marek, linux-kbuild, Peter Zijlstra, Andrew Morton,
Wu Fengguang, LKML, Jamie Lokier, Roland Dreier, Al Viro,
linux-fsdevel@vger.kernel.org, Ingo Molnar, Brian Gerst
From: Borislav Petkov <borislav.petkov@amd.com>
Date: Fri, 5 Mar 2010 17:34:46 +0100
Subject: [PATCH 2/2] x86: Add optimized popcnt variants
Add support for the hardware version of the Hamming weight function,
popcnt, present in CPUs which advertise it under CPUID, Function
0x0000_0001_ECX[23]. On CPUs which don't support it, we fall back to the
default software versions in lib/hweight.c.
A synthetic benchmark comparing popcnt with __sw_hweight64 showed almost
a 3x speedup on an F10h machine.
Signed-off-by: Borislav Petkov <borislav.petkov@amd.com>
---
arch/x86/Kconfig | 5 ++
arch/x86/include/asm/alternative.h | 9 +++-
arch/x86/include/asm/arch_hweight.h | 59 +++++++++++++++++++++++++++++
arch/x86/include/asm/bitops.h | 4 +-
include/asm-generic/bitops/arch_hweight.h | 22 +++++++++--
lib/Makefile | 3 +
lib/hweight.c | 20 +++++-----
scripts/Makefile.lib | 4 ++
8 files changed, 108 insertions(+), 18 deletions(-)
create mode 100644 arch/x86/include/asm/arch_hweight.h
diff --git a/arch/x86/Kconfig b/arch/x86/Kconfig
index 0eacb1f..89d8c54 100644
--- a/arch/x86/Kconfig
+++ b/arch/x86/Kconfig
@@ -238,6 +238,11 @@ config X86_32_LAZY_GS
def_bool y
depends on X86_32 && !CC_STACKPROTECTOR
+config ARCH_HWEIGHT_CFLAGS
+ string
+ default "-fcall-saved-ecx -fcall-saved-edx" if X86_32
+ default "-fcall-saved-rdi -fcall-saved-rsi -fcall-saved-rdx -fcall-saved-rcx -fcall-saved-r8 -fcall-saved-r9 -fcall-saved-r10 -fcall-saved-r11" if X86_64
+
config KTIME_SCALAR
def_bool X86_32
source "init/Kconfig"
diff --git a/arch/x86/include/asm/alternative.h b/arch/x86/include/asm/alternative.h
index b09ec55..67dae51 100644
--- a/arch/x86/include/asm/alternative.h
+++ b/arch/x86/include/asm/alternative.h
@@ -39,9 +39,6 @@
#define LOCK_PREFIX ""
#endif
-/* This must be included *after* the definition of LOCK_PREFIX */
-#include <asm/cpufeature.h>
-
struct alt_instr {
u8 *instr; /* original instruction */
u8 *replacement;
@@ -96,6 +93,12 @@ static inline int alternatives_text_reserved(void *start, void *end)
".previous"
/*
+ * This must be included *after* the definition of ALTERNATIVE due to
+ * <asm/arch_hweight.h>
+ */
+#include <asm/cpufeature.h>
+
+/*
* Alternative instructions for different CPU types or capabilities.
*
* This allows to use optimized instructions even on generic binary
diff --git a/arch/x86/include/asm/arch_hweight.h b/arch/x86/include/asm/arch_hweight.h
new file mode 100644
index 0000000..d1fc3c2
--- /dev/null
+++ b/arch/x86/include/asm/arch_hweight.h
@@ -0,0 +1,59 @@
+#ifndef _ASM_X86_HWEIGHT_H
+#define _ASM_X86_HWEIGHT_H
+
+#ifdef CONFIG_64BIT
+/* popcnt %rdi, %rax */
+#define POPCNT ".byte 0xf3,0x48,0x0f,0xb8,0xc7"
+#define REG_IN "D"
+#define REG_OUT "a"
+#else
+/* popcnt %eax, %eax */
+#define POPCNT ".byte 0xf3,0x0f,0xb8,0xc0"
+#define REG_IN "a"
+#define REG_OUT "a"
+#endif
+
+/*
+ * __sw_hweightXX are called from within the alternatives below
+ * and callee-clobbered registers need to be taken care of. See
+ * ARCH_HWEIGHT_CFLAGS in <arch/x86/Kconfig> for the respective
+ * compiler switches.
+ */
+static inline unsigned int __arch_hweight32(unsigned int w)
+{
+ unsigned int res = 0;
+
+ asm (ALTERNATIVE("call __sw_hweight32", POPCNT, X86_FEATURE_POPCNT)
+ : "="REG_OUT (res)
+ : REG_IN (w));
+
+ return res;
+}
+
+static inline unsigned int __arch_hweight16(unsigned int w)
+{
+ return __arch_hweight32(w & 0xffff);
+}
+
+static inline unsigned int __arch_hweight8(unsigned int w)
+{
+ return __arch_hweight32(w & 0xff);
+}
+
+static inline unsigned long __arch_hweight64(__u64 w)
+{
+ unsigned long res = 0;
+
+#ifdef CONFIG_X86_32
+ return __arch_hweight32((u32)w) +
+ __arch_hweight32((u32)(w >> 32));
+#else
+ asm (ALTERNATIVE("call __sw_hweight64", POPCNT, X86_FEATURE_POPCNT)
+ : "="REG_OUT (res)
+ : REG_IN (w));
+#endif /* CONFIG_X86_32 */
+
+ return res;
+}
+
+#endif
diff --git a/arch/x86/include/asm/bitops.h b/arch/x86/include/asm/bitops.h
index 02b47a6..545776e 100644
--- a/arch/x86/include/asm/bitops.h
+++ b/arch/x86/include/asm/bitops.h
@@ -444,7 +444,9 @@ static inline int fls(int x)
#define ARCH_HAS_FAST_MULTIPLIER 1
-#include <asm-generic/bitops/hweight.h>
+#include <asm/arch_hweight.h>
+
+#include <asm-generic/bitops/const_hweight.h>
#endif /* __KERNEL__ */
diff --git a/include/asm-generic/bitops/arch_hweight.h b/include/asm-generic/bitops/arch_hweight.h
index 3a7be84..9a81c1e 100644
--- a/include/asm-generic/bitops/arch_hweight.h
+++ b/include/asm-generic/bitops/arch_hweight.h
@@ -3,9 +3,23 @@
#include <asm/types.h>
-extern unsigned int __arch_hweight32(unsigned int w);
-extern unsigned int __arch_hweight16(unsigned int w);
-extern unsigned int __arch_hweight8(unsigned int w);
-extern unsigned long __arch_hweight64(__u64 w);
+inline unsigned int __arch_hweight32(unsigned int w)
+{
+ return __sw_hweight32(w);
+}
+inline unsigned int __arch_hweight16(unsigned int w)
+{
+ return __sw_hweight16(w);
+}
+
+inline unsigned int __arch_hweight8(unsigned int w)
+{
+ return __sw_hweight8(w);
+}
+
+inline unsigned long __arch_hweight64(__u64 w)
+{
+ return __sw_hweight64(w);
+}
#endif /* _ASM_GENERIC_BITOPS_HWEIGHT_H_ */
diff --git a/lib/Makefile b/lib/Makefile
index 2e152ae..abe63a8 100644
--- a/lib/Makefile
+++ b/lib/Makefile
@@ -39,7 +39,10 @@ lib-$(CONFIG_RWSEM_XCHGADD_ALGORITHM) += rwsem.o
lib-$(CONFIG_GENERIC_FIND_FIRST_BIT) += find_next_bit.o
lib-$(CONFIG_GENERIC_FIND_NEXT_BIT) += find_next_bit.o
obj-$(CONFIG_GENERIC_FIND_LAST_BIT) += find_last_bit.o
+
+CFLAGS_hweight.o = $(subst $(quote),,$(CONFIG_ARCH_HWEIGHT_CFLAGS))
obj-$(CONFIG_GENERIC_HWEIGHT) += hweight.o
+
obj-$(CONFIG_LOCK_KERNEL) += kernel_lock.o
obj-$(CONFIG_BTREE) += btree.o
obj-$(CONFIG_DEBUG_PREEMPT) += smp_processor_id.o
diff --git a/lib/hweight.c b/lib/hweight.c
index a6927e7..3c79d50 100644
--- a/lib/hweight.c
+++ b/lib/hweight.c
@@ -9,7 +9,7 @@
* The Hamming Weight of a number is the total number of bits set in it.
*/
-unsigned int __arch_hweight32(unsigned int w)
+unsigned int __sw_hweight32(unsigned int w)
{
#ifdef ARCH_HAS_FAST_MULTIPLIER
w -= (w >> 1) & 0x55555555;
@@ -24,30 +24,30 @@ unsigned int __arch_hweight32(unsigned int w)
return (res + (res >> 16)) & 0x000000FF;
#endif
}
-EXPORT_SYMBOL(__arch_hweight32);
+EXPORT_SYMBOL(__sw_hweight32);
-unsigned int __arch_hweight16(unsigned int w)
+unsigned int __sw_hweight16(unsigned int w)
{
unsigned int res = w - ((w >> 1) & 0x5555);
res = (res & 0x3333) + ((res >> 2) & 0x3333);
res = (res + (res >> 4)) & 0x0F0F;
return (res + (res >> 8)) & 0x00FF;
}
-EXPORT_SYMBOL(__arch_hweight16);
+EXPORT_SYMBOL(__sw_hweight16);
-unsigned int __arch_hweight8(unsigned int w)
+unsigned int __sw_hweight8(unsigned int w)
{
unsigned int res = w - ((w >> 1) & 0x55);
res = (res & 0x33) + ((res >> 2) & 0x33);
return (res + (res >> 4)) & 0x0F;
}
-EXPORT_SYMBOL(__arch_hweight8);
+EXPORT_SYMBOL(__sw_hweight8);
-unsigned long __arch_hweight64(__u64 w)
+unsigned long __sw_hweight64(__u64 w)
{
#if BITS_PER_LONG == 32
- return __arch_hweight32((unsigned int)(w >> 32)) +
- __arch_hweight32((unsigned int)w);
+ return __sw_hweight32((unsigned int)(w >> 32)) +
+ __sw_hweight32((unsigned int)w);
#elif BITS_PER_LONG == 64
#ifdef ARCH_HAS_FAST_MULTIPLIER
w -= (w >> 1) & 0x5555555555555555ul;
@@ -64,4 +64,4 @@ unsigned long __arch_hweight64(__u64 w)
#endif
#endif
}
-EXPORT_SYMBOL(__arch_hweight64);
+EXPORT_SYMBOL(__sw_hweight64);
diff --git a/scripts/Makefile.lib b/scripts/Makefile.lib
index f9bdf26..cbcd654 100644
--- a/scripts/Makefile.lib
+++ b/scripts/Makefile.lib
@@ -245,3 +245,7 @@ quiet_cmd_lzo = LZO $@
cmd_lzo = (cat $(filter-out FORCE,$^) | \
lzop -9 && $(call size_append, $(filter-out FORCE,$^))) > $@ || \
(rm -f $@ ; false)
+
+# misc stuff
+# ---------------------------------------------------------------------------
+quote:="
--
1.7.0.2
--
Regards/Gruss,
Boris.
--
Advanced Micro Devices, Inc.
Operating Systems Research Center
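As a closing sanity check, the plain (non-multiplier) software fallback
drops straight into userspace for testing against gcc's builtin. A
sketch; the body is the standard shift-and-add formulation, most of
which the hunk above elides as unchanged diff context:

	#include <assert.h>
	#include <stdio.h>

	static unsigned int sw_hweight32(unsigned int w)
	{
		unsigned int res = w - ((w >> 1) & 0x55555555);

		res = (res & 0x33333333) + ((res >> 2) & 0x33333333);
		res = (res + (res >> 4)) & 0x0F0F0F0F;
		res = res + (res >> 8);
		return (res + (res >> 16)) & 0x000000FF;
	}

	int main(void)
	{
		unsigned int v[] = { 0, 1, 0xffff, 0xdeadbeef, ~0u };
		unsigned int i;

		for (i = 0; i < sizeof(v) / sizeof(v[0]); i++)
			assert(sw_hweight32(v[i]) ==
			       (unsigned int)__builtin_popcount(v[i]));

		puts("ok");
		return 0;
	}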
^ permalink raw reply related [flat|nested] 37+ messages in thread
end of thread, other threads:[~2010-03-18 11:20 UTC | newest]
Thread overview: 37+ messages
[not found] <4B6C93A2.1090302@zytor.com>
[not found] ` <20100206093659.GA28326@aftab>
[not found] ` <4B6E1DA3.50204@zytor.com>
[not found] ` <20100208092845.GB12618@a1.tnic>
[not found] ` <4B6FDAED.9060204@zytor.com>
[not found] ` <20100208095945.GA14740@a1.tnic>
[not found] ` <20100211172424.GB19779@aftab>
[not found] ` <4B743F7D.3090605@zytor.com>
[not found] ` <20100212170649.GC3114@aftab>
[not found] ` <4B758FC0.1020600@zytor.com>
[not found] ` <20100212174751.GD3114@aftab>
2010-02-12 19:05 ` [PATCH 2/5] bitops: compile time optimization for hweight_long(CONSTANT) H. Peter Anvin
2010-02-17 13:57 ` Michal Marek
2010-02-17 17:20 ` Borislav Petkov
2010-02-17 17:31 ` Michal Marek
2010-02-17 17:34 ` Borislav Petkov
2010-02-17 17:39 ` Michal Marek
2010-02-18 6:19 ` Borislav Petkov
2010-02-19 14:22 ` [PATCH] x86: Add optimized popcnt variants Borislav Petkov
2010-02-19 16:06 ` H. Peter Anvin
2010-02-19 16:45 ` Borislav Petkov
2010-02-19 16:53 ` H. Peter Anvin
2010-02-22 14:17 ` Borislav Petkov
2010-02-22 17:21 ` H. Peter Anvin
2010-02-22 18:49 ` Borislav Petkov
2010-02-22 19:55 ` H. Peter Anvin
2010-02-23 6:37 ` Borislav Petkov
2010-02-23 15:58 ` Borislav Petkov
2010-02-23 17:34 ` H. Peter Anvin
2010-02-23 17:54 ` Borislav Petkov
2010-02-23 18:17 ` H. Peter Anvin
2010-02-23 19:06 ` Borislav Petkov
2010-02-26 5:27 ` H. Peter Anvin
2010-02-26 7:47 ` Borislav Petkov
2010-02-26 17:48 ` H. Peter Anvin
2010-02-27 8:28 ` Borislav Petkov
2010-02-27 20:00 ` H. Peter Anvin
2010-03-09 15:36 ` Borislav Petkov
2010-03-09 15:50 ` Peter Zijlstra
2010-03-09 16:23 ` Borislav Petkov
2010-03-09 16:32 ` Peter Zijlstra
2010-03-09 17:32 ` Borislav Petkov
2010-03-09 17:37 ` Peter Zijlstra
2010-03-18 11:17 ` Borislav Petkov
2010-03-18 11:19 ` [PATCH 1/2] bitops: Optimize hweight() by making use of compile-time evaluation Borislav Petkov
2010-03-18 11:20 ` [PATCH 2/2] x86: Add optimized popcnt variants Borislav Petkov
2010-02-18 10:51 ` [PATCH 2/5] bitops: compile time optimization for hweight_long(CONSTANT) Peter Zijlstra
2010-02-18 11:51 ` Borislav Petkov