linux-fsdevel.vger.kernel.org archive mirror
* [PATCH 0/5] [RESEND] FMODE_NONOTIFY and FMODE_NEG_OFFSET bits
@ 2010-01-30  9:45 Wu Fengguang
  2010-01-30  9:45 ` [PATCH 1/5] fanotify: fix FMODE_NONOTIFY bit number Wu Fengguang
                   ` (4 more replies)
  0 siblings, 5 replies; 81+ messages in thread
From: Wu Fengguang @ 2010-01-30  9:45 UTC (permalink / raw)
  To: Andrew Morton; +Cc: Wu Fengguang, LKML, Al Viro, linux-fsdevel

Andrew,

Resend the patches related to FMODE_* and O_*, rebased to the -mm tree.

O_* and FMODE_NONOTIFY collision fix/check
	[PATCH 1/5] fanotify: fix FMODE_NONOTIFY bit number
	[PATCH 2/5] bitops: compile time optimization for hweight_long(CONSTANT)
	[PATCH 3/5] vfs: O_* bit numbers uniqueness check

allow negative f_pos for /dev/kmem
	[PATCH 4/5] vfs: introduce FMODE_NEG_OFFSET for allowing negative f_pos
	[PATCH 5/5] devmem: dont allow seek to last page

Thanks,
Fengguang

^ permalink raw reply	[flat|nested] 81+ messages in thread

* [PATCH 1/5] fanotify: fix FMODE_NONOTIFY bit number
  2010-01-30  9:45 [PATCH 0/5] [RESEND] FMODE_NONOTIFY and FMODE_NEG_OFFSET bits Wu Fengguang
@ 2010-01-30  9:45 ` Wu Fengguang
  2010-02-01 20:44   ` Andrew Morton
  2010-01-30  9:45 ` [PATCH 2/5] bitops: compile time optimization for hweight_long(CONSTANT) Wu Fengguang
                   ` (3 subsequent siblings)
  4 siblings, 1 reply; 81+ messages in thread
From: Wu Fengguang @ 2010-01-30  9:45 UTC (permalink / raw)
  To: Andrew Morton; +Cc: Wu Fengguang, LKML, Eric Paris, Al Viro, linux-fsdevel

[-- Attachment #1: fanotify-bit-fix --]
[-- Type: text/plain, Size: 1181 bytes --]

FMODE_NONOTIFY=0x800000 collides with __O_SYNC in sparc,
so change it to 0x1000000.

CC: Eric Paris <eparis@redhat.com>
Signed-off-by: Wu Fengguang <fengguang.wu@intel.com>
---
 include/asm-generic/fcntl.h |    2 +-
 include/linux/fs.h          |    2 +-
 2 files changed, 2 insertions(+), 2 deletions(-)

--- linux-mm.orig/include/asm-generic/fcntl.h	2010-01-13 10:16:27.000000000 +0800
+++ linux-mm/include/asm-generic/fcntl.h	2010-01-30 17:40:48.000000000 +0800
@@ -5,7 +5,7 @@
 
 /*
  * FMODE_EXEC is 0x20
- * FMODE_NONOTIFY is 0x800000
+ * FMODE_NONOTIFY is 0x1000000
  * These cannot be used by userspace O_* until internal and external open
  * flags are split.
  * -Eric Paris
--- linux-mm.orig/include/linux/fs.h	2010-01-30 17:14:14.000000000 +0800
+++ linux-mm/include/linux/fs.h	2010-01-30 17:41:04.000000000 +0800
@@ -91,7 +91,7 @@ struct inodes_stat_t {
 #define FMODE_RANDOM		((__force fmode_t)0x1000)
 
 /* File was opened by fanotify and shouldn't generate fanotify events */
-#define FMODE_NONOTIFY		((__force fmode_t)0x800000)
+#define FMODE_NONOTIFY		((__force fmode_t)0x1000000)
 
 /*
  * The below are the various read and write types that we support. Some of

^ permalink raw reply	[flat|nested] 81+ messages in thread

* [PATCH 2/5] bitops: compile time optimization for hweight_long(CONSTANT)
  2010-01-30  9:45 [PATCH 0/5] [RESEND] FMODE_NONOTIFY and FMODE_NEG_OFFSET bits Wu Fengguang
  2010-01-30  9:45 ` [PATCH 1/5] fanotify: fix FMODE_NONOTIFY bit number Wu Fengguang
@ 2010-01-30  9:45 ` Wu Fengguang
  2010-02-01 20:48   ` Andrew Morton
  2010-01-30  9:45 ` [PATCH 3/5] vfs: O_* bit numbers uniqueness check Wu Fengguang
                   ` (2 subsequent siblings)
  4 siblings, 1 reply; 81+ messages in thread
From: Wu Fengguang @ 2010-01-30  9:45 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Wu Fengguang, LKML, Jamie Lokier, Roland Dreier, Al Viro,
	linux-fsdevel

[-- Attachment #1: constant-hweight32.patch --]
[-- Type: text/plain, Size: 872 bytes --]

This allows use of hweight_long() in BUILD_BUG_ON().
Suggested by Jamie.
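
For illustration only (not part of this diff; it mirrors the use the
next patch in the series makes of it): with a constant argument the
macro form reduces to __builtin_popcountl(), which GCC folds to an
integer constant expression, so something like the following compiles
to nothing at runtime and breaks the build if the mask ever loses a bit:

	BUILD_BUG_ON(hweight_long(O_CREAT | O_EXCL | O_TRUNC) != 3);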

CC: Jamie Lokier <jamie@shareable.org>
CC: Roland Dreier <rdreier@cisco.com>
Signed-off-by: Wu Fengguang <fengguang.wu@intel.com>
---
 include/linux/bitops.h |   12 ++++++++----
 1 file changed, 8 insertions(+), 4 deletions(-)

--- linux-mm.orig/include/linux/bitops.h	2009-07-20 20:10:19.000000000 +0800
+++ linux-mm/include/linux/bitops.h	2010-01-30 17:41:15.000000000 +0800
@@ -40,10 +40,14 @@ static __inline__ int get_count_order(un
 	return order;
 }
 
-static inline unsigned long hweight_long(unsigned long w)
-{
-	return sizeof(w) == 4 ? hweight32(w) : hweight64(w);
-}
+#define hweight_long(x)				\
+(						\
+	__builtin_constant_p(x) ?		\
+		__builtin_popcountl(x) :	\
+		(sizeof(x) <= 4 ?		\
+			 hweight32(x) :		\
+			 hweight64(x))		\
+)
 
 /**
  * rol32 - rotate a 32-bit value left

^ permalink raw reply	[flat|nested] 81+ messages in thread

* [PATCH 3/5] vfs: O_* bit numbers uniqueness check
  2010-01-30  9:45 [PATCH 0/5] [RESEND] FMODE_NONOTIFY and FMODE_NEG_OFFSET bits Wu Fengguang
  2010-01-30  9:45 ` [PATCH 1/5] fanotify: fix FMODE_NONOTIFY bit number Wu Fengguang
  2010-01-30  9:45 ` [PATCH 2/5] bitops: compile time optimization for hweight_long(CONSTANT) Wu Fengguang
@ 2010-01-30  9:45 ` Wu Fengguang
  2010-01-30  9:45 ` [PATCH 4/5] vfs: introduce FMODE_NEG_OFFSET for allowing negative f_pos Wu Fengguang
  2010-01-30  9:45 ` [PATCH 5/5] devmem: dont allow seek to last page Wu Fengguang
  4 siblings, 0 replies; 81+ messages in thread
From: Wu Fengguang @ 2010-01-30  9:45 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Wu Fengguang, LKML, David Miller, Stephen Rothwell, Al Viro,
	Christoph Hellwig, Eric Paris, Roland Dreier, Jamie Lokier,
	Andreas Schwab, Al Viro, linux-fsdevel

[-- Attachment #1: fcntl-bit-check.patch --]
[-- Type: text/plain, Size: 2105 bytes --]

The O_* bit numbers are defined in 20+ arch/* headers and can silently overlap.
Add a compile-time check to ensure their uniqueness, as suggested by David
Miller.

v4: use the nice hweight_long() (suggested by Jamie)
    split O_SYNC to (__O_SYNC | O_DSYNC) (suggested by Andreas)
    take away the FMODE_* and O_RANDOM bits
v3: change to BUILD_BUG_ON() (suggested by Roland)
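
Note on the O_SYNC split mentioned in the v4 entry above (illustrative
context only, not part of this patch): since the O_SYNC/O_DSYNC rework,
O_SYNC is a two-bit composite, which is why the check counts __O_SYNC
and O_DSYNC individually instead of counting O_SYNC itself:

	/* asm-generic/fcntl.h after the split: O_SYNC is no longer a
	 * single bit but the OR of two separately allocated bits */
	#define O_SYNC	(__O_SYNC | O_DSYNC)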

CC: David Miller <davem@davemloft.net>
CC: Stephen Rothwell <sfr@canb.auug.org.au>
CC: Al Viro <viro@zeniv.linux.org.uk>
CC: Christoph Hellwig <hch@infradead.org>
CC: Eric Paris <eparis@redhat.com>
CC: Roland Dreier <rdreier@cisco.com>
CC: Jamie Lokier <jamie@shareable.org>
CC: Andreas Schwab <schwab@linux-m68k.org>
Signed-off-by: Wu Fengguang <fengguang.wu@intel.com>
---
 fs/fcntl.c                  |   14 ++++++++++++--
 include/asm-generic/fcntl.h |    2 ++
 2 files changed, 14 insertions(+), 2 deletions(-)

--- linux-mm.orig/fs/fcntl.c	2010-01-30 17:14:13.000000000 +0800
+++ linux-mm/fs/fcntl.c	2010-01-30 17:41:36.000000000 +0800
@@ -741,11 +741,21 @@ void kill_fasync(struct fasync_struct **
 }
 EXPORT_SYMBOL(kill_fasync);
 
-static int __init fasync_init(void)
+static int __init fcntl_init(void)
 {
+	/* please add new bits here to ensure allocation uniqueness */
+	BUILD_BUG_ON(17 != hweight_long(
+		O_RDONLY	| O_WRONLY	| O_RDWR	|
+		O_CREAT		| O_EXCL	| O_NOCTTY	|
+		O_TRUNC		| O_APPEND	| O_NONBLOCK	|
+		__O_SYNC	| O_DSYNC	| FASYNC	|
+		O_DIRECT	| O_LARGEFILE	| O_DIRECTORY	|
+		O_NOFOLLOW	| O_NOATIME	| O_CLOEXEC
+		));
+
 	fasync_cache = kmem_cache_create("fasync_cache",
 		sizeof(struct fasync_struct), 0, SLAB_PANIC, NULL);
 	return 0;
 }
 
-module_init(fasync_init)
+module_init(fcntl_init)
--- linux-mm.orig/include/asm-generic/fcntl.h	2010-01-30 17:40:48.000000000 +0800
+++ linux-mm/include/asm-generic/fcntl.h	2010-01-30 17:41:36.000000000 +0800
@@ -4,6 +4,8 @@
 #include <linux/types.h>
 
 /*
+ * When introducing new O_* bits, please check its uniqueness in fcntl_init().
+ *
  * FMODE_EXEC is 0x20
  * FMODE_NONOTIFY is 0x1000000
  * These cannot be used by userspace O_* until internal and external open

^ permalink raw reply	[flat|nested] 81+ messages in thread

* [PATCH 4/5] vfs: introduce FMODE_NEG_OFFSET for allowing negative f_pos
  2010-01-30  9:45 [PATCH 0/5] [RESEND] FMODE_NONOTIFY and FMODE_NEG_OFFSET bits Wu Fengguang
                   ` (2 preceding siblings ...)
  2010-01-30  9:45 ` [PATCH 3/5] vfs: O_* bit numbers uniqueness check Wu Fengguang
@ 2010-01-30  9:45 ` Wu Fengguang
  2010-01-30  9:45 ` [PATCH 5/5] devmem: dont allow seek to last page Wu Fengguang
  4 siblings, 0 replies; 81+ messages in thread
From: Wu Fengguang @ 2010-01-30  9:45 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Wu Fengguang, LKML, Al Viro, Heiko Carstens, KAMEZAWA Hiroyuki,
	linux-fsdevel

[-- Attachment #1: f_pos-fix --]
[-- Type: text/plain, Size: 3728 bytes --]

From: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>

Currently, rw_verify_area() checks whether f_pos is negative and, if it
is, returns -EINVAL.

But some special files such as /dev/(k)mem and /proc/<pid>/mem
legitimately use negative offsets, and with that check we can't access
the file (device) at all via read/write.

So introduce FMODE_NEG_OFFSET to allow negative file offsets.
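
To make the motivation concrete, here is a small illustrative userspace
program (not part of the patch; the example address assumes the x86_64
direct mapping of that era, which starts around 0xffff880000000000):

	#include <stdio.h>

	int main(void)
	{
		/*
		 * /dev/kmem offsets are kernel virtual addresses; on x86_64
		 * these have the top bits set, so as a signed loff_t they are
		 * negative and used to be rejected by rw_verify_area().
		 */
		unsigned long long kaddr = 0xffff880000000000ULL;
		long long as_loff = (long long)kaddr;

		printf("%#llx as loff_t = %lld\n", kaddr, as_loff);
		return 0;
	}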

Changelog: v5->v6
 - use FMODE_NEG_OFFSET (suggested by Al)
 - rebased onto 2.6.33-rc1

Changelog: v4->v5
 - clean up patches for /dev/mem.
 - rebased onto 2.6.32-rc1

Changelog: v3->v4
 - make changes in mem.c aligned.
 - change __negative_fpos_check() to return int. 
 - fixed bug in "pos" check.
 - added comments.

Changelog: v2->v3
 - fixed bug in rw_verify_area (it cannot be compiled)

CC: Al Viro <viro@ZenIV.linux.org.uk>
CC: Heiko Carstens <heiko.carstens@de.ibm.com>
Signed-off-by: Wu Fengguang <fengguang.wu@intel.com>
Signed-off-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
---
 drivers/char/mem.c |    4 ++++
 fs/proc/base.c     |    2 ++
 fs/read_write.c    |   21 +++++++++++++++++++--
 include/linux/fs.h |    3 +++
 4 files changed, 28 insertions(+), 2 deletions(-)

--- linux-mm.orig/fs/read_write.c	2010-01-22 11:20:29.000000000 +0800
+++ linux-mm/fs/read_write.c	2010-01-30 17:41:51.000000000 +0800
@@ -205,6 +205,20 @@ bad:
 }
 #endif
 
+static int
+__negative_fpos_check(struct file *file, loff_t pos, size_t count)
+{
+	/*
+	 * pos or pos+count is negative here, check overflow.
+	 * too big "count" will be caught in rw_verify_area().
+	 */
+	if ((pos < 0) && (pos + count < pos))
+		return -EOVERFLOW;
+	if (file->f_mode & FMODE_NEG_OFFSET)
+		return 0;
+	return -EINVAL;
+}
+
 /*
  * rw_verify_area doesn't like huge counts. We limit
  * them to something that fits in "int" so that others
@@ -222,8 +236,11 @@ int rw_verify_area(int read_write, struc
 	if (unlikely((ssize_t) count < 0))
 		return retval;
 	pos = *ppos;
-	if (unlikely((pos < 0) || (loff_t) (pos + count) < 0))
-		return retval;
+	if (unlikely((pos < 0) || (loff_t) (pos + count) < 0)) {
+		retval = __negative_fpos_check(file, pos, count);
+		if (retval)
+			return retval;
+	}
 
 	if (unlikely(inode->i_flock && mandatory_lock(inode))) {
 		retval = locks_mandatory_area(
--- linux-mm.orig/include/linux/fs.h	2010-01-30 17:41:04.000000000 +0800
+++ linux-mm/include/linux/fs.h	2010-01-30 17:41:51.000000000 +0800
@@ -93,6 +93,9 @@ struct inodes_stat_t {
 /* File was opened by fanotify and shouldn't generate fanotify events */
 #define FMODE_NONOTIFY		((__force fmode_t)0x1000000)
 
+/* File is huge (eg. /dev/kmem): treat loff_t as unsigned */
+#define FMODE_NEG_OFFSET	((__force fmode_t)0x2000)
+
 /*
  * The below are the various read and write types that we support. Some of
  * them include behavioral modifiers that send information down to the
--- linux-mm.orig/drivers/char/mem.c	2010-01-30 17:23:31.000000000 +0800
+++ linux-mm/drivers/char/mem.c	2010-01-30 17:41:51.000000000 +0800
@@ -882,6 +882,10 @@ static int memory_open(struct inode *ino
 	if (dev->dev_info)
 		filp->f_mapping->backing_dev_info = dev->dev_info;
 
+	/* Is /dev/mem or /dev/kmem ? */
+	if (dev->dev_info == &directly_mappable_cdev_bdi)
+		filp->f_mode |= FMODE_NEG_OFFSET;
+
 	if (dev->fops->open)
 		return dev->fops->open(inode, filp);
 
--- linux-mm.orig/fs/proc/base.c	2010-01-30 17:14:14.000000000 +0800
+++ linux-mm/fs/proc/base.c	2010-01-30 17:41:51.000000000 +0800
@@ -880,6 +880,8 @@ static const struct file_operations proc
 static int mem_open(struct inode* inode, struct file* file)
 {
 	file->private_data = (void*)((long)current->self_exec_id);
+	/* OK to pass negative loff_t, we can catch out-of-range */
+	file->f_mode |= FMODE_NEG_OFFSET;
 	return 0;
 }
 



^ permalink raw reply	[flat|nested] 81+ messages in thread

* [PATCH 5/5] devmem: dont allow seek to last page
  2010-01-30  9:45 [PATCH 0/5] [RESEND] FMODE_NONOTIFY and FMODE_NEG_OFFSET bits Wu Fengguang
                   ` (3 preceding siblings ...)
  2010-01-30  9:45 ` [PATCH 4/5] vfs: introduce FMODE_NEG_OFFSET for allowing negative f_pos Wu Fengguang
@ 2010-01-30  9:45 ` Wu Fengguang
  4 siblings, 0 replies; 81+ messages in thread
From: Wu Fengguang @ 2010-01-30  9:45 UTC (permalink / raw)
  To: Andrew Morton; +Cc: Wu Fengguang, LKML, OGAWA Hirofumi, Al Viro, linux-fsdevel

[-- Attachment #1: mem-seek-fix --]
[-- Type: text/plain, Size: 1378 bytes --]

So as to return a uniform error -EOVERFLOW instead of a random one:

# kmem-seek 0xfffffffffffffff0
seek /dev/kmem: Device or resource busy
# kmem-seek 0xfffffffffffffff1
seek /dev/kmem: Block device required

Suggested by OGAWA Hirofumi.
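
For context (illustrative only, not part of the patch): the
">= ~0xFFFULL" check in the hunk below exists because offsets in the
last page alias the kernel's error-return range, as this small program
shows:

	#include <stdio.h>

	int main(void)
	{
		/*
		 * A "successful" lseek() returning 0xfffffffffffffff7 would be
		 * indistinguishable from -EBADF (-9) to userspace, which is
		 * why the whole last 0x1000-byte page of offsets is refused.
		 */
		long long ebadf = -9;
		unsigned long long raw = (unsigned long long)ebadf;

		printf("%#llx in last page: %s\n",
		       raw, raw >= ~0xFFFULL ? "yes" : "no");
		return 0;
	}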

CC: OGAWA Hirofumi <hirofumi@mail.parknet.co.jp>
Reviewed-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Signed-off-by: Wu Fengguang <fengguang.wu@intel.com>
---
 drivers/char/mem.c |   19 +++++++++++++------
 1 file changed, 13 insertions(+), 6 deletions(-)

--- linux-mm.orig/drivers/char/mem.c	2010-01-30 17:41:51.000000000 +0800
+++ linux-mm/drivers/char/mem.c	2010-01-30 17:42:29.000000000 +0800
@@ -721,16 +721,23 @@ static loff_t memory_lseek(struct file *
 
 	mutex_lock(&file->f_path.dentry->d_inode->i_mutex);
 	switch (orig) {
-		case 0:
+		case SEEK_CUR:
+			offset += file->f_pos;
+			if ((unsigned long long)offset <
+			    (unsigned long long)file->f_pos) {
+				ret = -EOVERFLOW;
+				break;
+			}
+		case SEEK_SET:
+			/* to avoid userland mistaking f_pos=-9 as -EBADF=-9 */
+			if ((unsigned long long)offset >= ~0xFFFULL) {
+				ret = -EOVERFLOW;
+				break;
+			}
 			file->f_pos = offset;
 			ret = file->f_pos;
 			force_successful_syscall_return();
 			break;
-		case 1:
-			file->f_pos += offset;
-			ret = file->f_pos;
-			force_successful_syscall_return();
-			break;
 		default:
 			ret = -EINVAL;
 	}

^ permalink raw reply	[flat|nested] 81+ messages in thread

* Re: [PATCH 1/5] fanotify: fix FMODE_NONOTIFY bit number
  2010-01-30  9:45 ` [PATCH 1/5] fanotify: fix FMODE_NONOTIFY bit number Wu Fengguang
@ 2010-02-01 20:44   ` Andrew Morton
  0 siblings, 0 replies; 81+ messages in thread
From: Andrew Morton @ 2010-02-01 20:44 UTC (permalink / raw)
  To: Wu Fengguang; +Cc: LKML, Eric Paris, Al Viro, linux-fsdevel

On Sat, 30 Jan 2010 17:45:16 +0800
Wu Fengguang <fengguang.wu@intel.com> wrote:

> FMODE_NONOTIFY=0x800000 collides with __O_SYNC in sparc,
> so change it to 0x1000000.
> 
> CC: Eric Paris <eparis@redhat.com>
> Signed-off-by: Wu Fengguang <fengguang.wu@intel.com>
> ---
>  include/asm-generic/fcntl.h |    2 +-
>  include/linux/fs.h          |    2 +-
>  2 files changed, 2 insertions(+), 2 deletions(-)
> 
> --- linux-mm.orig/include/asm-generic/fcntl.h	2010-01-13 10:16:27.000000000 +0800
> +++ linux-mm/include/asm-generic/fcntl.h	2010-01-30 17:40:48.000000000 +0800
> @@ -5,7 +5,7 @@
>  
>  /*
>   * FMODE_EXEC is 0x20
> - * FMODE_NONOTIFY is 0x800000
> + * FMODE_NONOTIFY is 0x1000000
>   * These cannot be used by userspace O_* until internal and external open
>   * flags are split.
>   * -Eric Paris
> --- linux-mm.orig/include/linux/fs.h	2010-01-30 17:14:14.000000000 +0800
> +++ linux-mm/include/linux/fs.h	2010-01-30 17:41:04.000000000 +0800
> @@ -91,7 +91,7 @@ struct inodes_stat_t {
>  #define FMODE_RANDOM		((__force fmode_t)0x1000)
>  
>  /* File was opened by fanotify and shouldn't generate fanotify events */
> -#define FMODE_NONOTIFY		((__force fmode_t)0x800000)
> +#define FMODE_NONOTIFY		((__force fmode_t)0x1000000)
>  
>  /*
>   * The below are the various read and write types that we support. Some of
> 

Eric, please fix up "vfs: introduce FMODE_NONOTIFY" in your tree?

^ permalink raw reply	[flat|nested] 81+ messages in thread

* Re: [PATCH 2/5] bitops: compile time optimization for hweight_long(CONSTANT)
  2010-01-30  9:45 ` [PATCH 2/5] bitops: compile time optimization for hweight_long(CONSTANT) Wu Fengguang
@ 2010-02-01 20:48   ` Andrew Morton
  2010-02-03 13:39     ` Wu Fengguang
  0 siblings, 1 reply; 81+ messages in thread
From: Andrew Morton @ 2010-02-01 20:48 UTC (permalink / raw)
  To: Wu Fengguang
  Cc: LKML, Jamie Lokier, Roland Dreier, Al Viro, linux-fsdevel,
	Ingo Molnar, Peter Zijlstra

On Sat, 30 Jan 2010 17:45:17 +0800
Wu Fengguang <fengguang.wu@intel.com> wrote:

> This allows use of hweight_long() in BUILD_BUG_ON().
> Suggested by Jamie.
> 
> CC: Jamie Lokier <jamie@shareable.org>
> CC: Roland Dreier <rdreier@cisco.com>
> Signed-off-by: Wu Fengguang <fengguang.wu@intel.com>
> ---
>  include/linux/bitops.h |   12 ++++++++----
>  1 file changed, 8 insertions(+), 4 deletions(-)
> 
> --- linux-mm.orig/include/linux/bitops.h	2009-07-20 20:10:19.000000000 +0800
> +++ linux-mm/include/linux/bitops.h	2010-01-30 17:41:15.000000000 +0800
> @@ -40,10 +40,14 @@ static __inline__ int get_count_order(un
>  	return order;
>  }
>  
> -static inline unsigned long hweight_long(unsigned long w)
> -{
> -	return sizeof(w) == 4 ? hweight32(w) : hweight64(w);
> -}
> +#define hweight_long(x)				\
> +(						\
> +	__builtin_constant_p(x) ?		\
> +		__builtin_popcountl(x) :	\
> +		(sizeof(x) <= 4 ?		\
> +			 hweight32(x) :		\
> +			 hweight64(x))		\
> +)
>  
>  /**
>   * rol32 - rotate a 32-bit value left

Peter's been mucking with a compile-time HWEIGHT().  An outdated
version of that is presently in linux-next.

(I wonder if it could have used __builtin_popcount)

(I wonder which gcc versions support __builtin_popcount)

Anyway, I suspect you should be using that rather than tweaking
hweight_long().

^ permalink raw reply	[flat|nested] 81+ messages in thread

* Re: [PATCH 2/5] bitops: compile time optimization for hweight_long(CONSTANT)
  2010-02-01 20:48   ` Andrew Morton
@ 2010-02-03 13:39     ` Wu Fengguang
  2010-02-03 15:08       ` Andrew Morton
  2010-02-03 17:10       ` H. Peter Anvin
  0 siblings, 2 replies; 81+ messages in thread
From: Wu Fengguang @ 2010-02-03 13:39 UTC (permalink / raw)
  To: Andrew Morton
  Cc: LKML, Jamie Lokier, Roland Dreier, Al Viro,
	linux-fsdevel@vger.kernel.org, Ingo Molnar, Peter Zijlstra,
	H. Peter Anvin

On Mon, Feb 01, 2010 at 01:48:25PM -0700, Andrew Morton wrote:
> > -static inline unsigned long hweight_long(unsigned long w)
> > -{
> > -	return sizeof(w) == 4 ? hweight32(w) : hweight64(w);
> > -}
> > +#define hweight_long(x)				\
> > +(						\
> > +	__builtin_constant_p(x) ?		\
> > +		__builtin_popcountl(x) :	\
> > +		(sizeof(x) <= 4 ?		\
> > +			 hweight32(x) :		\
> > +			 hweight64(x))		\
> > +)
> >  
> >  /**
> >   * rol32 - rotate a 32-bit value left
> 
> Peter's been mucking with a compile-time HWEIGHT().  An outdated
> version of that is presently in linux-next.

Ah, I found it. Thanks.

> (I wonder if it could have used __builtin_popcount)
> 
> (I wonder which gcc versions support __builtin_popcount)

This is how Jamie Lokier first recommended it to me:

        Checked GCC 3.4.3 / 4.1 / 4.4: It expands as a compile-time
        constant if the argument is a compile-time constant, so can be
        used in BUILD_BUG_ON() and even for array sizes etc.

Is there an official lowest GCC version that Linux supports?

> Anyway, I suspect you should be using that rather than tweaking
> hweight_long().

OK.

But if the GCC builtins can ever be used:

        __builtin_popcount(unsigned int)
        __builtin_popcountl(unsigned long)
        __builtin_popcountll(unsigned long long)

it would be more natural to use hweight_long() or even introduce
hweight_longlong() for (transparent) compile time computation.
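
As a quick self-contained check of Jamie's point (illustrative only;
MY_BUILD_BUG_ON is a stand-in for the kernel macro), the builtin folds
to a constant whenever its argument is one:

	#include <stdio.h>

	#define MY_BUILD_BUG_ON(cond) ((void)sizeof(char[1 - 2 * !!(cond)]))

	int main(void)
	{
		/* both contexts require integer constant expressions, so this
		 * only compiles because __builtin_popcountl() folds here */
		static char sized_by_popcount[__builtin_popcountl(0xf0UL)];

		MY_BUILD_BUG_ON(__builtin_popcountl(0xffUL) != 8);
		printf("sizeof = %zu\n", sizeof(sized_by_popcount));	/* 4 */
		return 0;
	}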

Thanks,
Fengguang

^ permalink raw reply	[flat|nested] 81+ messages in thread

* Re: [PATCH 2/5] bitops: compile time optimization for hweight_long(CONSTANT)
  2010-02-03 13:39     ` Wu Fengguang
@ 2010-02-03 15:08       ` Andrew Morton
  2010-02-03 15:15         ` Peter Zijlstra
  2010-02-03 17:10       ` H. Peter Anvin
  1 sibling, 1 reply; 81+ messages in thread
From: Andrew Morton @ 2010-02-03 15:08 UTC (permalink / raw)
  To: Wu Fengguang
  Cc: LKML, Jamie Lokier, Roland Dreier, Al Viro,
	linux-fsdevel@vger.kernel.org, Ingo Molnar, Peter Zijlstra,
	H. Peter Anvin

On Wed, 3 Feb 2010 21:39:51 +0800 Wu Fengguang <fengguang.wu@intel.com> wrote:

> Is there an official lowest GCC version that Linux supports?

Documentation/Changes says gcc-3.2.

^ permalink raw reply	[flat|nested] 81+ messages in thread

* Re: [PATCH 2/5] bitops: compile time optimization for hweight_long(CONSTANT)
  2010-02-03 15:08       ` Andrew Morton
@ 2010-02-03 15:15         ` Peter Zijlstra
  2010-02-03 15:42           ` Andrew Morton
  0 siblings, 1 reply; 81+ messages in thread
From: Peter Zijlstra @ 2010-02-03 15:15 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Wu Fengguang, LKML, Jamie Lokier, Roland Dreier, Al Viro,
	linux-fsdevel@vger.kernel.org, Ingo Molnar, H. Peter Anvin

On Wed, 2010-02-03 at 07:08 -0800, Andrew Morton wrote:
> On Wed, 3 Feb 2010 21:39:51 +0800 Wu Fengguang <fengguang.wu@intel.com> wrote:
> 
> > Is there an official lowest GCC version that Linux supports?
> 
> Documentation/Changes says gcc-3.2.

Anyway, it's not like the macro I used isn't straightforward. There's
not much a builtin can do better.

The only benefit the builtin has is possibly using the machine popcnt
instruction, but we already deal with that in the kernel.
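
For reference, a purely macro-based compile-time popcount (like the
HWEIGHT() macro mentioned earlier in the thread) can look something
like this (an illustrative sketch, not necessarily the exact linux-next
version); each macro remains an integer constant expression when its
argument is one:

	#define HWEIGHT8(w)						\
		((!!((w) & (1ULL << 0))) + (!!((w) & (1ULL << 1))) +	\
		 (!!((w) & (1ULL << 2))) + (!!((w) & (1ULL << 3))) +	\
		 (!!((w) & (1ULL << 4))) + (!!((w) & (1ULL << 5))) +	\
		 (!!((w) & (1ULL << 6))) + (!!((w) & (1ULL << 7))))

	#define HWEIGHT16(w)	(HWEIGHT8(w)  + HWEIGHT8((w) >> 8))
	#define HWEIGHT32(w)	(HWEIGHT16(w) + HWEIGHT16((w) >> 16))
	#define HWEIGHT64(w)	(HWEIGHT32(w) + HWEIGHT32((w) >> 32))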




^ permalink raw reply	[flat|nested] 81+ messages in thread

* Re: [PATCH 2/5] bitops: compile time optimization for hweight_long(CONSTANT)
  2010-02-03 15:15         ` Peter Zijlstra
@ 2010-02-03 15:42           ` Andrew Morton
  2010-02-03 15:47             ` Peter Zijlstra
  2010-02-03 18:14             ` Borislav Petkov
  0 siblings, 2 replies; 81+ messages in thread
From: Andrew Morton @ 2010-02-03 15:42 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Wu Fengguang, LKML, Jamie Lokier, Roland Dreier, Al Viro,
	linux-fsdevel@vger.kernel.org, Ingo Molnar, H. Peter Anvin

On Wed, 03 Feb 2010 16:15:57 +0100 Peter Zijlstra <peterz@infradead.org> wrote:

> On Wed, 2010-02-03 at 07:08 -0800, Andrew Morton wrote:
> > On Wed, 3 Feb 2010 21:39:51 +0800 Wu Fengguang <fengguang.wu@intel.com> wrote:
> > 
> > > Is there an official lowest GCC version that Linux supports?
> > 
> > Documentation/Changes says gcc-3.2.
> 
> Anyway, its not like that macro I used isn't straight forward. Not much
> a builtin can do better.
> 
> The only benefit the builtin has is possibly use machine popcnt
> instructions but we already deal with that in the kernel.

We didn't deal with it on every architecture, which is something which
the compiler extension takes care of.

In fact I can't find anywhere where we dealt with it on x86.

^ permalink raw reply	[flat|nested] 81+ messages in thread

* Re: [PATCH 2/5] bitops: compile time optimization for hweight_long(CONSTANT)
  2010-02-03 15:42           ` Andrew Morton
@ 2010-02-03 15:47             ` Peter Zijlstra
  2010-02-03 17:11               ` H. Peter Anvin
  2010-02-03 18:14             ` Borislav Petkov
  1 sibling, 1 reply; 81+ messages in thread
From: Peter Zijlstra @ 2010-02-03 15:47 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Wu Fengguang, LKML, Jamie Lokier, Roland Dreier, Al Viro,
	linux-fsdevel@vger.kernel.org, Ingo Molnar, H. Peter Anvin

On Wed, 2010-02-03 at 07:42 -0800, Andrew Morton wrote:
> On Wed, 03 Feb 2010 16:15:57 +0100 Peter Zijlstra <peterz@infradead.org> wrote:
> 
> > On Wed, 2010-02-03 at 07:08 -0800, Andrew Morton wrote:
> > > On Wed, 3 Feb 2010 21:39:51 +0800 Wu Fengguang <fengguang.wu@intel.com> wrote:
> > > 
> > > > Is there an official lowest GCC version that Linux supports?
> > > 
> > > Documentation/Changes says gcc-3.2.
> > 
> > Anyway, its not like that macro I used isn't straight forward. Not much
> > a builtin can do better.
> > 
> > The only benefit the builtin has is possibly use machine popcnt
> > instructions but we already deal with that in the kernel.
> 
> We didn't deal with it on every architecture, which is something which
> the compiler extension takes care of.
> 
> In fact I can't find anywhere where we dealt with it on x86.

From what I know, the popcnt insn on x86 is relatively new; hpa was going
to look at supporting it using alternatives.

^ permalink raw reply	[flat|nested] 81+ messages in thread

* Re: [PATCH 2/5] bitops: compile time optimization for hweight_long(CONSTANT)
  2010-02-03 13:39     ` Wu Fengguang
  2010-02-03 15:08       ` Andrew Morton
@ 2010-02-03 17:10       ` H. Peter Anvin
  1 sibling, 0 replies; 81+ messages in thread
From: H. Peter Anvin @ 2010-02-03 17:10 UTC (permalink / raw)
  To: Wu Fengguang
  Cc: Andrew Morton, LKML, Jamie Lokier, Roland Dreier, Al Viro,
	linux-fsdevel@vger.kernel.org, Ingo Molnar, Peter Zijlstra

On 02/03/2010 05:39 AM, Wu Fengguang wrote:
> 
>> (I wonder if it could have used __builtin_popcount)
>>
>> (I wonder which gcc versions support __builtin_popcount)
> 

I think 4.0 or later, but it often generates calls to libgcc, which
would mean adding libgcc functions to the kernel (which some arches
already do).

> This is how Jamie Lokier first recommended it to me:
> 
>         Checked GCC 3.4.3 / 4.1 / 4.4: It expands as a compile-time
>         constant if the argument is a compile-time constant, so can be
>         used in BUILD_BUG_ON() and even for array sizes etc.
> 
> Is there an official lowest GCC version that Linux supports?

3.4, I believe.

	-hpa

-- 
H. Peter Anvin, Intel Open Source Technology Center
I work for Intel.  I don't speak on their behalf.


^ permalink raw reply	[flat|nested] 81+ messages in thread

* Re: [PATCH 2/5] bitops: compile time optimization for hweight_long(CONSTANT)
  2010-02-03 15:47             ` Peter Zijlstra
@ 2010-02-03 17:11               ` H. Peter Anvin
  0 siblings, 0 replies; 81+ messages in thread
From: H. Peter Anvin @ 2010-02-03 17:11 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Andrew Morton, Wu Fengguang, LKML, Jamie Lokier, Roland Dreier,
	Al Viro, linux-fsdevel@vger.kernel.org, Ingo Molnar

On 02/03/2010 07:47 AM, Peter Zijlstra wrote:
> 
> From what I know the popcnt ins on x86 is relatively new, hpa was going
> to look at supporting it using alternatives.
> 

And kernel-specific stuff like alternatives is pretty much the one thing
that gcc *can't* do for us.

	-hpa

-- 
H. Peter Anvin, Intel Open Source Technology Center
I work for Intel.  I don't speak on their behalf.

^ permalink raw reply	[flat|nested] 81+ messages in thread

* Re: [PATCH 2/5] bitops: compile time optimization for hweight_long(CONSTANT)
  2010-02-03 15:42           ` Andrew Morton
  2010-02-03 15:47             ` Peter Zijlstra
@ 2010-02-03 18:14             ` Borislav Petkov
  2010-02-03 18:47               ` Peter Zijlstra
  1 sibling, 1 reply; 81+ messages in thread
From: Borislav Petkov @ 2010-02-03 18:14 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Peter Zijlstra, Wu Fengguang, LKML, Jamie Lokier, Roland Dreier,
	Al Viro, linux-fsdevel@vger.kernel.org, Ingo Molnar,
	H. Peter Anvin

On Wed, Feb 03, 2010 at 07:42:51AM -0800, Andrew Morton wrote:
> We didn't deal with it on every architecture, which is something which
> the compiler extension takes care of.
> 
> In fact I can't find anywhere where we dealt with it on x86.

Yeah, we talked briefly about using hardware popcnt, see thread
beginning at

http://linux.derkeiler.com/Mailing-Lists/Kernel/2009-06/msg00245.html

for example. I did an ftrace of the cpumask_weight() calls in sched.c to
see whether there would be a measurable performance gain, but it didn't
seem so at the time. My numbers showed something like ca. 170 hweight
calls per second, and since the <lib/hweight.c> implementations roughly
translate to ~20 insns (hweight64 to about ~30), the whole thing wasn't
worth the trouble, considering it means checking binutils versions and
slapping in opcodes, or using gcc intrinsics, which involves gcc version
checking.

An alternatives solution based on the CPUID flag could add the
popcnt opcode without checking any toolchain versions, but what is the
replacement instruction going to look like? Something like

alternative("call hweightXX", "popcnt", X86_FEATURE_POPCNT)

by making sure the arg is in some register first?

Hmm..

-- 
Regards/Gruss,
Boris.

--
Advanced Micro Devices, Inc.
Operating Systems Research Center

^ permalink raw reply	[flat|nested] 81+ messages in thread

* Re: [PATCH 2/5] bitops: compile time optimization for hweight_long(CONSTANT)
  2010-02-03 18:14             ` Borislav Petkov
@ 2010-02-03 18:47               ` Peter Zijlstra
  2010-02-03 19:49                 ` H. Peter Anvin
  0 siblings, 1 reply; 81+ messages in thread
From: Peter Zijlstra @ 2010-02-03 18:47 UTC (permalink / raw)
  To: Borislav Petkov
  Cc: Andrew Morton, Wu Fengguang, LKML, Jamie Lokier, Roland Dreier,
	Al Viro, linux-fsdevel@vger.kernel.org, Ingo Molnar,
	H. Peter Anvin

On Wed, 2010-02-03 at 19:14 +0100, Borislav Petkov wrote:

> alternative("call hweightXX", "popcnt", X86_FEATURE_POPCNT)

Make sure to apply a 0xff bitmask to the popcnt r16 call for hweight8(),
and hweight64() needs a bit of magic for 32bit, but yes, something like
that ought to work nicely.

^ permalink raw reply	[flat|nested] 81+ messages in thread

* Re: [PATCH 2/5] bitops: compile time optimization for hweight_long(CONSTANT)
  2010-02-03 18:47               ` Peter Zijlstra
@ 2010-02-03 19:49                 ` H. Peter Anvin
  2010-02-04 15:10                   ` Borislav Petkov
  0 siblings, 1 reply; 81+ messages in thread
From: H. Peter Anvin @ 2010-02-03 19:49 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Borislav Petkov, Andrew Morton, Wu Fengguang, LKML, Jamie Lokier,
	Roland Dreier, Al Viro, linux-fsdevel@vger.kernel.org,
	Ingo Molnar

On 02/03/2010 10:47 AM, Peter Zijlstra wrote:
> On Wed, 2010-02-03 at 19:14 +0100, Borislav Petkov wrote:
> 
>> alternative("call hweightXX", "popcnt", X86_FEATURE_POPCNT)
> 
> Make sure to apply a 0xff bitmask to the popcnt r16 call for hweight8(),
> and hweight64() needs a bit of magic for 32bit, but yes, something like
> that ought to work nicely.
> 

Arguably the "best" option is to have the alternative be a jump to an
out-of-line stub which does the necessary parameter marshalling before
making the actual call.  This technique is already used in a few other places.

	-hpa

^ permalink raw reply	[flat|nested] 81+ messages in thread

* Re: [PATCH 2/5] bitops: compile time optimization for hweight_long(CONSTANT)
  2010-02-03 19:49                 ` H. Peter Anvin
@ 2010-02-04 15:10                   ` Borislav Petkov
  2010-02-04 15:13                     ` Peter Zijlstra
                                       ` (2 more replies)
  0 siblings, 3 replies; 81+ messages in thread
From: Borislav Petkov @ 2010-02-04 15:10 UTC (permalink / raw)
  To: H. Peter Anvin
  Cc: Peter Zijlstra, Andrew Morton, Wu Fengguang, LKML, Jamie Lokier,
	Roland Dreier, Al Viro, linux-fsdevel@vger.kernel.org,
	Ingo Molnar

On Wed, Feb 03, 2010 at 11:49:54AM -0800, H. Peter Anvin wrote:
> On 02/03/2010 10:47 AM, Peter Zijlstra wrote:
> > On Wed, 2010-02-03 at 19:14 +0100, Borislav Petkov wrote:
> > 
> >> alternative("call hweightXX", "popcnt", X86_FEATURE_POPCNT)
> > 
> > Make sure to apply a 0xff bitmask to the popcnt r16 call for hweight8(),
> > and hweight64() needs a bit of magic for 32bit, but yes, something like
> > that ought to work nicely.
> > 
> 
> Arguably the "best" option is to have the alternative being a jump to an
> out-of-line stub which does the necessary parameter marshalling before
> calling a stub.  This technique is already used in a few other places.

Ok, here's a first alpha prototype, completely untested. The asm
output looks ok though. I've added separate 32-bit and 64-bit helpers in
order to dispense with the if-else tests. The hw-popcnt versions are the
opcodes for "popcnt %eax, %eax" and "popcnt %rax, %rax", respectively,
so %rAX has to be preloaded with the bitmask and the computed value
has to be retrieved from there afterwards. And yes, it doesn't look that
elegant, so I'm open to suggestions.

The good thing is, this should work on any toolchain since we don't rely
on the compiler to know about popcnt and we're protected by CPUID flag
so that the hw-popcnt version is used only on processors which support
it.

Please take a good look and let me know what do you guys think.

Thanks.

--
 arch/x86/include/asm/bitops.h |    4 ++
 arch/x86/lib/Makefile         |    2 +-
 arch/x86/lib/popcnt.c         |   62 +++++++++++++++++++++++++++++++++++++++++
 3 files changed, 67 insertions(+), 1 deletions(-)
 create mode 100644 arch/x86/lib/popcnt.c

diff --git a/arch/x86/include/asm/bitops.h b/arch/x86/include/asm/bitops.h
index 02b47a6..deb5013 100644
--- a/arch/x86/include/asm/bitops.h
+++ b/arch/x86/include/asm/bitops.h
@@ -434,6 +434,10 @@ static inline int fls(int x)
 #endif
 	return r + 1;
 }
+
+
+extern int arch_hweight_long(unsigned long);
+
 #endif /* __KERNEL__ */
 
 #undef ADDR
diff --git a/arch/x86/lib/Makefile b/arch/x86/lib/Makefile
index cffd754..c03fe2d 100644
--- a/arch/x86/lib/Makefile
+++ b/arch/x86/lib/Makefile
@@ -22,7 +22,7 @@ lib-y += usercopy_$(BITS).o getuser.o putuser.o
 lib-y += memcpy_$(BITS).o
 lib-$(CONFIG_KPROBES) += insn.o inat.o
 
-obj-y += msr.o msr-reg.o msr-reg-export.o
+obj-y += msr.o msr-reg.o msr-reg-export.o popcnt.o
 
 ifeq ($(CONFIG_X86_32),y)
         obj-y += atomic64_32.o
diff --git a/arch/x86/lib/popcnt.c b/arch/x86/lib/popcnt.c
new file mode 100644
index 0000000..179a6e8
--- /dev/null
+++ b/arch/x86/lib/popcnt.c
@@ -0,0 +1,62 @@
+#include <linux/kernel.h>
+#include <linux/bitops.h>
+
+int _hweight32(void)
+{
+	unsigned long w;
+
+	asm volatile("" : "=a" (w));
+
+	return hweight32(w);
+}
+
+int _hweight64(void)
+{
+	unsigned long w;
+
+	asm volatile("" : "=a" (w));
+
+	return hweight64(w);
+}
+
+int _popcnt32(void)
+{
+
+	unsigned long w;
+
+	asm volatile(".byte 0xf3\n\t.byte 0x0f\n\t.byte 0xb8\n\t.byte 0xc0\n\t"
+			: "=a" (w));
+
+	return w;
+}
+
+int _popcnt64(void)
+{
+
+	unsigned long w;
+
+	asm volatile(".byte 0xf3\n\t.byte 0x48\n\t.byte 0x0f\n\t."
+		     "byte 0xb8\n\t.byte 0xc0\n\t"
+		     : "=a" (w));
+
+	return w;
+}
+
+int arch_hweight_long(unsigned long w)
+{
+	if (sizeof(w) == 4) {
+		asm volatile("movl %[w], %%eax" :: [w] "r" (w));
+		alternative("call _hweight32",
+			    "call _popcnt32",
+			    X86_FEATURE_POPCNT);
+		asm volatile("" : "=a" (w));
+
+	} else {
+		asm volatile("movq %[w], %%rax" :: [w] "r" (w));
+		alternative("call _hweight64",
+			    "call _popcnt64",
+			    X86_FEATURE_POPCNT);
+		asm volatile("" : "=a" (w));
+	}
+	return w;
+}
-- 
1.6.6

-- 
Regards/Gruss,
Boris.

--
Advanced Micro Devices, Inc.
Operating Systems Research Center

^ permalink raw reply related	[flat|nested] 81+ messages in thread

* Re: [PATCH 2/5] bitops: compile time optimization for hweight_long(CONSTANT)
  2010-02-04 15:10                   ` Borislav Petkov
@ 2010-02-04 15:13                     ` Peter Zijlstra
  2010-02-04 15:54                       ` Borislav Petkov
  2010-02-04 15:16                     ` H. Peter Anvin
  2010-02-04 15:39                     ` Brian Gerst
  2 siblings, 1 reply; 81+ messages in thread
From: Peter Zijlstra @ 2010-02-04 15:13 UTC (permalink / raw)
  To: Borislav Petkov
  Cc: H. Peter Anvin, Andrew Morton, Wu Fengguang, LKML, Jamie Lokier,
	Roland Dreier, Al Viro, linux-fsdevel@vger.kernel.org,
	Ingo Molnar

On Thu, 2010-02-04 at 16:10 +0100, Borislav Petkov wrote:
> On Wed, Feb 03, 2010 at 11:49:54AM -0800, H. Peter Anvin wrote:
> > On 02/03/2010 10:47 AM, Peter Zijlstra wrote:
> > > On Wed, 2010-02-03 at 19:14 +0100, Borislav Petkov wrote:
> > > 
> > >> alternative("call hweightXX", "popcnt", X86_FEATURE_POPCNT)
> > > 
> > > Make sure to apply a 0xff bitmask to the popcnt r16 call for hweight8(),
> > > and hweight64() needs a bit of magic for 32bit, but yes, something like
> > > that ought to work nicely.
> > > 
> > 
> > Arguably the "best" option is to have the alternative being a jump to an
> > out-of-line stub which does the necessary parameter marshalling before
> > calling a stub.  This technique is already used in a few other places.
> 
> Ok, here's a first alpha prototype and completely untested. The asm
> output looks ok though. I've added separate 32-bit and 64-bit helpers in
> order to dispense with the if-else tests. The hw-popcnt versions are the
> opcodes for "popcnt %eax, %eax" and "popcnt %rax, %rax", respectively,
> so %rAX has to be preloaded with the bitmask and the computed value
> has to be retrieved from there afterwards. And yes, it looks not that
> elegant so I'm open for suggestions.
> 
> The good thing is, this should work on any toolchain since we don't rely
> on the compiler to know about popcnt and we're protected by CPUID flag
> so that the hw-popcnt version is used only on processors which support
> it.
> 
> Please take a good look and let me know what do you guys think.

> +int arch_hweight_long(unsigned long w)
> +{
> +	if (sizeof(w) == 4) {
> +		asm volatile("movl %[w], %%eax" :: [w] "r" (w));
> +		alternative("call _hweight32",
> +			    "call _popcnt32",
> +			    X86_FEATURE_POPCNT);
> +		asm volatile("" : "=a" (w));
> +
> +	} else {
> +		asm volatile("movq %[w], %%rax" :: [w] "r" (w));
> +		alternative("call _hweight64",
> +			    "call _popcnt64",
> +			    X86_FEATURE_POPCNT);
> +		asm volatile("" : "=a" (w));
> +	}
> +	return w;
> +}

hweight_long() isn't an arch primitive, only __arch_hweight{8,16,32,64}
are.

^ permalink raw reply	[flat|nested] 81+ messages in thread

* Re: [PATCH 2/5] bitops: compile time optimization for hweight_long(CONSTANT)
  2010-02-04 15:10                   ` Borislav Petkov
  2010-02-04 15:13                     ` Peter Zijlstra
@ 2010-02-04 15:16                     ` H. Peter Anvin
  2010-02-04 15:39                     ` Brian Gerst
  2 siblings, 0 replies; 81+ messages in thread
From: H. Peter Anvin @ 2010-02-04 15:16 UTC (permalink / raw)
  To: Borislav Petkov
  Cc: Peter Zijlstra, Andrew Morton, Wu Fengguang, LKML, Jamie Lokier,
	Roland Dreier, Al Viro, linux-fsdevel@vger.kernel.org,
	Ingo Molnar

On 02/04/2010 07:10 AM, Borislav Petkov wrote:
>>
>> Arguably the "best" option is to have the alternative being a jump to an
>> out-of-line stub which does the necessary parameter marshalling before
>> calling a stub.  This technique is already used in a few other places.
> 
> Ok, here's a first alpha prototype and completely untested. The asm
> output looks ok though. I've added separate 32-bit and 64-bit helpers in
> order to dispense with the if-else tests. The hw-popcnt versions are the
> opcodes for "popcnt %eax, %eax" and "popcnt %rax, %rax", respectively,
> so %rAX has to be preloaded with the bitmask and the computed value
> has to be retrieved from there afterwards. And yes, it looks not that
> elegant so I'm open for suggestions.
> 
> The good thing is, this should work on any toolchain since we don't rely
> on the compiler to know about popcnt and we're protected by CPUID flag
> so that the hw-popcnt version is used only on processors which support
> it.
> 
> Please take a good look and let me know what do you guys think.
> 

I think you misunderstood me.  The idea was to use the alternatives
mechanism to put the popcnt instruction inline instead of the call.

	-hpa

-- 
H. Peter Anvin, Intel Open Source Technology Center
I work for Intel.  I don't speak on their behalf.


^ permalink raw reply	[flat|nested] 81+ messages in thread

* Re: [PATCH 2/5] bitops: compile time optimization for hweight_long(CONSTANT)
  2010-02-04 15:10                   ` Borislav Petkov
  2010-02-04 15:13                     ` Peter Zijlstra
  2010-02-04 15:16                     ` H. Peter Anvin
@ 2010-02-04 15:39                     ` Brian Gerst
  2 siblings, 0 replies; 81+ messages in thread
From: Brian Gerst @ 2010-02-04 15:39 UTC (permalink / raw)
  To: Borislav Petkov
  Cc: H. Peter Anvin, Peter Zijlstra, Andrew Morton, Wu Fengguang, LKML,
	Jamie Lokier, Roland Dreier, Al Viro,
	linux-fsdevel@vger.kernel.org, Ingo Molnar

2010/2/4 Borislav Petkov <bp@amd64.org>:
> On Wed, Feb 03, 2010 at 11:49:54AM -0800, H. Peter Anvin wrote:
>> On 02/03/2010 10:47 AM, Peter Zijlstra wrote:
>> > On Wed, 2010-02-03 at 19:14 +0100, Borislav Petkov wrote:
>> >
>> >> alternative("call hweightXX", "popcnt", X86_FEATURE_POPCNT)
>> >
>> > Make sure to apply a 0xff bitmask to the popcnt r16 call for hweight8(),
>> > and hweight64() needs a bit of magic for 32bit, but yes, something like
>> > that ought to work nicely.
>> >
>>
>> Arguably the "best" option is to have the alternative being a jump to an
>> out-of-line stub which does the necessary parameter marshalling before
>> calling a stub.  This technique is already used in a few other places.
>
> Ok, here's a first alpha prototype and completely untested. The asm
> output looks ok though. I've added separate 32-bit and 64-bit helpers in
> order to dispense with the if-else tests. The hw-popcnt versions are the
> opcodes for "popcnt %eax, %eax" and "popcnt %rax, %rax", respectively,
> so %rAX has to be preloaded with the bitmask and the computed value
> has to be retrieved from there afterwards. And yes, it looks not that
> elegant so I'm open for suggestions.
>
> The good thing is, this should work on any toolchain since we don't rely
> on the compiler to know about popcnt and we're protected by CPUID flag
> so that the hw-popcnt version is used only on processors which support
> it.
>
> Please take a good look and let me know what do you guys think.
>
> Thanks.
>
> --
>  arch/x86/include/asm/bitops.h |    4 ++
>  arch/x86/lib/Makefile         |    2 +-
>  arch/x86/lib/popcnt.c         |   62 +++++++++++++++++++++++++++++++++++++++++
>  3 files changed, 67 insertions(+), 1 deletions(-)
>  create mode 100644 arch/x86/lib/popcnt.c
>
> diff --git a/arch/x86/include/asm/bitops.h b/arch/x86/include/asm/bitops.h
> index 02b47a6..deb5013 100644
> --- a/arch/x86/include/asm/bitops.h
> +++ b/arch/x86/include/asm/bitops.h
> @@ -434,6 +434,10 @@ static inline int fls(int x)
>  #endif
>        return r + 1;
>  }
> +
> +
> +extern int arch_hweight_long(unsigned long);
> +
>  #endif /* __KERNEL__ */
>
>  #undef ADDR
> diff --git a/arch/x86/lib/Makefile b/arch/x86/lib/Makefile
> index cffd754..c03fe2d 100644
> --- a/arch/x86/lib/Makefile
> +++ b/arch/x86/lib/Makefile
> @@ -22,7 +22,7 @@ lib-y += usercopy_$(BITS).o getuser.o putuser.o
>  lib-y += memcpy_$(BITS).o
>  lib-$(CONFIG_KPROBES) += insn.o inat.o
>
> -obj-y += msr.o msr-reg.o msr-reg-export.o
> +obj-y += msr.o msr-reg.o msr-reg-export.o popcnt.o
>
>  ifeq ($(CONFIG_X86_32),y)
>         obj-y += atomic64_32.o
> diff --git a/arch/x86/lib/popcnt.c b/arch/x86/lib/popcnt.c
> new file mode 100644
> index 0000000..179a6e8
> --- /dev/null
> +++ b/arch/x86/lib/popcnt.c
> @@ -0,0 +1,62 @@
> +#include <linux/kernel.h>
> +#include <linux/bitops.h>
> +
> +int _hweight32(void)
> +{
> +       unsigned long w;
> +
> +       asm volatile("" : "=a" (w));
> +
> +       return hweight32(w);
> +

You should just use a normal parameter here (use %rdi for 64-bit).
The way you have it written isn't guaranteed to work.  GCC could
clobber %eax/%rax before you read it, although it is highly unlikely.
For a good example of how this should be written, look at
__mutex_fastpath_lock().

--
Brian Gerst

^ permalink raw reply	[flat|nested] 81+ messages in thread

* Re: [PATCH 2/5] bitops: compile time optimization for hweight_long(CONSTANT)
  2010-02-04 15:13                     ` Peter Zijlstra
@ 2010-02-04 15:54                       ` Borislav Petkov
  2010-02-04 16:04                         ` Peter Zijlstra
  0 siblings, 1 reply; 81+ messages in thread
From: Borislav Petkov @ 2010-02-04 15:54 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: H. Peter Anvin, Andrew Morton, Wu Fengguang, LKML, Jamie Lokier,
	Roland Dreier, Al Viro, linux-fsdevel@vger.kernel.org,
	Ingo Molnar

On Thu, Feb 04, 2010 at 04:13:52PM +0100, Peter Zijlstra wrote:
> hweight_long() isn't an arch primitive, only __arch_hweight{8,16,32,64}
> are.

Yeah, I'm still looking for the proper location. hweight_long() is the
generic version so do we want to do the

#ifndef __HAVE_ARCH_POPCNT
static inline unsigned long hweight_long(unsigned long w)
...

#endif

thing and define an x86-specific version?

-- 
Regards/Gruss,
Boris.

--
Advanced Micro Devices, Inc.
Operating Systems Research Center

^ permalink raw reply	[flat|nested] 81+ messages in thread

* Re: [PATCH 2/5] bitops: compile time optimization for hweight_long(CONSTANT)
  2010-02-04 15:54                       ` Borislav Petkov
@ 2010-02-04 16:04                         ` Peter Zijlstra
  2010-02-05 12:11                           ` Borislav Petkov
  0 siblings, 1 reply; 81+ messages in thread
From: Peter Zijlstra @ 2010-02-04 16:04 UTC (permalink / raw)
  To: Borislav Petkov
  Cc: H. Peter Anvin, Andrew Morton, Wu Fengguang, LKML, Jamie Lokier,
	Roland Dreier, Al Viro, linux-fsdevel@vger.kernel.org,
	Ingo Molnar

On Thu, 2010-02-04 at 16:54 +0100, Borislav Petkov wrote:
> On Thu, Feb 04, 2010 at 04:13:52PM +0100, Peter Zijlstra wrote:
> > hweight_long() isn't an arch primitive, only __arch_hweight{8,16,32,64}
> > are.
> 
> Yeah, I'm still looking for the proper location. hweight_long() is the
> generic version so do we want to do the
> 
> #ifndef __HAVE_ARCH_POPCNT
> static inline unsigned long hweight_long(unsigned long w)
> ....
> 
> #endif
> 
> thing and define a x86-specific version? 

No, just don't touch hweight_long(), simply provide
__arch_hweight{8,16,32,64} and all will be well.

hweight_long() is provided by <linux/bitops.h> and is constructed from
hweight32() and hweight64() depending on the actual word size.


^ permalink raw reply	[flat|nested] 81+ messages in thread

* Re: [PATCH 2/5] bitops: compile time optimization for hweight_long(CONSTANT)
  2010-02-04 16:04                         ` Peter Zijlstra
@ 2010-02-05 12:11                           ` Borislav Petkov
  2010-02-05 12:14                             ` Peter Zijlstra
  2010-02-05 21:54                             ` H. Peter Anvin
  0 siblings, 2 replies; 81+ messages in thread
From: Borislav Petkov @ 2010-02-05 12:11 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: H. Peter Anvin, Andrew Morton, Wu Fengguang, LKML, Jamie Lokier,
	Roland Dreier, Al Viro, linux-fsdevel@vger.kernel.org,
	Ingo Molnar, Brian Gerst

On Thu, Feb 04, 2010 at 05:04:17PM +0100, Peter Zijlstra wrote:
> No, just don't touch hweight_long(), simply provide
> __arch_hweight{8,16,32,64} and all will be well.

Ok, another day, another version :)

It is, of course, completely untested but it builds and the asm looks
ok. I think I've addressed all concerns so far.

--
 arch/x86/include/asm/hweight.h       |   14 ++++++++
 arch/x86/lib/Makefile                |    2 +-
 arch/x86/lib/hweight.c               |   57 ++++++++++++++++++++++++++++++++++
 include/asm-generic/bitops/hweight.h |   45 ++++++++++++++++++++++++--
 lib/hweight.c                        |   16 +++++-----
 5 files changed, 121 insertions(+), 13 deletions(-)
 create mode 100644 arch/x86/include/asm/hweight.h
 create mode 100644 arch/x86/lib/hweight.c

diff --git a/arch/x86/include/asm/hweight.h b/arch/x86/include/asm/hweight.h
new file mode 100644
index 0000000..762125f
--- /dev/null
+++ b/arch/x86/include/asm/hweight.h
@@ -0,0 +1,14 @@
+#ifndef _ASM_X86_HWEIGHT_H
+#define _ASM_X86_HWEIGHT_H
+
+#define __arch_hweight8 __arch_hweight8
+#define __arch_hweight16 __arch_hweight16
+#define __arch_hweight32 __arch_hweight32
+#define __arch_hweight64 __arch_hweight64
+
+extern unsigned int __arch_hweight8(unsigned int);
+extern unsigned int __arch_hweight16(unsigned int);
+extern unsigned int __arch_hweight32(unsigned int);
+extern unsigned long __arch_hweight64(__u64);
+
+#endif /*_ASM_X86_HWEIGHT_H */
diff --git a/arch/x86/lib/Makefile b/arch/x86/lib/Makefile
index cffd754..e811bbd 100644
--- a/arch/x86/lib/Makefile
+++ b/arch/x86/lib/Makefile
@@ -22,7 +22,7 @@ lib-y += usercopy_$(BITS).o getuser.o putuser.o
 lib-y += memcpy_$(BITS).o
 lib-$(CONFIG_KPROBES) += insn.o inat.o
 
-obj-y += msr.o msr-reg.o msr-reg-export.o
+obj-y += msr.o msr-reg.o msr-reg-export.o hweight.o
 
 ifeq ($(CONFIG_X86_32),y)
         obj-y += atomic64_32.o
diff --git a/arch/x86/lib/hweight.c b/arch/x86/lib/hweight.c
new file mode 100644
index 0000000..3cf51c8
--- /dev/null
+++ b/arch/x86/lib/hweight.c
@@ -0,0 +1,57 @@
+#include <linux/kernel.h>
+#include <linux/module.h>
+#include <linux/bitops.h>
+
+#define POPCNT32 ".byte 0xf3\n\t.byte 0x0f\n\t.byte 0xb8\n\t.byte 0xff"
+#define POPCNT64 ".byte 0xf3\n\t.byte 0x48\n\t.byte 0x0f\n\t.byte 0xb8\n\t.byte 0xff"
+
+#define __arch_hweight_alt(size)	\
+	ALTERNATIVE("call _hweight"#size, POPCNT##size, X86_FEATURE_POPCNT)
+
+unsigned int __arch_hweight16(unsigned int w)
+{
+	unsigned int res = 0;
+
+	asm volatile("xor %%dh, %%dh\n\t"
+		     __arch_hweight_alt(32)
+			: "=di" (res)
+			: "di" (w)
+			: "ecx", "memory");
+
+	return res;
+}
+EXPORT_SYMBOL(__arch_hweight16);
+
+unsigned int __arch_hweight8(unsigned int w)
+{
+	return __arch_hweight16(w & 0xff);
+}
+EXPORT_SYMBOL(__arch_hweight8);
+
+unsigned int __arch_hweight32(unsigned int w)
+{
+	unsigned int res = 0;
+
+	asm volatile(__arch_hweight_alt(32)
+			: "=di" (res)
+			: "di" (w)
+			: "ecx", "memory");
+
+	return res;
+
+}
+EXPORT_SYMBOL(__arch_hweight32);
+
+unsigned long __arch_hweight64(__u64 w)
+{
+	unsigned int res = 0;
+
+	asm volatile(__arch_hweight_alt(64)
+			: "=di" (res)
+			: "di" (w)
+			: "rsi", "rcx", "r8", "r9", "r10", "r11",
+			  "memory");
+
+	return res;
+}
+EXPORT_SYMBOL(__arch_hweight64);
diff --git a/include/asm-generic/bitops/hweight.h b/include/asm-generic/bitops/hweight.h
index fbbc383..340ad4e 100644
--- a/include/asm-generic/bitops/hweight.h
+++ b/include/asm-generic/bitops/hweight.h
@@ -2,10 +2,47 @@
 #define _ASM_GENERIC_BITOPS_HWEIGHT_H_
 
 #include <asm/types.h>
+#include <asm/hweight.h>
 
-extern unsigned int hweight32(unsigned int w);
-extern unsigned int hweight16(unsigned int w);
-extern unsigned int hweight8(unsigned int w);
-extern unsigned long hweight64(__u64 w);
+extern unsigned int _hweight32(unsigned int w);
+extern unsigned int _hweight16(unsigned int w);
+extern unsigned int _hweight8(unsigned int w);
+extern unsigned long _hweight64(__u64 w);
+
+static inline unsigned int hweight8(unsigned int w)
+{
+#ifdef __arch_hweight8
+	return __arch_hweight8(w);
+#else
+	return _hweight8(w);
+#endif
+}
+
+static inline unsigned int hweight16(unsigned int w)
+{
+#ifdef __arch_hweight16
+	return __arch_hweight16(w);
+#else
+	return _hweight16(w);
+#endif
+}
+
+static inline unsigned int hweight32(unsigned int w)
+{
+#ifdef __arch_hweight32
+	return __arch_hweight32(w);
+#else
+	return _hweight32(w);
+#endif
+}
+
+static inline unsigned long hweight64(__u64 w)
+{
+#ifdef __arch_hweight64
+	return __arch_hweight64(w);
+#else
+	return _hweight64(w);
+#endif
+}
 
 #endif /* _ASM_GENERIC_BITOPS_HWEIGHT_H_ */
diff --git a/lib/hweight.c b/lib/hweight.c
index 389424e..f7b81a1 100644
--- a/lib/hweight.c
+++ b/lib/hweight.c
@@ -9,7 +9,7 @@
  * The Hamming Weight of a number is the total number of bits set in it.
  */
 
-unsigned int hweight32(unsigned int w)
+unsigned int _hweight32(unsigned int w)
 {
 	unsigned int res = w - ((w >> 1) & 0x55555555);
 	res = (res & 0x33333333) + ((res >> 2) & 0x33333333);
@@ -17,26 +17,26 @@ unsigned int hweight32(unsigned int w)
 	res = res + (res >> 8);
 	return (res + (res >> 16)) & 0x000000FF;
 }
-EXPORT_SYMBOL(hweight32);
+EXPORT_SYMBOL(_hweight32);
 
-unsigned int hweight16(unsigned int w)
+unsigned int _hweight16(unsigned int w)
 {
 	unsigned int res = w - ((w >> 1) & 0x5555);
 	res = (res & 0x3333) + ((res >> 2) & 0x3333);
 	res = (res + (res >> 4)) & 0x0F0F;
 	return (res + (res >> 8)) & 0x00FF;
 }
-EXPORT_SYMBOL(hweight16);
+EXPORT_SYMBOL(_hweight16);
 
-unsigned int hweight8(unsigned int w)
+unsigned int _hweight8(unsigned int w)
 {
 	unsigned int res = w - ((w >> 1) & 0x55);
 	res = (res & 0x33) + ((res >> 2) & 0x33);
 	return (res + (res >> 4)) & 0x0F;
 }
-EXPORT_SYMBOL(hweight8);
+EXPORT_SYMBOL(_hweight8);
 
-unsigned long hweight64(__u64 w)
+unsigned long _hweight64(__u64 w)
 {
 #if BITS_PER_LONG == 32
 	return hweight32((unsigned int)(w >> 32)) + hweight32((unsigned int)w);
@@ -56,4 +56,4 @@ unsigned long hweight64(__u64 w)
 #endif
 #endif
 }
-EXPORT_SYMBOL(hweight64);
+EXPORT_SYMBOL(_hweight64);
-- 
1.6.4.2


-- 
Regards/Gruss,
Boris.

-
Advanced Micro Devices, Inc.
Operating Systems Research Center

^ permalink raw reply related	[flat|nested] 81+ messages in thread

* Re: [PATCH 2/5] bitops: compile time optimization for hweight_long(CONSTANT)
  2010-02-05 12:11                           ` Borislav Petkov
@ 2010-02-05 12:14                             ` Peter Zijlstra
  2010-02-05 21:54                             ` H. Peter Anvin
  1 sibling, 0 replies; 81+ messages in thread
From: Peter Zijlstra @ 2010-02-05 12:14 UTC (permalink / raw)
  To: Borislav Petkov
  Cc: H. Peter Anvin, Andrew Morton, Wu Fengguang, LKML, Jamie Lokier,
	Roland Dreier, Al Viro, linux-fsdevel@vger.kernel.org,
	Ingo Molnar, Brian Gerst

On Fri, 2010-02-05 at 13:11 +0100, Borislav Petkov wrote:
> On Thu, Feb 04, 2010 at 05:04:17PM +0100, Peter Zijlstra wrote:
> > No, just don't touch hweight_long(), simply provide
> > __arch_hweight{8,16,32,64} and all will be well.
> 
> Ok, another day, another version :)
> 
> It is, of course, completely untested but it builds and the asm looks
> ok. I think I've addressed all concerns so far.

please work against a tree that has:

  http://lkml.org/lkml/2010/2/4/119

in.

^ permalink raw reply	[flat|nested] 81+ messages in thread

* Re: [PATCH 2/5] bitops: compile time optimization for hweight_long(CONSTANT)
  2010-02-05 12:11                           ` Borislav Petkov
  2010-02-05 12:14                             ` Peter Zijlstra
@ 2010-02-05 21:54                             ` H. Peter Anvin
  2010-02-06  9:36                               ` Borislav Petkov
  1 sibling, 1 reply; 81+ messages in thread
From: H. Peter Anvin @ 2010-02-05 21:54 UTC (permalink / raw)
  To: Borislav Petkov
  Cc: Peter Zijlstra, Andrew Morton, Wu Fengguang, LKML, Jamie Lokier,
	Roland Dreier, Al Viro, linux-fsdevel@vger.kernel.org,
	Ingo Molnar, Brian Gerst

On 02/05/2010 04:11 AM, Borislav Petkov wrote:
> +
> +unsigned int __arch_hweight16(unsigned int w)
> +{
> +	unsigned int res = 0;
> +
> +	asm volatile("xor %%dh, %%dh\n\t"
> +		     __arch_hweight_alt(32)
> +			: "=di" (res)
> +			: "di" (w)
> +			: "ecx", "memory");
> +

This is wrong in more ways than I can shake a stick at.

a) "di" doesn't mean the DI register - it means the DX register (d) or
an immediate (i).  Since you don't have any reference to either %0 or %1
in your code, you have no way of knowing which one it is.  The
constraint for the di register is "D".

b) On 32 bits, the first argument register is in %eax (with %edx used
for the upper half of a 32-bit argument), but on 64 bits, the first
argument is in %rdi, with the return still in %rax.

c) You call a C function, but you don't clobber the set of registers
that a C function would clobber.  You either need to put the function in
an assembly wrapper (which is better in the long run), or clobber the
full set of registers that is clobbered by a C function (which is better
in the short term) -- which is eax, edx, ecx on 32 bits, but rax, rdi,
esi, rdx, rcx, r8, r9, r10, r11 on 64 bits.

d) On the other hand, you do *not* need a "memory" clobber.
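
To illustrate (a) and (b) concretely, a minimal sketch (not a
replacement for the patch; it assumes an assembler new enough to know
the popcnt mnemonic):

	/*
	 * "D" really is the %rdi/%edi register class and "=a" is %rax/%eax;
	 * using %0/%1 ties the template to whatever the constraints actually
	 * reserved instead of hard-coding registers the compiler never
	 * promised.
	 */
	static inline unsigned int popcnt32_demo(unsigned int w)
	{
		unsigned int res;

		asm ("popcnt %1, %0" : "=a" (res) : "D" (w));
		return res;
	}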

	-hpa

^ permalink raw reply	[flat|nested] 81+ messages in thread

* Re: [PATCH 2/5] bitops: compile time optimization for hweight_long(CONSTANT)
  2010-02-05 21:54                             ` H. Peter Anvin
@ 2010-02-06  9:36                               ` Borislav Petkov
  2010-02-07  1:55                                 ` H. Peter Anvin
  0 siblings, 1 reply; 81+ messages in thread
From: Borislav Petkov @ 2010-02-06  9:36 UTC (permalink / raw)
  To: H. Peter Anvin
  Cc: Peter Zijlstra, Andrew Morton, Wu Fengguang, LKML, Jamie Lokier,
	Roland Dreier, Al Viro, linux-fsdevel@vger.kernel.org,
	Ingo Molnar, Brian Gerst

On Fri, Feb 05, 2010 at 01:54:42PM -0800, H. Peter Anvin wrote:
> On 02/05/2010 04:11 AM, Borislav Petkov wrote:
> > +
> > +unsigned int __arch_hweight16(unsigned int w)
> > +{
> > +	unsigned int res = 0;
> > +
> > +	asm volatile("xor %%dh, %%dh\n\t"
> > +		     __arch_hweight_alt(32)
> > +			: "=di" (res)
> > +			: "di" (w)
> > +			: "ecx", "memory");
> > +
> 
> This is wrong in more ways than I can shake a stick at.

Thanks for reviewing it though - how else would I learn :).

> a) "di" doesn't mean the DI register - it means the DX register (d) or
> an immediate (i).  Since you don't have any reference to either %0 or %1
> in your code, you have no way of knowing which one it is.  The
> constraint for the di register is "D".

right.

> b) On 32 bits, the first argument register is in %eax (with %edx used
> for the upper half of a 32-bit argument), but on 64 bits, the first
> argument is in %rdi, with the return still in %rax.

Sure, it is right there in arch/x86/include/asm/calling.h. Shame on me.

> c) You call a C function, but you don't clobber the set of registers
> that a C function would clobber.  You either need to put the function in
> an assembly wrapper (which is better in the long run), or clobber the
> full set of registers that is clobbered by a C function (which is better
> in the short term) -- which is eax, edx, ecx on 32 bits, but rax, rdi,
> esi, rdx, rcx, r8, r9, r10, r11 on 64 bits.

I think you mean rsi instead of esi here.

Well, the example Brian pointed me to - __mutex_fastpath_lock - lists
the full set of clobbered registers. Please elaborate on the assembly
wrapper for the function, wouldn't I need to list all the clobbered
registers there too or am I missing something?

> d) On the other hand, you do *not* need a "memory" clobber.

Right, in this case we have all non-barrier-like inlines, so no memory
clobber, according to the comment above the alternative() macro.

Thanks.

-- 
Regards/Gruss,
Boris.

-
Advanced Micro Devices, Inc.
Operating Systems Research Center

^ permalink raw reply	[flat|nested] 81+ messages in thread

* Re: [PATCH 2/5] bitops: compile time optimization for hweight_long(CONSTANT)
  2010-02-06  9:36                               ` Borislav Petkov
@ 2010-02-07  1:55                                 ` H. Peter Anvin
  2010-02-08  9:28                                   ` Borislav Petkov
  0 siblings, 1 reply; 81+ messages in thread
From: H. Peter Anvin @ 2010-02-07  1:55 UTC (permalink / raw)
  To: Borislav Petkov
  Cc: Peter Zijlstra, Andrew Morton, Wu Fengguang, LKML, Jamie Lokier,
	Roland Dreier, Al Viro, linux-fsdevel@vger.kernel.org,
	Ingo Molnar, Brian Gerst

On 02/06/2010 01:36 AM, Borislav Petkov wrote:
> 
>> c) You call a C function, but you don't clobber the set of registers
>> that a C function would clobber.  You either need to put the function in
>> an assembly wrapper (which is better in the long run), or clobber the
>> full set of registers that is clobbered by a C function (which is better
>> in the short term) -- which is eax, edx, ecx on 32 bits, but rax, rdi,
>> esi, rdx, rcx, r8, r9, r10, r11 on 64 bits.
> 
> I think you mean rsi instead of esi here.
> 
> Well, the example Brian pointed me to - __mutex_fastpath_lock - lists
> the full set of clobbered registers. Please elaborate on the assembly
> wrapper for the function, wouldn't I need to list all the clobbered
> registers there too or am I missing something?
> 

The notion there would be that you do push/pop in the assembly wrapper.

>> d) On the other hand, you do *not* need a "memory" clobber.
> 
> Right, in this case we have all non-barrier like inlines so no memory
> clobber, according to the comment above alternative() macro.

OK, I'm missing something here.

A few more notions:

a. This is exactly the kind of code where you don't want to put
"volatile" on your asm statement, because it's a pure compute.

b. It is really rather pointless to go through the whole alternatives
work if you are then going to put it inside a function which isn't an
inline ...
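
As a rough sketch of (a) and (b), hypothetical and untested: the
alternative lives in a static inline in a header so every call site gets
patched, the asm statement is not volatile, and the out-of-line fallback
(named __sw_hweight32 here) is assumed to be built so that it clobbers
nothing but the return register; otherwise explicit clobbers would be
needed:

static inline unsigned int __arch_hweight32(unsigned int w)
{
	unsigned int res;

	/*
	 * POPCNT32 stands in for the popcnt opcode bytes. No "volatile":
	 * this is a pure computation. The fallback is assumed to
	 * preserve every register except the return register.
	 */
	asm (ALTERNATIVE("call __sw_hweight32", POPCNT32, X86_FEATURE_POPCNT)
	     : "=a" (res)
	     : "D" (w));

	return res;
}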

	-hpa

-- 
H. Peter Anvin, Intel Open Source Technology Center
I work for Intel.  I don't speak on their behalf.

^ permalink raw reply	[flat|nested] 81+ messages in thread

* Re: [PATCH 2/5] bitops: compile time optimization for hweight_long(CONSTANT)
  2010-02-07  1:55                                 ` H. Peter Anvin
@ 2010-02-08  9:28                                   ` Borislav Petkov
  2010-02-08  9:35                                     ` H. Peter Anvin
  0 siblings, 1 reply; 81+ messages in thread
From: Borislav Petkov @ 2010-02-08  9:28 UTC (permalink / raw)
  To: H. Peter Anvin
  Cc: Borislav Petkov, Peter Zijlstra, Andrew Morton, Wu Fengguang,
	LKML, Jamie Lokier, Roland Dreier, Al Viro,
	linux-fsdevel@vger.kernel.org, Ingo Molnar, Brian Gerst

On Sat, Feb 06, 2010 at 05:55:47PM -0800, H. Peter Anvin wrote:
> > Well, the example Brian pointed me to - __mutex_fastpath_lock - lists
> > the full set of clobbered registers. Please elaborate on the assembly
> > wrapper for the function, wouldn't I need to list all the clobbered
> > registers there too or am I missing something?
> > 
> 
> The notion there would be that you do push/pop in the assembly wrapper.

Oh yes, something similar to SAVE/RESTORE_ALL in
<arch/x86/kernel/entry_32.S> could work. Good idea!
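
For illustration only, a sketch of such a wrapper (hypothetical,
untested, x86-64 only), written as file-scope inline asm: it saves and
restores the registers a C function may clobber around the call, so the
inline-asm call site only has to describe the argument and return
registers. __sw_hweight32 is assumed to be the C fallback:

/*
 * %rax returns the result and %rdi carries the argument (and may still
 * be modified); everything else that a C function may clobber is saved
 * here. The seven pushes also keep the stack suitably aligned for the
 * inner call.
 */
asm(".pushsection .text\n"
    ".globl __hweight32_wrapper\n"
    "__hweight32_wrapper:\n"
    "	pushq %rsi\n"
    "	pushq %rdx\n"
    "	pushq %rcx\n"
    "	pushq %r8\n"
    "	pushq %r9\n"
    "	pushq %r10\n"
    "	pushq %r11\n"
    "	call __sw_hweight32\n"
    "	popq %r11\n"
    "	popq %r10\n"
    "	popq %r9\n"
    "	popq %r8\n"
    "	popq %rcx\n"
    "	popq %rdx\n"
    "	popq %rsi\n"
    "	ret\n"
    ".popsection");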

> >> d) On the other hand, you do *not* need a "memory" clobber.
> > 
> > Right, in this case we have all non-barrier like inlines so no memory
> > clobber, according to the comment above alternative() macro.
> 
> OK, I'm missing something here.
> 
> A few more notions:
> 
> a. This is exactly the kind of code where you don't want to put
> "volatile" on your asm statement, because it's a pure compute.
> 
> b. It is really rather pointless to go through the whole alternatives
> work if you are then going to put it inside a function which isn't an
> inline ...

Well, in the second version I did replace a 'call _hweightXX' with
the actual popcnt opcode, so the alternative is only needed to do the
replacement during boot. We might just as well do

if (X86_FEATURE_POPCNT)
	__hw_popcnt()
else
	__software_hweight()

The only advantage of the alternatives is that they would save us the
if-else test above each time we do cpumask_weight. However, the if-else
approach is much more readable and obviates the need for all that macro
magic and taking special care when calling a C function from within
asm. And since we do not call cpumask_weight all that often, I'll
honestly opt for the alternative-less solution...
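
A sketch of that alternative-less variant, for illustration only
(hypothetical; __hw_popcnt and __software_hweight are placeholders for a
popcnt-based helper and the lib/hweight.c version):

static inline unsigned int hweight32_runtime(unsigned int w)
{
	/* plain runtime feature test instead of boot-time patching */
	if (boot_cpu_has(X86_FEATURE_POPCNT))
		return __hw_popcnt(w);

	return __software_hweight(w);
}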

Hmm...

Thanks,
Boris.

^ permalink raw reply	[flat|nested] 81+ messages in thread

* Re: [PATCH 2/5] bitops: compile time optimization for hweight_long(CONSTANT)
  2010-02-08  9:28                                   ` Borislav Petkov
@ 2010-02-08  9:35                                     ` H. Peter Anvin
  2010-02-08  9:59                                       ` Borislav Petkov
  0 siblings, 1 reply; 81+ messages in thread
From: H. Peter Anvin @ 2010-02-08  9:35 UTC (permalink / raw)
  To: Borislav Petkov, Borislav Petkov, Peter Zijlstra, Andrew Morton,
	Wu Fengguang <fengguang.w

On 02/08/2010 01:28 AM, Borislav Petkov wrote:
>
> Well, in the second version I did replace a 'call _hweightXX' with
> the actual popcnt opcode so the alternatives is only needed to do the
> replacement during boot. We might just as well do
>
> if (X86_FEATURE_POPCNT)
> 	__hw_popcnt()
> else
> 	__software_hweight()
>
> The only advantage of the alternatives is that it would save us the
> if-else test above each time we do cpumask_weight. However, the if-else
> approach is much more readable and obviates the need for all that macro
> magic and taking special care of calling c function from within asm. And
> since we do not call cpumask_weight all that often I'll honestly opt for
> alternative-less solution...
>

The highest performance will be gotten by alternatives, but it only makes
sense if they are inlined at the point of use... otherwise it's 
basically pointless.

	-hpa

^ permalink raw reply	[flat|nested] 81+ messages in thread

* Re: [PATCH 2/5] bitops: compile time optimization for hweight_long(CONSTANT)
  2010-02-08  9:35                                     ` H. Peter Anvin
@ 2010-02-08  9:59                                       ` Borislav Petkov
  2010-02-11 17:24                                         ` Borislav Petkov
  0 siblings, 1 reply; 81+ messages in thread
From: Borislav Petkov @ 2010-02-08  9:59 UTC (permalink / raw)
  To: H. Peter Anvin
  Cc: Borislav Petkov, Peter Zijlstra, Andrew Morton, Wu Fengguang,
	LKML, Jamie Lokier, Roland Dreier, Al Viro,
	linux-fsdevel@vger.kernel.org, Ingo Molnar, Brian Gerst

On Mon, Feb 08, 2010 at 01:35:41AM -0800, H. Peter Anvin wrote:
> On 02/08/2010 01:28 AM, Borislav Petkov wrote:
> >
> >Well, in the second version I did replace a 'call _hweightXX' with
> >the actual popcnt opcode so the alternatives is only needed to do the
> >replacement during boot. We might just as well do
> >
> >if (X86_FEATURE_POPCNT)
> >	__hw_popcnt()
> >else
> >	__software_hweight()
> >
> >The only advantage of the alternatives is that it would save us the
> >if-else test above each time we do cpumask_weight. However, the if-else
> >approach is much more readable and obviates the need for all that macro
> >magic and taking special care of calling c function from within asm. And
> >since we do not call cpumask_weight all that often I'll honestly opt for
> >alternative-less solution...
> >
> 
> The highest performance will be gotten by alternatives, but it only
> make sense if they are inlined at the point of use... otherwise it's
> basically pointless.

The popcnt-replacement part of the alternative would be as fast as
possible since we're adding the opcode there but the slow version would
add the additional overhead of saving/restoring the registers before
calling the software hweight implementation. I'll do some tracing to see
what a change like that would cost on machines which don't have popcnt.

Let me prep another version when I get back on Wed. (currently
travelling) with all the stuff we discussed to see how it would turn out.

Thanks,
Boris.

^ permalink raw reply	[flat|nested] 81+ messages in thread

* Re: [PATCH 2/5] bitops: compile time optimization for hweight_long(CONSTANT)
  2010-02-08  9:59                                       ` Borislav Petkov
@ 2010-02-11 17:24                                         ` Borislav Petkov
  2010-02-11 17:33                                           ` H. Peter Anvin
  2010-02-14 10:12                                           ` Peter Zijlstra
  0 siblings, 2 replies; 81+ messages in thread
From: Borislav Petkov @ 2010-02-11 17:24 UTC (permalink / raw)
  To: H. Peter Anvin
  Cc: Borislav Petkov, Peter Zijlstra, Andrew Morton, Wu Fengguang,
	LKML, Jamie Lokier, Roland Dreier, Al Viro,
	linux-fsdevel@vger.kernel.org, Ingo Molnar, Brian Gerst

On Mon, Feb 08, 2010 at 10:59:45AM +0100, Borislav Petkov wrote:
> Let me prep another version when I get back on Wed. (currently
> travelling) with all the stuff we discussed to see how it would turn.

Ok, here's another version on top of PeterZ's patch at
http://lkml.org/lkml/2010/2/4/119. I need to handle 32- and 64-bit
differently wrt the popcnt opcode, so on 32-bit I do "popcnt %eax, %eax"
while on 64-bit I do "popcnt %rdi, %rdi".

I also did some rudimentary tracing with the function graph tracer of
all the cpumask_weight-calls in <kernel/sched.c> while doing a kernel
compile, and the preliminary results show that hweight in software
takes at most about 9.768 usecs while the hardware popcnt takes about
8.515 usecs. The machine is a Fam10 revB2 quadcore.

What remains to be done is to see whether the saving/restoring of
callee-clobbered regs with this patch has any noticeable negative
effects on the software hweight case on machines which don't support
popcnt. Also, I'm open for better tracing ideas :).

Thanks.

---
 arch/alpha/include/asm/bitops.h            |    4 +
 arch/ia64/include/asm/bitops.h             |    1 +
 arch/sparc/include/asm/bitops_64.h         |    4 +
 arch/x86/include/asm/bitops.h              |   10 +++
 arch/x86/lib/Makefile                      |    2 +-
 arch/x86/lib/hweight.c                     |   96 ++++++++++++++++++++++++++++
 include/asm-generic/bitops/arch_hweight.h  |    8 +-
 include/asm-generic/bitops/const_hweight.h |   43 +++++++++++-
 lib/hweight.c                              |   20 +++---
 9 files changed, 169 insertions(+), 19 deletions(-)
 create mode 100644 arch/x86/lib/hweight.c

diff --git a/arch/alpha/include/asm/bitops.h b/arch/alpha/include/asm/bitops.h
index 296da1d..7fe6943 100644
--- a/arch/alpha/include/asm/bitops.h
+++ b/arch/alpha/include/asm/bitops.h
@@ -409,21 +409,25 @@ static inline unsigned long __arch_hweight64(unsigned long w)
 {
 	return __kernel_ctpop(w);
 }
+#define __arch_hweight64 __arch_hweight64
 
 static inline unsigned int __arch_weight32(unsigned int w)
 {
 	return __arch_hweight64(w);
 }
+#define __arch_hweight32 __arch_hweight32
 
 static inline unsigned int __arch_hweight16(unsigned int w)
 {
 	return __arch_hweight64(w & 0xffff);
 }
+#define __arch_hweight16 __arch_hweight16
 
 static inline unsigned int __arch_hweight8(unsigned int w)
 {
 	return __arch_hweight64(w & 0xff);
 }
+#define __arch_hweight8 __arch_hweight8
 #else
 #include <asm-generic/bitops/arch_hweight.h>
 #endif
diff --git a/arch/ia64/include/asm/bitops.h b/arch/ia64/include/asm/bitops.h
index 9da3df6..9b64f59 100644
--- a/arch/ia64/include/asm/bitops.h
+++ b/arch/ia64/include/asm/bitops.h
@@ -443,6 +443,7 @@ static __inline__ unsigned long __arch_hweight64(unsigned long x)
 	result = ia64_popcnt(x);
 	return result;
 }
+#define __arch_hweight64 __arch_hweight64
 
 #define __arch_hweight32(x) ((unsigned int) __arch_hweight64((x) & 0xfffffffful))
 #define __arch_hweight16(x) ((unsigned int) __arch_hweight64((x) & 0xfffful))
diff --git a/arch/sparc/include/asm/bitops_64.h b/arch/sparc/include/asm/bitops_64.h
index 766121a..4982bc4 100644
--- a/arch/sparc/include/asm/bitops_64.h
+++ b/arch/sparc/include/asm/bitops_64.h
@@ -51,6 +51,7 @@ static inline unsigned int __arch_hweight64(unsigned long w)
 	__asm__ ("popc %1,%0" : "=r" (res) : "r" (w));
 	return res;
 }
+#define __arch_hweight64 __arch_hweight64
 
 static inline unsigned int __arch_hweight32(unsigned int w)
 {
@@ -59,6 +60,7 @@ static inline unsigned int __arch_hweight32(unsigned int w)
 	__asm__ ("popc %1,%0" : "=r" (res) : "r" (w & 0xffffffff));
 	return res;
 }
+#define __arch_hweight32 __arch_hweight32
 
 static inline unsigned int __arch_hweight16(unsigned int w)
 {
@@ -67,6 +69,7 @@ static inline unsigned int __arch_hweight16(unsigned int w)
 	__asm__ ("popc %1,%0" : "=r" (res) : "r" (w & 0xffff));
 	return res;
 }
+#define __arch_hweight16 __arch_hweight16
 
 static inline unsigned int __arch_hweight8(unsigned int w)
 {
@@ -75,6 +78,7 @@ static inline unsigned int __arch_hweight8(unsigned int w)
 	__asm__ ("popc %1,%0" : "=r" (res) : "r" (w & 0xff));
 	return res;
 }
+#define __arch_hweight8 __arch_hweight8
 
 #else
 
diff --git a/arch/x86/include/asm/bitops.h b/arch/x86/include/asm/bitops.h
index 02b47a6..187defe 100644
--- a/arch/x86/include/asm/bitops.h
+++ b/arch/x86/include/asm/bitops.h
@@ -444,6 +444,16 @@ static inline int fls(int x)
 
 #define ARCH_HAS_FAST_MULTIPLIER 1
 
+extern unsigned int __arch_hweight32(unsigned int w);
+extern unsigned int __arch_hweight16(unsigned int w);
+extern unsigned int __arch_hweight8(unsigned int w);
+extern unsigned long __arch_hweight64(__u64 w);
+
+#define __arch_hweight32 __arch_hweight32
+#define __arch_hweight16 __arch_hweight16
+#define __arch_hweight8 __arch_hweight8
+#define __arch_hweight64 __arch_hweight64
+
 #include <asm-generic/bitops/hweight.h>
 
 #endif /* __KERNEL__ */
diff --git a/arch/x86/lib/Makefile b/arch/x86/lib/Makefile
index cffd754..e811bbd 100644
--- a/arch/x86/lib/Makefile
+++ b/arch/x86/lib/Makefile
@@ -22,7 +22,7 @@ lib-y += usercopy_$(BITS).o getuser.o putuser.o
 lib-y += memcpy_$(BITS).o
 lib-$(CONFIG_KPROBES) += insn.o inat.o
 
-obj-y += msr.o msr-reg.o msr-reg-export.o
+obj-y += msr.o msr-reg.o msr-reg-export.o hweight.o
 
 ifeq ($(CONFIG_X86_32),y)
         obj-y += atomic64_32.o
diff --git a/arch/x86/lib/hweight.c b/arch/x86/lib/hweight.c
new file mode 100644
index 0000000..4a1c5c9
--- /dev/null
+++ b/arch/x86/lib/hweight.c
@@ -0,0 +1,96 @@
+#include <linux/kernel.h>
+#include <linux/module.h>
+#include <linux/bitops.h>
+
+/* popcnt %eax, %eax */
+#define POPCNT32 ".byte 0xf3\n\t.byte 0x0f\n\t.byte 0xb8\n\t.byte 0xc0"
+
+/* popcnt %rdi, %rdi */
+#define POPCNT64 ".byte 0xf3\n\t.byte 0x48\n\t.byte 0x0f\n\t.byte 0xb8\n\t.byte 0xff"
+
+#define PUSH_CLOBBERED_32	\
+	"pushl %%edx\n\t"	\
+	"pushl %%ecx\n\t"
+
+#define POP_CLOBBERED_32	\
+	"\n\t"			\
+	"popl %%ecx\n\t"	\
+	"popl %%edx\n\t"
+
+#define PUSH_CLOBBERED_64	\
+	"pushq %%rax\n\t"	\
+	"pushq %%rsi\n\t"	\
+	"pushq %%rdx\n\t"	\
+	"pushq %%rcx\n\t"	\
+	"pushq %%r8\n\t"	\
+	"pushq %%r9\n\t"	\
+	"pushq %%r10\n\t"	\
+	"pushq %%r11\n\t"
+
+#define POP_CLOBBERED_64	\
+	"\n\t"			\
+	"popq %%r11\n\t"	\
+	"popq %%r10\n\t"	\
+	"popq %%r9\n\t"		\
+	"popq %%r8\n\t"		\
+	"popq %%rcx\n\t"	\
+	"popq %%rdx\n\t"	\
+	"popq %%rsi\n\t"	\
+	"popq %%rax\n\t"
+
+#ifdef CONFIG_64BIT
+#define PUSH_CLOBBERED PUSH_CLOBBERED_64
+#define POP_CLOBBERED  POP_CLOBBERED_64
+#define POPCNT POPCNT64
+#define ARG0 "D"
+#else
+#define PUSH_CLOBBERED PUSH_CLOBBERED_32
+#define POP_CLOBBERED  POP_CLOBBERED_32
+#define POPCNT POPCNT32
+#define ARG0 "a"
+#endif
+
+unsigned int __arch_hweight32(unsigned int w)
+{
+	unsigned int res = 0;
+
+	asm volatile(PUSH_CLOBBERED
+		     ALTERNATIVE("call __sw_hweight32", POPCNT, X86_FEATURE_POPCNT)
+		     POP_CLOBBERED
+		     : "="ARG0 (res)
+		     : ARG0 (w));
+
+	return res;
+}
+EXPORT_SYMBOL(__arch_hweight32);
+
+unsigned int __arch_hweight16(unsigned int w)
+{
+	return __arch_hweight32(w & 0xffff);
+}
+EXPORT_SYMBOL(__arch_hweight16);
+
+unsigned int __arch_hweight8(unsigned int w)
+{
+	return __arch_hweight32(w & 0xff);
+}
+EXPORT_SYMBOL(__arch_hweight8);
+
+unsigned long __arch_hweight64(__u64 w)
+{
+	unsigned long res = 0;
+
+#ifdef CONFIG_X86_32
+	return  __arch_hweight32((u32)w) +
+		__arch_hweight32((u32)(w >> 32));
+#else
+	asm volatile(PUSH_CLOBBERED
+		     ALTERNATIVE("call __sw_hweight64", POPCNT, X86_FEATURE_POPCNT)
+		      POP_CLOBBERED
+		     : "="ARG0 (res)
+		     : ARG0 (w));
+#endif /* CONFIG_X86_32 */
+
+	return res;
+}
+EXPORT_SYMBOL(__arch_hweight64);
diff --git a/include/asm-generic/bitops/arch_hweight.h b/include/asm-generic/bitops/arch_hweight.h
index 3a7be84..c72de8b 100644
--- a/include/asm-generic/bitops/arch_hweight.h
+++ b/include/asm-generic/bitops/arch_hweight.h
@@ -3,9 +3,9 @@
 
 #include <asm/types.h>
 
-extern unsigned int __arch_hweight32(unsigned int w);
-extern unsigned int __arch_hweight16(unsigned int w);
-extern unsigned int __arch_hweight8(unsigned int w);
-extern unsigned long __arch_hweight64(__u64 w);
+extern unsigned int __sw_hweight32(unsigned int w);
+extern unsigned int __sw_hweight16(unsigned int w);
+extern unsigned int __sw_hweight8(unsigned int w);
+extern unsigned long __sw_hweight64(__u64 w);
 
 #endif /* _ASM_GENERIC_BITOPS_HWEIGHT_H_ */
diff --git a/include/asm-generic/bitops/const_hweight.h b/include/asm-generic/bitops/const_hweight.h
index fa2a50b..8cf8169 100644
--- a/include/asm-generic/bitops/const_hweight.h
+++ b/include/asm-generic/bitops/const_hweight.h
@@ -18,13 +18,48 @@
 #define __const_hweight32(w) (__const_hweight16(w) + __const_hweight16((w) >> 16))
 #define __const_hweight64(w) (__const_hweight32(w) + __const_hweight32((w) >> 32))
 
+static inline unsigned int _hweight8(unsigned int w)
+{
+#ifdef __arch_hweight8
+       return __arch_hweight8(w);
+#else
+       return __sw_hweight8(w);
+#endif
+}
+
+static inline unsigned int _hweight16(unsigned int w)
+{
+#ifdef __arch_hweight16
+       return __arch_hweight16(w);
+#else
+       return __sw_hweight16(w);
+#endif
+}
+
+static inline unsigned int _hweight32(unsigned int w)
+{
+#ifdef __arch_hweight32
+       return __arch_hweight32(w);
+#else
+       return __sw_hweight32(w);
+#endif
+}
+
+static inline unsigned long _hweight64(__u64 w)
+{
+#ifdef __arch_hweight64
+       return __arch_hweight64(w);
+#else
+       return __sw_hweight64(w);
+#endif
+}
 /*
  * Generic interface.
  */
-#define hweight8(w)  (__builtin_constant_p(w) ? __const_hweight8(w)  : __arch_hweight8(w))
-#define hweight16(w) (__builtin_constant_p(w) ? __const_hweight16(w) : __arch_hweight16(w))
-#define hweight32(w) (__builtin_constant_p(w) ? __const_hweight32(w) : __arch_hweight32(w))
-#define hweight64(w) (__builtin_constant_p(w) ? __const_hweight64(w) : __arch_hweight64(w))
+#define hweight8(w)  (__builtin_constant_p(w) ? __const_hweight8(w)  : _hweight8(w))
+#define hweight16(w) (__builtin_constant_p(w) ? __const_hweight16(w) : _hweight16(w))
+#define hweight32(w) (__builtin_constant_p(w) ? __const_hweight32(w) : _hweight32(w))
+#define hweight64(w) (__builtin_constant_p(w) ? __const_hweight64(w) : _hweight64(w))
 
 /*
  * Interface for known constant arguments
diff --git a/lib/hweight.c b/lib/hweight.c
index 9ff86df..f9ce440 100644
--- a/lib/hweight.c
+++ b/lib/hweight.c
@@ -9,7 +9,7 @@
  * The Hamming Weight of a number is the total number of bits set in it.
  */
 
-unsigned int __arch_hweight32(unsigned int w)
+unsigned int __sw_hweight32(unsigned int w)
 {
 	unsigned int res = w - ((w >> 1) & 0x55555555);
 	res = (res & 0x33333333) + ((res >> 2) & 0x33333333);
@@ -17,30 +17,30 @@ unsigned int __arch_hweight32(unsigned int w)
 	res = res + (res >> 8);
 	return (res + (res >> 16)) & 0x000000FF;
 }
-EXPORT_SYMBOL(__arch_hweight32);
+EXPORT_SYMBOL(__sw_hweight32);
 
-unsigned int __arch_hweight16(unsigned int w)
+unsigned int __sw_hweight16(unsigned int w)
 {
 	unsigned int res = w - ((w >> 1) & 0x5555);
 	res = (res & 0x3333) + ((res >> 2) & 0x3333);
 	res = (res + (res >> 4)) & 0x0F0F;
 	return (res + (res >> 8)) & 0x00FF;
 }
-EXPORT_SYMBOL(__arch_hweight16);
+EXPORT_SYMBOL(__sw_hweight16);
 
-unsigned int __arch_hweight8(unsigned int w)
+unsigned int __sw_hweight8(unsigned int w)
 {
 	unsigned int res = w - ((w >> 1) & 0x55);
 	res = (res & 0x33) + ((res >> 2) & 0x33);
 	return (res + (res >> 4)) & 0x0F;
 }
-EXPORT_SYMBOL(__arch_hweight8);
+EXPORT_SYMBOL(__sw_hweight8);
 
-unsigned long __arch_hweight64(__u64 w)
+unsigned long __sw_hweight64(__u64 w)
 {
 #if BITS_PER_LONG == 32
-	return __arch_hweight32((unsigned int)(w >> 32)) +
-	       __arch_hweight32((unsigned int)w);
+	return __sw_hweight32((unsigned int)(w >> 32)) +
+	       __sw_hweight32((unsigned int)w);
 #elif BITS_PER_LONG == 64
 #ifdef ARCH_HAS_FAST_MULTIPLIER
 	w -= (w >> 1) & 0x5555555555555555ul;
@@ -57,4 +57,4 @@ unsigned long __arch_hweight64(__u64 w)
 #endif
 #endif
 }
-EXPORT_SYMBOL(__arch_hweight64);
+EXPORT_SYMBOL(__sw_hweight64);
-- 
1.6.6.1


-- 
Regards/Gruss,
Boris.

--
Advanced Micro Devices, Inc.
Operating Systems Research Center

^ permalink raw reply related	[flat|nested] 81+ messages in thread

* Re: [PATCH 2/5] bitops: compile time optimization for hweight_long(CONSTANT)
  2010-02-11 17:24                                         ` Borislav Petkov
@ 2010-02-11 17:33                                           ` H. Peter Anvin
  2010-02-12 17:06                                             ` Borislav Petkov
  2010-02-14 10:12                                           ` Peter Zijlstra
  1 sibling, 1 reply; 81+ messages in thread
From: H. Peter Anvin @ 2010-02-11 17:33 UTC (permalink / raw)
  To: Borislav Petkov
  Cc: Borislav Petkov, Peter Zijlstra, Andrew Morton, Wu Fengguang,
	LKML, Jamie Lokier, Roland Dreier, Al Viro,
	linux-fsdevel@vger.kernel.org, Ingo Molnar, Brian Gerst

On 02/11/2010 09:24 AM, Borislav Petkov wrote:
> On Mon, Feb 08, 2010 at 10:59:45AM +0100, Borislav Petkov wrote:
>> Let me prep another version when I get back on Wed. (currently
>> travelling) with all the stuff we discussed to see how it would turn.
> 
> Ok, here's another version ontop of PeterZ's patch at
> http://lkml.org/lkml/2010/2/4/119. I need to handle 32- and 64-bit
> differently wrt to popcnt opcode so on 32-bit I do "popcnt %eax, %eax"
> while on 64-bit I do "popcnt %rdi, %rdi".

On 64 bits it should be "popcnt %rdi, %rax".

> I also did some rudimentary tracing with the function graph tracer of
> all the cpumask_weight-calls in <kernel/sched.c> while doing a kernel
> compile and the preliminary results show that hweight in software takes
> about 9.768 usecs the longest while the hardware popcnt about 8.515
> usecs. The machine is a Fam10 revB2 quadcore.
> 
> What remains to be done is see whether the saving/restoring of
> callee-clobbered regs with this patch has any noticeable negative
> effects on the software hweight case on machines which don't support
> popcnt. Also, I'm open for better tracing ideas :).
> 
> +	asm volatile(PUSH_CLOBBERED
> +		     ALTERNATIVE("call __sw_hweight64", POPCNT, X86_FEATURE_POPCNT)
> +		      POP_CLOBBERED
> +		     : "="ARG0 (res)
> +		     : ARG0 (w));


Sorry, no.

You don't do the push/pop inline -- if you're going to take the hit of
pushing this into the caller, it's better to list them as explicit
clobbers and let the compiler figure out how to do it.  The point of
doing an explicit push/pop is that it can be pushed into the out-of-line
subroutine.

Furthermore, you're still putting "volatile" on there... this is a pure
computation -- no side effects -- so it is exactly when you *shouldn't*
declare your asm statement volatile.

Note: given how simple and regular a popcnt actually is, it might be
preferable to have the out-of-line implementation either in assembly,
or using gcc's -fcall-saved-* options to reduce the number of registers
that is clobbered by the routine.

	-hpa

-- 
H. Peter Anvin, Intel Open Source Technology Center
I work for Intel.  I don't speak on their behalf.


^ permalink raw reply	[flat|nested] 81+ messages in thread

* Re: [PATCH 2/5] bitops: compile time optimization for hweight_long(CONSTANT)
  2010-02-11 17:33                                           ` H. Peter Anvin
@ 2010-02-12 17:06                                             ` Borislav Petkov
  2010-02-12 17:28                                               ` H. Peter Anvin
  0 siblings, 1 reply; 81+ messages in thread
From: Borislav Petkov @ 2010-02-12 17:06 UTC (permalink / raw)
  To: H. Peter Anvin
  Cc: Borislav Petkov, Peter Zijlstra, Andrew Morton, Wu Fengguang,
	LKML, Jamie Lokier, Roland Dreier, Al Viro,
	linux-fsdevel@vger.kernel.org, Ingo Molnar, Brian Gerst

On Thu, Feb 11, 2010 at 09:33:49AM -0800, H. Peter Anvin wrote:
> You don't do the push/pop inline -- if you're going to take the hit of
> pushing this into the caller, it's better to list them as explicit
> clobbers and let the compiler figure out how to do it.  The point of
> doing an explicit push/pop is that it can be pushed into the out-of-line
> subroutine.

Doh! Of course, this was wrong. See below.

> Furthermore, you're still putting "volatile" on there... this is a pure
> computation -- no side effects -- so it is exactly when you *shouldn't*
> declare your asm statement volatile.

Also gone.

> Note: given how simple and regular a popcnt actually is, it might be
> preferrable to have the out-of-line implementation either in assembly,
> or using gcc's -fcall-saved-* options to reduce the number of registers
> that is clobbered by the routine.

Yep, this is actually a very good idea. And it seems to work, I've
tested it on an F10, K8 and even on an Atom and no blow ups so far...

---
 arch/alpha/include/asm/bitops.h            |    4 ++
 arch/ia64/include/asm/bitops.h             |    1 +
 arch/sparc/include/asm/bitops_64.h         |    4 ++
 arch/x86/include/asm/bitops.h              |   10 ++++
 arch/x86/lib/Makefile                      |    8 +++-
 arch/x86/lib/hweight.c                     |   72 ++++++++++++++++++++++++++++
 include/asm-generic/bitops/arch_hweight.h  |    8 ++--
 include/asm-generic/bitops/const_hweight.h |   43 +++++++++++++++--
 lib/hweight.c                              |   20 ++++----
 9 files changed, 151 insertions(+), 19 deletions(-)
 create mode 100644 arch/x86/lib/hweight.c

diff --git a/arch/alpha/include/asm/bitops.h b/arch/alpha/include/asm/bitops.h
index 296da1d..7fe6943 100644
--- a/arch/alpha/include/asm/bitops.h
+++ b/arch/alpha/include/asm/bitops.h
@@ -409,21 +409,25 @@ static inline unsigned long __arch_hweight64(unsigned long w)
 {
 	return __kernel_ctpop(w);
 }
+#define __arch_hweight64 __arch_hweight64
 
 static inline unsigned int __arch_weight32(unsigned int w)
 {
 	return __arch_hweight64(w);
 }
+#define __arch_hweight32 __arch_hweight32
 
 static inline unsigned int __arch_hweight16(unsigned int w)
 {
 	return __arch_hweight64(w & 0xffff);
 }
+#define __arch_hweight16 __arch_hweight16
 
 static inline unsigned int __arch_hweight8(unsigned int w)
 {
 	return __arch_hweight64(w & 0xff);
 }
+#define __arch_hweight8 __arch_hweight8
 #else
 #include <asm-generic/bitops/arch_hweight.h>
 #endif
diff --git a/arch/ia64/include/asm/bitops.h b/arch/ia64/include/asm/bitops.h
index 9da3df6..9b64f59 100644
--- a/arch/ia64/include/asm/bitops.h
+++ b/arch/ia64/include/asm/bitops.h
@@ -443,6 +443,7 @@ static __inline__ unsigned long __arch_hweight64(unsigned long x)
 	result = ia64_popcnt(x);
 	return result;
 }
+#define __arch_hweight64 __arch_hweight64
 
 #define __arch_hweight32(x) ((unsigned int) __arch_hweight64((x) & 0xfffffffful))
 #define __arch_hweight16(x) ((unsigned int) __arch_hweight64((x) & 0xfffful))
diff --git a/arch/sparc/include/asm/bitops_64.h b/arch/sparc/include/asm/bitops_64.h
index 766121a..4982bc4 100644
--- a/arch/sparc/include/asm/bitops_64.h
+++ b/arch/sparc/include/asm/bitops_64.h
@@ -51,6 +51,7 @@ static inline unsigned int __arch_hweight64(unsigned long w)
 	__asm__ ("popc %1,%0" : "=r" (res) : "r" (w));
 	return res;
 }
+#define __arch_hweight64 __arch_hweight64
 
 static inline unsigned int __arch_hweight32(unsigned int w)
 {
@@ -59,6 +60,7 @@ static inline unsigned int __arch_hweight32(unsigned int w)
 	__asm__ ("popc %1,%0" : "=r" (res) : "r" (w & 0xffffffff));
 	return res;
 }
+#define __arch_hweight32 __arch_hweight32
 
 static inline unsigned int __arch_hweight16(unsigned int w)
 {
@@ -67,6 +69,7 @@ static inline unsigned int __arch_hweight16(unsigned int w)
 	__asm__ ("popc %1,%0" : "=r" (res) : "r" (w & 0xffff));
 	return res;
 }
+#define __arch_hweight16 __arch_hweight16
 
 static inline unsigned int __arch_hweight8(unsigned int w)
 {
@@ -75,6 +78,7 @@ static inline unsigned int __arch_hweight8(unsigned int w)
 	__asm__ ("popc %1,%0" : "=r" (res) : "r" (w & 0xff));
 	return res;
 }
+#define __arch_hweight8 __arch_hweight8
 
 #else
 
diff --git a/arch/x86/include/asm/bitops.h b/arch/x86/include/asm/bitops.h
index 02b47a6..187defe 100644
--- a/arch/x86/include/asm/bitops.h
+++ b/arch/x86/include/asm/bitops.h
@@ -444,6 +444,16 @@ static inline int fls(int x)
 
 #define ARCH_HAS_FAST_MULTIPLIER 1
 
+extern unsigned int __arch_hweight32(unsigned int w);
+extern unsigned int __arch_hweight16(unsigned int w);
+extern unsigned int __arch_hweight8(unsigned int w);
+extern unsigned long __arch_hweight64(__u64 w);
+
+#define __arch_hweight32 __arch_hweight32
+#define __arch_hweight16 __arch_hweight16
+#define __arch_hweight8 __arch_hweight8
+#define __arch_hweight64 __arch_hweight64
+
 #include <asm-generic/bitops/hweight.h>
 
 #endif /* __KERNEL__ */
diff --git a/arch/x86/lib/Makefile b/arch/x86/lib/Makefile
index cffd754..ddc32eb 100644
--- a/arch/x86/lib/Makefile
+++ b/arch/x86/lib/Makefile
@@ -22,9 +22,11 @@ lib-y += usercopy_$(BITS).o getuser.o putuser.o
 lib-y += memcpy_$(BITS).o
 lib-$(CONFIG_KPROBES) += insn.o inat.o
 
-obj-y += msr.o msr-reg.o msr-reg-export.o
+obj-y += msr.o msr-reg.o msr-reg-export.o hweight.o
 
 ifeq ($(CONFIG_X86_32),y)
+	CFLAGS_hweight.o = -fcall-saved-ecx -fcall-saved-edx
+
         obj-y += atomic64_32.o
         lib-y += checksum_32.o
         lib-y += strstr_32.o
@@ -34,6 +36,10 @@ ifneq ($(CONFIG_X86_CMPXCHG64),y)
 endif
         lib-$(CONFIG_X86_USE_3DNOW) += mmx_32.o
 else
+	CFLAGS_hweight.o = -fcall-saved-rsi -fcall-saved-rdx -fcall-saved-rcx \
+			   -fcall-saved-r8 -fcall-saved-r9 -fcall-saved-r10 \
+			   -fcall-saved-r11
+
         obj-y += io_64.o iomap_copy_64.o
         lib-y += csum-partial_64.o csum-copy_64.o csum-wrappers_64.o
         lib-y += thunk_64.o clear_page_64.o copy_page_64.o
diff --git a/arch/x86/lib/hweight.c b/arch/x86/lib/hweight.c
new file mode 100644
index 0000000..9b4bfa5
--- /dev/null
+++ b/arch/x86/lib/hweight.c
@@ -0,0 +1,72 @@
+#include <linux/kernel.h>
+#include <linux/module.h>
+#include <linux/bitops.h>
+
+#ifdef CONFIG_64BIT
+/* popcnt %rdi, %rax */
+#define POPCNT ".byte 0xf3\n\t.byte 0x48\n\t.byte 0x0f\n\t.byte 0xb8\n\t.byte 0xc7"
+#define REG_IN "D"
+#define REG_OUT "a"
+#else
+/* popcnt %eax, %eax */
+#define POPCNT ".byte 0xf3\n\t.byte 0x0f\n\t.byte 0xb8\n\t.byte 0xc0"
+#define REG_IN "a"
+#define REG_OUT "a"
+#endif
+
+/*
+ * Those are called out-of-line in the alternative below and are added here only
+ * so that gcc is able to figure out which registers have been clobbered by
+ * __sw_hweightXX so that it could restore their values before returning from
+ * the __arch_hweightXX versions. See also <arch/x86/lib/Makefile>.
+ */
+unsigned int _sw_hweight32(unsigned int w)
+{
+	return __sw_hweight32(w);
+}
+
+unsigned long _sw_hweight64(__u64 w)
+{
+	return __sw_hweight64(w);
+}
+
+unsigned int __arch_hweight32(unsigned int w)
+{
+	unsigned int res = 0;
+
+	asm (ALTERNATIVE("call _sw_hweight32", POPCNT, X86_FEATURE_POPCNT)
+		     : "="REG_OUT (res)
+		     : REG_IN (w));
+
+	return res;
+}
+EXPORT_SYMBOL(__arch_hweight32);
+
+unsigned int __arch_hweight16(unsigned int w)
+{
+	return __arch_hweight32(w & 0xffff);
+}
+EXPORT_SYMBOL(__arch_hweight16);
+
+unsigned int __arch_hweight8(unsigned int w)
+{
+	return __arch_hweight32(w & 0xff);
+}
+EXPORT_SYMBOL(__arch_hweight8);
+
+unsigned long __arch_hweight64(__u64 w)
+{
+	unsigned long res = 0;
+
+#ifdef CONFIG_X86_32
+	return  __arch_hweight32((u32)w) +
+		__arch_hweight32((u32)(w >> 32));
+#else
+	asm (ALTERNATIVE("call _sw_hweight64", POPCNT, X86_FEATURE_POPCNT)
+		     : "="REG_OUT (res)
+		     : REG_IN (w));
+#endif /* CONFIG_X86_32 */
+
+	return res;
+}
+EXPORT_SYMBOL(__arch_hweight64);
diff --git a/include/asm-generic/bitops/arch_hweight.h b/include/asm-generic/bitops/arch_hweight.h
index 3a7be84..c72de8b 100644
--- a/include/asm-generic/bitops/arch_hweight.h
+++ b/include/asm-generic/bitops/arch_hweight.h
@@ -3,9 +3,9 @@
 
 #include <asm/types.h>
 
-extern unsigned int __arch_hweight32(unsigned int w);
-extern unsigned int __arch_hweight16(unsigned int w);
-extern unsigned int __arch_hweight8(unsigned int w);
-extern unsigned long __arch_hweight64(__u64 w);
+extern unsigned int __sw_hweight32(unsigned int w);
+extern unsigned int __sw_hweight16(unsigned int w);
+extern unsigned int __sw_hweight8(unsigned int w);
+extern unsigned long __sw_hweight64(__u64 w);
 
 #endif /* _ASM_GENERIC_BITOPS_HWEIGHT_H_ */
diff --git a/include/asm-generic/bitops/const_hweight.h b/include/asm-generic/bitops/const_hweight.h
index fa2a50b..8cf8169 100644
--- a/include/asm-generic/bitops/const_hweight.h
+++ b/include/asm-generic/bitops/const_hweight.h
@@ -18,13 +18,48 @@
 #define __const_hweight32(w) (__const_hweight16(w) + __const_hweight16((w) >> 16))
 #define __const_hweight64(w) (__const_hweight32(w) + __const_hweight32((w) >> 32))
 
+static inline unsigned int _hweight8(unsigned int w)
+{
+#ifdef __arch_hweight8
+       return __arch_hweight8(w);
+#else
+       return __sw_hweight8(w);
+#endif
+}
+
+static inline unsigned int _hweight16(unsigned int w)
+{
+#ifdef __arch_hweight16
+       return __arch_hweight16(w);
+#else
+       return __sw_hweight16(w);
+#endif
+}
+
+static inline unsigned int _hweight32(unsigned int w)
+{
+#ifdef __arch_hweight32
+       return __arch_hweight32(w);
+#else
+       return __sw_hweight32(w);
+#endif
+}
+
+static inline unsigned long _hweight64(__u64 w)
+{
+#ifdef __arch_hweight64
+       return __arch_hweight64(w);
+#else
+       return __sw_hweight64(w);
+#endif
+}
 /*
  * Generic interface.
  */
-#define hweight8(w)  (__builtin_constant_p(w) ? __const_hweight8(w)  : __arch_hweight8(w))
-#define hweight16(w) (__builtin_constant_p(w) ? __const_hweight16(w) : __arch_hweight16(w))
-#define hweight32(w) (__builtin_constant_p(w) ? __const_hweight32(w) : __arch_hweight32(w))
-#define hweight64(w) (__builtin_constant_p(w) ? __const_hweight64(w) : __arch_hweight64(w))
+#define hweight8(w)  (__builtin_constant_p(w) ? __const_hweight8(w)  : _hweight8(w))
+#define hweight16(w) (__builtin_constant_p(w) ? __const_hweight16(w) : _hweight16(w))
+#define hweight32(w) (__builtin_constant_p(w) ? __const_hweight32(w) : _hweight32(w))
+#define hweight64(w) (__builtin_constant_p(w) ? __const_hweight64(w) : _hweight64(w))
 
 /*
  * Interface for known constant arguments
diff --git a/lib/hweight.c b/lib/hweight.c
index 9ff86df..f9ce440 100644
--- a/lib/hweight.c
+++ b/lib/hweight.c
@@ -9,7 +9,7 @@
  * The Hamming Weight of a number is the total number of bits set in it.
  */
 
-unsigned int __arch_hweight32(unsigned int w)
+unsigned int __sw_hweight32(unsigned int w)
 {
 	unsigned int res = w - ((w >> 1) & 0x55555555);
 	res = (res & 0x33333333) + ((res >> 2) & 0x33333333);
@@ -17,30 +17,30 @@ unsigned int __arch_hweight32(unsigned int w)
 	res = res + (res >> 8);
 	return (res + (res >> 16)) & 0x000000FF;
 }
-EXPORT_SYMBOL(__arch_hweight32);
+EXPORT_SYMBOL(__sw_hweight32);
 
-unsigned int __arch_hweight16(unsigned int w)
+unsigned int __sw_hweight16(unsigned int w)
 {
 	unsigned int res = w - ((w >> 1) & 0x5555);
 	res = (res & 0x3333) + ((res >> 2) & 0x3333);
 	res = (res + (res >> 4)) & 0x0F0F;
 	return (res + (res >> 8)) & 0x00FF;
 }
-EXPORT_SYMBOL(__arch_hweight16);
+EXPORT_SYMBOL(__sw_hweight16);
 
-unsigned int __arch_hweight8(unsigned int w)
+unsigned int __sw_hweight8(unsigned int w)
 {
 	unsigned int res = w - ((w >> 1) & 0x55);
 	res = (res & 0x33) + ((res >> 2) & 0x33);
 	return (res + (res >> 4)) & 0x0F;
 }
-EXPORT_SYMBOL(__arch_hweight8);
+EXPORT_SYMBOL(__sw_hweight8);
 
-unsigned long __arch_hweight64(__u64 w)
+unsigned long __sw_hweight64(__u64 w)
 {
 #if BITS_PER_LONG == 32
-	return __arch_hweight32((unsigned int)(w >> 32)) +
-	       __arch_hweight32((unsigned int)w);
+	return __sw_hweight32((unsigned int)(w >> 32)) +
+	       __sw_hweight32((unsigned int)w);
 #elif BITS_PER_LONG == 64
 #ifdef ARCH_HAS_FAST_MULTIPLIER
 	w -= (w >> 1) & 0x5555555555555555ul;
@@ -57,4 +57,4 @@ unsigned long __arch_hweight64(__u64 w)
 #endif
 #endif
 }
-EXPORT_SYMBOL(__arch_hweight64);
+EXPORT_SYMBOL(__sw_hweight64);
-- 
1.6.6.1


-- 
Regards/Gruss,
Boris.

--
Advanced Micro Devices, Inc.
Operating Systems Research Center

^ permalink raw reply related	[flat|nested] 81+ messages in thread

* Re: [PATCH 2/5] bitops: compile time optimization for hweight_long(CONSTANT)
  2010-02-12 17:06                                             ` Borislav Petkov
@ 2010-02-12 17:28                                               ` H. Peter Anvin
  2010-02-12 17:47                                                 ` Borislav Petkov
  0 siblings, 1 reply; 81+ messages in thread
From: H. Peter Anvin @ 2010-02-12 17:28 UTC (permalink / raw)
  To: Borislav Petkov
  Cc: Borislav Petkov, Peter Zijlstra, Andrew Morton, Wu Fengguang,
	LKML, Jamie Lokier, Roland Dreier, Al Viro,
	linux-fsdevel@vger.kernel.org, Ingo Molnar, Brian Gerst

On 02/12/2010 09:06 AM, Borislav Petkov wrote:
> On Thu, Feb 11, 2010 at 09:33:49AM -0800, H. Peter Anvin wrote:
>> You don't do the push/pop inline -- if you're going to take the hit of
>> pushing this into the caller, it's better to list them as explicit
>> clobbers and let the compiler figure out how to do it.  The point of
>> doing an explicit push/pop is that it can be pushed into the out-of-line
>> subroutine.
> 
> Doh! Of course, this was wrong. See below.
> 
> diff --git a/arch/x86/lib/hweight.c b/arch/x86/lib/hweight.c
> new file mode 100644
> index 0000000..9b4bfa5
> --- /dev/null
> +++ b/arch/x86/lib/hweight.c
> @@ -0,0 +1,72 @@
> +#include <linux/kernel.h>
> +#include <linux/module.h>
> +#include <linux/bitops.h>
> +
> +#ifdef CONFIG_64BIT
> +/* popcnt %rdi, %rax */
> +#define POPCNT ".byte 0xf3\n\t.byte 0x48\n\t.byte 0x0f\n\t.byte 0xb8\n\t.byte 0xc7"
> +#define REG_IN "D"
> +#define REG_OUT "a"
> +#else
> +/* popcnt %eax, %eax */
> +#define POPCNT ".byte 0xf3\n\t.byte 0x0f\n\t.byte 0xb8\n\t.byte 0xc0"
> +#define REG_IN "a"
> +#define REG_OUT "a"
> +#endif
> +
> +/*
> + * Those are called out-of-line in the alternative below and are added here only
> + * so that gcc is able to figure out which registers have been clobbered by
> + * __sw_hweightXX so that it could restore their values before returning from
> + * the __arch_hweightXX versions. See also <arch/x86/lib/Makefile>.
> + */
> +unsigned int _sw_hweight32(unsigned int w)
> +{
> +	return __sw_hweight32(w);
> +}
> +
> +unsigned long _sw_hweight64(__u64 w)
> +{
> +	return __sw_hweight64(w);
> +}
> +
> +unsigned int __arch_hweight32(unsigned int w)
> +{
> +	unsigned int res = 0;
> +
> +	asm (ALTERNATIVE("call _sw_hweight32", POPCNT, X86_FEATURE_POPCNT)
> +		     : "="REG_OUT (res)
> +		     : REG_IN (w));
> +

Now you have no clobbers and no assembly wrapper.  That really doesn't work.

	-hpa

-- 
H. Peter Anvin, Intel Open Source Technology Center
I work for Intel.  I don't speak on their behalf.

^ permalink raw reply	[flat|nested] 81+ messages in thread

* Re: [PATCH 2/5] bitops: compile time optimization for hweight_long(CONSTANT)
  2010-02-12 17:28                                               ` H. Peter Anvin
@ 2010-02-12 17:47                                                 ` Borislav Petkov
  2010-02-12 19:05                                                   ` H. Peter Anvin
  0 siblings, 1 reply; 81+ messages in thread
From: Borislav Petkov @ 2010-02-12 17:47 UTC (permalink / raw)
  To: H. Peter Anvin
  Cc: Borislav Petkov, Peter Zijlstra, Andrew Morton, Wu Fengguang,
	LKML, Jamie Lokier, Roland Dreier, Al Viro,
	linux-fsdevel@vger.kernel.org, Ingo Molnar, Brian Gerst

On Fri, Feb 12, 2010 at 09:28:32AM -0800, H. Peter Anvin wrote:
> On 02/12/2010 09:06 AM, Borislav Petkov wrote:
> > On Thu, Feb 11, 2010 at 09:33:49AM -0800, H. Peter Anvin wrote:
> >> You don't do the push/pop inline -- if you're going to take the hit of
> >> pushing this into the caller, it's better to list them as explicit
> >> clobbers and let the compiler figure out how to do it.  The point of
> >> doing an explicit push/pop is that it can be pushed into the out-of-line
> >> subroutine.
> > 
> > Doh! Of course, this was wrong. See below.
> > 
> > diff --git a/arch/x86/lib/hweight.c b/arch/x86/lib/hweight.c
> > new file mode 100644
> > index 0000000..9b4bfa5
> > --- /dev/null
> > +++ b/arch/x86/lib/hweight.c
> > @@ -0,0 +1,72 @@
> > +#include <linux/kernel.h>
> > +#include <linux/module.h>
> > +#include <linux/bitops.h>
> > +
> > +#ifdef CONFIG_64BIT
> > +/* popcnt %rdi, %rax */
> > +#define POPCNT ".byte 0xf3\n\t.byte 0x48\n\t.byte 0x0f\n\t.byte 0xb8\n\t.byte 0xc7"
> > +#define REG_IN "D"
> > +#define REG_OUT "a"
> > +#else
> > +/* popcnt %eax, %eax */
> > +#define POPCNT ".byte 0xf3\n\t.byte 0x0f\n\t.byte 0xb8\n\t.byte 0xc0"
> > +#define REG_IN "a"
> > +#define REG_OUT "a"
> > +#endif
> > +
> > +/*
> > + * Those are called out-of-line in the alternative below and are added here only
> > + * so that gcc is able to figure out which registers have been clobbered by
> > + * __sw_hweightXX so that it could restore their values before returning from
> > + * the __arch_hweightXX versions. See also <arch/x86/lib/Makefile>.
> > + */
> > +unsigned int _sw_hweight32(unsigned int w)
> > +{
> > +	return __sw_hweight32(w);
> > +}
> > +
> > +unsigned long _sw_hweight64(__u64 w)
> > +{
> > +	return __sw_hweight64(w);
> > +}
> > +
> > +unsigned int __arch_hweight32(unsigned int w)
> > +{
> > +	unsigned int res = 0;
> > +
> > +	asm (ALTERNATIVE("call _sw_hweight32", POPCNT, X86_FEATURE_POPCNT)
> > +		     : "="REG_OUT (res)
> > +		     : REG_IN (w));
> > +
> 
> Now you have no clobbers and no assembly wrapper.  That really doesn't work.

I see, this means we have to build lib/hweight.c with -fcall-saved*. I
just did a test and it uses %rcx and %rdx temporarily by saving their
values on the stack and restoring them before it returns.

However, this is generic code and for the above to work we have to
enforce x86-specific CFLAGS for it. What is the preferred way to do
that?

-- 
Regards/Gruss,
Boris.

--
Advanced Micro Devices, Inc.
Operating Systems Research Center

^ permalink raw reply	[flat|nested] 81+ messages in thread

* Re: [PATCH 2/5] bitops: compile time optimization for hweight_long(CONSTANT)
  2010-02-12 17:47                                                 ` Borislav Petkov
@ 2010-02-12 19:05                                                   ` H. Peter Anvin
  2010-02-17 13:57                                                     ` Michal Marek
  0 siblings, 1 reply; 81+ messages in thread
From: H. Peter Anvin @ 2010-02-12 19:05 UTC (permalink / raw)
  To: Borislav Petkov, Michal Marek, linux-kbuild
  Cc: Borislav Petkov, Peter Zijlstra, Andrew Morton, Wu Fengguang,
	LKML, Jamie Lokier, Roland Dreier, Al Viro,
	linux-fsdevel@vger.kernel.org, Ingo Molnar, Brian Gerst

On 02/12/2010 09:47 AM, Borislav Petkov wrote:
> 
> However, this is generic code and for the above to work we have to
> enforce x86-specific CFLAGS for it. What is the preferred way to do
> that?
> 

That's a question for Michal and the kbuild list.  Michal?

	-hpa


^ permalink raw reply	[flat|nested] 81+ messages in thread

* Re: [PATCH 2/5] bitops: compile time optimization for hweight_long(CONSTANT)
  2010-02-11 17:24                                         ` Borislav Petkov
  2010-02-11 17:33                                           ` H. Peter Anvin
@ 2010-02-14 10:12                                           ` Peter Zijlstra
  2010-02-14 11:24                                             ` Borislav Petkov
  1 sibling, 1 reply; 81+ messages in thread
From: Peter Zijlstra @ 2010-02-14 10:12 UTC (permalink / raw)
  To: Borislav Petkov
  Cc: H. Peter Anvin, Borislav Petkov, Peter Zijlstra, Andrew Morton,
	Wu Fengguang, LKML, Jamie Lokier, Roland Dreier, Al Viro,
	linux-fsdevel@vger.kernel.org, Ingo Molnar, Brian Gerst

On Thu, 2010-02-11 at 18:24 +0100, Borislav Petkov wrote:
> On Mon, Feb 08, 2010 at 10:59:45AM +0100, Borislav Petkov wrote:
> > Let me prep another version when I get back on Wed. (currently
> > travelling) with all the stuff we discussed to see how it would turn.
> 
> Ok, here's another version ontop of PeterZ's patch at
> http://lkml.org/lkml/2010/2/4/119. I need to handle 32- and 64-bit
> differently wrt to popcnt opcode so on 32-bit I do "popcnt %eax, %eax"
> while on 64-bit I do "popcnt %rdi, %rdi".

Right, so I don't like how you need to touch !x86 for this, and I think
that is easily avoidable by not making x86 include
asm-generic/bitops/arch_hweight.h.

If you then add __sw_hweightN() -> __arch_hweightN() wrappers in
arch_hweight.h, then you can leave const_hweight.h use __arch_hweightN()
and simply provide __arch_hweightN() from x86/include/asm/bitops.h

^ permalink raw reply	[flat|nested] 81+ messages in thread

* Re: [PATCH 2/5] bitops: compile time optimization for hweight_long(CONSTANT)
  2010-02-14 10:12                                           ` Peter Zijlstra
@ 2010-02-14 11:24                                             ` Borislav Petkov
  2010-02-14 12:23                                               ` Peter Zijlstra
  2010-02-14 18:36                                               ` H. Peter Anvin
  0 siblings, 2 replies; 81+ messages in thread
From: Borislav Petkov @ 2010-02-14 11:24 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Borislav Petkov, H. Peter Anvin, Andrew Morton, Wu Fengguang,
	LKML, Jamie Lokier, Roland Dreier, Al Viro,
	linux-fsdevel@vger.kernel.org, Ingo Molnar, Brian Gerst

On Sun, Feb 14, 2010 at 11:12:23AM +0100, Peter Zijlstra wrote:
> On Thu, 2010-02-11 at 18:24 +0100, Borislav Petkov wrote:
> > On Mon, Feb 08, 2010 at 10:59:45AM +0100, Borislav Petkov wrote:
> > > Let me prep another version when I get back on Wed. (currently
> > > travelling) with all the stuff we discussed to see how it would turn.
> > 
> > Ok, here's another version ontop of PeterZ's patch at
> > http://lkml.org/lkml/2010/2/4/119. I need to handle 32- and 64-bit
> > differently wrt to popcnt opcode so on 32-bit I do "popcnt %eax, %eax"
> > while on 64-bit I do "popcnt %rdi, %rdi".
> 
> Right, so I don't like how you need to touch !x86 for this, and I think
> that is easily avoidable by not making x86 include
> asm-generic/bitops/arch_hweight.h.
> 
> If you then add __sw_hweightN() -> __arch_hweightN() wrappers in
> arch_hweight.h, then you can leave const_hweight.h use __arch_hweightN()
> and simply provide __arch_hweightN() from x86/include/asm/bitops.h

Hmm, all these different names start to get a little confusing. Can we
first agree on the naming, please? Here's my proposal:

__const_hweightN - for arguments known to be constants at compile time
__arch_hweightN - the arch possibly has an optimized hweight version
__sw_hweightN - the fallback when nothing else is there, aka the
functions in lib/hweight.c

Now, in the x86 case, when the compiler can't know that the argument is
a constant, we call the __arch_hweightN versions. The alternative does
call the __sw_hweightN version in case the CPU doesn't support popcnt.
In this case, we need to build __sw_hweightN with -fcall-saved* for gcc
to be able to take care of the regs clobbered by __sw_hweightN.

So, if I understand you correctly, your suggestion might work, we
simply need to rename the lib/hweight.c versions to __sw_hweightN
and have <asm-generic/bitops/arch_hweight.h> have __arch_hweightN ->
__sw_hweightN wrappers in the default case, all arches which have an
optimized version will provide it in their respective bitops header...

How's that?

-- 
Regards/Gruss,
    Boris.

^ permalink raw reply	[flat|nested] 81+ messages in thread

* Re: [PATCH 2/5] bitops: compile time optimization for hweight_long(CONSTANT)
  2010-02-14 11:24                                             ` Borislav Petkov
@ 2010-02-14 12:23                                               ` Peter Zijlstra
  2010-02-14 14:19                                                 ` Borislav Petkov
  2010-02-14 18:36                                               ` H. Peter Anvin
  1 sibling, 1 reply; 81+ messages in thread
From: Peter Zijlstra @ 2010-02-14 12:23 UTC (permalink / raw)
  To: Borislav Petkov
  Cc: Borislav Petkov, H. Peter Anvin, Andrew Morton, Wu Fengguang,
	LKML, Jamie Lokier, Roland Dreier, Al Viro,
	linux-fsdevel@vger.kernel.org, Ingo Molnar, Brian Gerst

On Sun, 2010-02-14 at 12:24 +0100, Borislav Petkov wrote:
> 
> So, if I understand you correctly, your suggestion might work, we
> simply need to rename the lib/hweight.c versions to __sw_hweightN
> and have <asm-generic/bitops/arch_hweight.h> have __arch_hweightN ->
> __sw_hweightN wrappers in the default case, all arches which have an
> optimized version will provide it in their respective bitops header...
> 
I'm not quite sure what the last 'it' refers to; does that refer to:
 1) an __arch_hweightN() implementation, or
 2) __arch_hweightN() -> __sw_hweightN() wrappers ?

If you meant 1, then yes.

^ permalink raw reply	[flat|nested] 81+ messages in thread

* Re: [PATCH 2/5] bitops: compile time optimization for hweight_long(CONSTANT)
  2010-02-14 12:23                                               ` Peter Zijlstra
@ 2010-02-14 14:19                                                 ` Borislav Petkov
  0 siblings, 0 replies; 81+ messages in thread
From: Borislav Petkov @ 2010-02-14 14:19 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Borislav Petkov, H. Peter Anvin, Andrew Morton, Wu Fengguang,
	LKML, Jamie Lokier, Roland Dreier, Al Viro,
	linux-fsdevel@vger.kernel.org, Ingo Molnar, Brian Gerst

On Sun, Feb 14, 2010 at 01:23:35PM +0100, Peter Zijlstra wrote:
> On Sun, 2010-02-14 at 12:24 +0100, Borislav Petkov wrote:
> > 
> > So, if I understand you correctly, your suggestion might work, we
> > simply need to rename the lib/hweight.c versions to __sw_hweightN
> > and have <asm-generic/bitops/arch_hweight.h> have __arch_hweightN ->
> > __sw_hweightN wrappers in the default case, all arches which have an
> > optimized version will provide it in their respective bitops header...
> > 
> I'm not quite sure what the last 'it' refers to, does that refer to:
>  1) an __arch_hweightN() implementation, or
>  2) __arch_hweightN() -> __sw_hweightN() wrappers ?
> 
> If you meant 1, then yes.

Yes, I mean all architectures which have an optimized hweight will use
that optimized version in their __arch_hweightN, while as a default
fallback for the remaining architectures we'll have __arch_hweightN() ->
__sw_hweightN() wrappers in <asm-generic/bitops/arch_hweight.h>.

-- 
Regards/Gruss,
Boris.

-
Advanced Micro Devices, Inc.
Operating Systems Research Center

^ permalink raw reply	[flat|nested] 81+ messages in thread

* Re: [PATCH 2/5] bitops: compile time optimization for hweight_long(CONSTANT)
  2010-02-14 11:24                                             ` Borislav Petkov
  2010-02-14 12:23                                               ` Peter Zijlstra
@ 2010-02-14 18:36                                               ` H. Peter Anvin
  2010-02-14 20:28                                                 ` Borislav Petkov
  1 sibling, 1 reply; 81+ messages in thread
From: H. Peter Anvin @ 2010-02-14 18:36 UTC (permalink / raw)
  To: Borislav Petkov, Peter Zijlstra, Borislav Petkov, Andrew Morton,
	Wu Fengguang <fengguang.w

On 02/14/2010 03:24 AM, Borislav Petkov wrote:
> 
> __const_hweightN - for at compile time known constants as arguments
> __arch_hweightN - arch possibly has an optimized hweight version
> __sw_hweightN - fall back when nothing else is there, aka the functions in
> lib/hweight.c
> 
> Now, in the x86 case, when the compiler can't know that the argument is
> a constant, we call the __arch_hweightN versions. The alternative does
> call the __sw_hweightN version in case the CPU doesn't support popcnt.
> In this case, we need to build __sw_hweightN with -fcall-saved* for gcc
> to be able to take care of the regs clobbered ny __sw_hweightN.
> 
> So, if I understand you correctly, your suggestion might work, we
> simply need to rename the lib/hweight.c versions to __sw_hweightN
> and have <asm-generic/bitops/arch_hweight.h> have __arch_hweightN ->
> __sw_hweightN wrappers in the default case, all arches which have an
> optimized version will provide it in their respective bitops header...
> 

I'm not entirely sure what you're asking; if what you're asking is what
to name an x86-specific fallback function, it presumably should be
__arch_sw_hweightN (i.e. __arch prefix with a modifier.)

	-hpa

-- 
H. Peter Anvin, Intel Open Source Technology Center
I work for Intel.  I don't speak on their behalf.


^ permalink raw reply	[flat|nested] 81+ messages in thread

* Re: [PATCH 2/5] bitops: compile time optimization for hweight_long(CONSTANT)
  2010-02-14 18:36                                               ` H. Peter Anvin
@ 2010-02-14 20:28                                                 ` Borislav Petkov
  2010-02-14 22:13                                                   ` H. Peter Anvin
  0 siblings, 1 reply; 81+ messages in thread
From: Borislav Petkov @ 2010-02-14 20:28 UTC (permalink / raw)
  To: H. Peter Anvin
  Cc: Peter Zijlstra, Borislav Petkov, Andrew Morton, Wu Fengguang,
	LKML, Jamie Lokier, Roland Dreier, Al Viro,
	linux-fsdevel@vger.kernel.org, Ingo Molnar, Brian Gerst

On Sun, Feb 14, 2010 at 10:36:48AM -0800, H. Peter Anvin wrote:
> On 02/14/2010 03:24 AM, Borislav Petkov wrote:
> > 
> > __const_hweightN - for at compile time known constants as arguments
> > __arch_hweightN - arch possibly has an optimized hweight version
> > __sw_hweightN - fall back when nothing else is there, aka the functions in
> > lib/hweight.c
> > 
> > Now, in the x86 case, when the compiler can't know that the argument is
> > a constant, we call the __arch_hweightN versions. The alternative does
> > call the __sw_hweightN version in case the CPU doesn't support popcnt.
> > In this case, we need to build __sw_hweightN with -fcall-saved* for gcc
> > to be able to take care of the regs clobbered ny __sw_hweightN.
> > 
> > So, if I understand you correctly, your suggestion might work, we
> > simply need to rename the lib/hweight.c versions to __sw_hweightN
> > and have <asm-generic/bitops/arch_hweight.h> have __arch_hweightN ->
> > __sw_hweightN wrappers in the default case, all arches which have an
> > optimized version will provide it in their respective bitops header...
> > 
> 
> I'm not entirely sure what you're asking; if what you're asking is what to
> name an x86-specific fallback function, it presumably should be
> __arch_sw_hweightN (i.e. the __arch prefix with a modifier).

Hmm, basically, what PeterZ suggested is that I drop one indirection
under __arch_hweightN, which would make x86-specific fallback functions
superfluous.

IOW, what we have so far is:

#define hweightN(w) (__builtin_constant_p(w) ? __const_hweightN(w) : __arch_hweightN(w))

and have <asm-generic/bitops/arch_hweight.h> provide __arch_hweightN()
-> __sw_hweightN wrappers per default, where the __sw_hweightN are the
lib/hweight.c generic versions.

On architectures/CPUs which provide popcnt in
hardware, we create __arch_hweightN implementations in
<arch/[:ARCH_NAME:]/include/asm/bitops.h> overriding the
<asm-generic/bitops/arch_hweight.h> versions by simply not including
that last header.
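
(A minimal sketch of the constant side, for illustration only: built purely
from constant-foldable expressions so the __builtin_constant_p() branch
above collapses to an immediate. The names follow the scheme discussed here
and are not taken from any patch in this thread.)

#define __const_hweight8(w)                            \
	((unsigned int)((!!((w) & (1ULL << 0))) +      \
			(!!((w) & (1ULL << 1))) +      \
			(!!((w) & (1ULL << 2))) +      \
			(!!((w) & (1ULL << 3))) +      \
			(!!((w) & (1ULL << 4))) +      \
			(!!((w) & (1ULL << 5))) +      \
			(!!((w) & (1ULL << 6))) +      \
			(!!((w) & (1ULL << 7)))))

#define __const_hweight16(w) (__const_hweight8(w)  + __const_hweight8((w)  >> 8))
#define __const_hweight32(w) (__const_hweight16(w) + __const_hweight16((w) >> 16))
#define __const_hweight64(w) (__const_hweight32(w) + __const_hweight32((w) >> 32))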

Is that agreeable?

-- 
Regards/Gruss,
    Boris.

^ permalink raw reply	[flat|nested] 81+ messages in thread

* Re: [PATCH 2/5] bitops: compile time optimization for hweight_long(CONSTANT)
  2010-02-14 20:28                                                 ` Borislav Petkov
@ 2010-02-14 22:13                                                   ` H. Peter Anvin
  0 siblings, 0 replies; 81+ messages in thread
From: H. Peter Anvin @ 2010-02-14 22:13 UTC (permalink / raw)
  To: Borislav Petkov, Peter Zijlstra, Borislav Petkov, Andrew Morton,
	Wu Fengguang <fengguang.w

On 02/14/2010 12:28 PM, Borislav Petkov wrote:
> 
> Hmm, basically, what PeterZ suggested is that I drop one indirection
> under __arch_hweightN, which would make x86-specific fallback functions
> superfluous.
> 
> IOW, what we have so far is:
> 
> #define hweightN(w) (__builtin_constant_p(w) ? __const_hweightN(w) : __arch_hweightN(w))
> 
> and have <asm-generic/bitops/arch_hweight.h> provide __arch_hweightN()
> -> __sw_hweightN wrappers per default, where the __sw_hweightN are the
> lib/hweight.c generic versions.
> 
> On architectures/CPUs which provide popcnt in
> hardware, we create __arch_hweightN implementations in
> <arch/[:ARCH_NAME:]/include/asm/bitops.h> overriding the
> <asm-generic/bitops/arch_hweight.h> versions by simply not including
> that last header.
> 
> Is that agreeable?
> 

That makes sense... after all, that's a pretty typical use of asm-generic.

	-hpa

-- 
H. Peter Anvin, Intel Open Source Technology Center
I work for Intel.  I don't speak on their behalf.


^ permalink raw reply	[flat|nested] 81+ messages in thread

* Re: [PATCH 2/5] bitops: compile time optimization for hweight_long(CONSTANT)
  2010-02-12 19:05                                                   ` H. Peter Anvin
@ 2010-02-17 13:57                                                     ` Michal Marek
  2010-02-17 17:20                                                       ` Borislav Petkov
  2010-02-18 10:51                                                       ` [PATCH 2/5] bitops: compile time optimization for hweight_long(CONSTANT) Peter Zijlstra
  0 siblings, 2 replies; 81+ messages in thread
From: Michal Marek @ 2010-02-17 13:57 UTC (permalink / raw)
  To: H. Peter Anvin, Borislav Petkov
  Cc: linux-kbuild, Borislav Petkov, Peter Zijlstra, Andrew Morton,
	Wu Fengguang, LKML, Jamie Lokier, Roland Dreier, Al Viro,
	linux-fsdevel@vger.kernel.org, Ingo Molnar, Brian Gerst

On 12.2.2010 20:05, H. Peter Anvin wrote:
> On 02/12/2010 09:47 AM, Borislav Petkov wrote:
>>
>> However, this is generic code and for the above to work we have to
>> enforce x86-specific CFLAGS for it. What is the preferred way to do
>> that?
>>
> 
> That's a question for Michal and the kbuild list.  Michal?

(I was offline last week).

The _preferred_ way probably is not to do it :), but otherwise you can
set CFLAGS_hweight.o depending on CONFIG_X86(_32|_64), just like you do
in arch/x86/lib/Makefile already.

Michal

^ permalink raw reply	[flat|nested] 81+ messages in thread

* Re: [PATCH 2/5] bitops: compile time optimization for hweight_long(CONSTANT)
  2010-02-17 13:57                                                     ` Michal Marek
@ 2010-02-17 17:20                                                       ` Borislav Petkov
  2010-02-17 17:31                                                         ` Michal Marek
  2010-02-18 10:51                                                       ` [PATCH 2/5] bitops: compile time optimization for hweight_long(CONSTANT) Peter Zijlstra
  1 sibling, 1 reply; 81+ messages in thread
From: Borislav Petkov @ 2010-02-17 17:20 UTC (permalink / raw)
  To: Michal Marek
  Cc: H. Peter Anvin, linux-kbuild, Borislav Petkov, Peter Zijlstra,
	Andrew Morton, Wu Fengguang, LKML, Jamie Lokier, Roland Dreier,
	Al Viro, linux-fsdevel@vger.kernel.org, Ingo Molnar, Brian Gerst

On Wed, Feb 17, 2010 at 02:57:42PM +0100, Michal Marek wrote:
> On 12.2.2010 20:05, H. Peter Anvin wrote:
> > On 02/12/2010 09:47 AM, Borislav Petkov wrote:
> >>
> >> However, this is generic code and for the above to work we have to
> >> enforce x86-specific CFLAGS for it. What is the preferred way to do
> >> that?
> >>
> > 
> > That's a question for Michal and the kbuild list.  Michal?
> 
> (I was offline last week).
> 
> The _preferred_ way probably is not to do it :), but otherwise you can
> set CFLAGS_hweight.o depending on CONFIG_X86(_32|_64), just like you do
> in arch/x86/lib/Makefile already.

Wouldn't it be better if we had something like ARCH_CFLAGS_hweight.o
which gets set in the arch Makefile instead?

-- 
Regards/Gruss,
Boris.

--
Advanced Micro Devices, Inc.
Operating Systems Research Center

^ permalink raw reply	[flat|nested] 81+ messages in thread

* Re: [PATCH 2/5] bitops: compile time optimization for hweight_long(CONSTANT)
  2010-02-17 17:20                                                       ` Borislav Petkov
@ 2010-02-17 17:31                                                         ` Michal Marek
  2010-02-17 17:34                                                           ` Borislav Petkov
  2010-02-17 17:39                                                           ` Michal Marek
  0 siblings, 2 replies; 81+ messages in thread
From: Michal Marek @ 2010-02-17 17:31 UTC (permalink / raw)
  To: Borislav Petkov
  Cc: H. Peter Anvin, linux-kbuild, Borislav Petkov, Peter Zijlstra,
	Andrew Morton, Wu Fengguang, LKML, Jamie Lokier, Roland Dreier,
	Al Viro, linux-fsdevel@vger.kernel.org, Ingo Molnar, Brian Gerst

On 17.2.2010 18:20, Borislav Petkov wrote:
> On Wed, Feb 17, 2010 at 02:57:42PM +0100, Michal Marek wrote:
>> On 12.2.2010 20:05, H. Peter Anvin wrote:
>>> On 02/12/2010 09:47 AM, Borislav Petkov wrote:
>>>>
>>>> However, this is generic code and for the above to work we have to
>>>> enforce x86-specific CFLAGS for it. What is the preferred way to do
>>>> that?
>>>>
>>>
>>> That's a question for Michal and the kbuild list.  Michal?
>>
>> (I was offline last week).
>>
>> The _preferred_ way probably is not to do it :), but otherwise you can
>> set CFLAGS_hweight.o depending on CONFIG_X86(_32|_64), just like you do
>> in arch/x86/lib/Makefile already.
> 
> Wouldn't it be better if we had something like ARCH_CFLAGS_hweight.o
> which gets set in the arch Makefile instead?

We could, but is it worth it if there is only one potential user so far?
IMO just put the condition in lib/Makefile now, and if there turn out to
be more cases like this, we can add support for ARCH_CFLAGS_foo.o then.

Michal

^ permalink raw reply	[flat|nested] 81+ messages in thread

* Re: [PATCH 2/5] bitops: compile time optimization for hweight_long(CONSTANT)
  2010-02-17 17:31                                                         ` Michal Marek
@ 2010-02-17 17:34                                                           ` Borislav Petkov
  2010-02-17 17:39                                                           ` Michal Marek
  1 sibling, 0 replies; 81+ messages in thread
From: Borislav Petkov @ 2010-02-17 17:34 UTC (permalink / raw)
  To: Michal Marek
  Cc: H. Peter Anvin, linux-kbuild, Borislav Petkov, Peter Zijlstra,
	Andrew Morton, Wu Fengguang, LKML, Jamie Lokier, Roland Dreier,
	Al Viro, linux-fsdevel@vger.kernel.org, Ingo Molnar, Brian Gerst

On Wed, Feb 17, 2010 at 06:31:04PM +0100, Michal Marek wrote:
> On 17.2.2010 18:20, Borislav Petkov wrote:
> > On Wed, Feb 17, 2010 at 02:57:42PM +0100, Michal Marek wrote:
> >> On 12.2.2010 20:05, H. Peter Anvin wrote:
> >>> On 02/12/2010 09:47 AM, Borislav Petkov wrote:
> >>>>
> >>>> However, this is generic code and for the above to work we have to
> >>>> enforce x86-specific CFLAGS for it. What is the preferred way to do
> >>>> that?
> >>>>
> >>>
> >>> That's a question for Michal and the kbuild list.  Michal?
> >>
> >> (I was offline last week).
> >>
> >> The _preferred_ way probably is not to do it :), but otherwise you can
> >> set CFLAGS_hweight.o depending on CONFIG_X86(_32|_64), just like you do
> >> in arch/x86/lib/Makefile already.
> > 
> > Wouldn't it be better if we had something like ARCH_CFLAGS_hweight.o
> > which gets set in the arch Makefile instead?
> 
> We could, but is it worth it if there is only one potential user so far?
> IMO just put the condition in lib/Makefile now, and if there turn out to
> be more cases like this, we can add support for ARCH_CFLAGS_foo.o then.

Ok, I'm fine with that too, thanks.

-- 
Regards/Gruss,
Boris.

--
Advanced Micro Devices, Inc.
Operating Systems Research Center

^ permalink raw reply	[flat|nested] 81+ messages in thread

* Re: [PATCH 2/5] bitops: compile time optimization for hweight_long(CONSTANT)
  2010-02-17 17:31                                                         ` Michal Marek
  2010-02-17 17:34                                                           ` Borislav Petkov
@ 2010-02-17 17:39                                                           ` Michal Marek
  2010-02-18  6:19                                                             ` Borislav Petkov
  1 sibling, 1 reply; 81+ messages in thread
From: Michal Marek @ 2010-02-17 17:39 UTC (permalink / raw)
  To: Borislav Petkov
  Cc: H. Peter Anvin, linux-kbuild, Borislav Petkov, Peter Zijlstra,
	Andrew Morton, Wu Fengguang, LKML, Jamie Lokier, Roland Dreier,
	Al Viro, linux-fsdevel@vger.kernel.org, Ingo Molnar, Brian Gerst

On 17.2.2010 18:31, Michal Marek wrote:
> On 17.2.2010 18:20, Borislav Petkov wrote:
>> On Wed, Feb 17, 2010 at 02:57:42PM +0100, Michal Marek wrote:
>>> On 12.2.2010 20:05, H. Peter Anvin wrote:
>>>> On 02/12/2010 09:47 AM, Borislav Petkov wrote:
>>>>>
>>>>> However, this is generic code and for the above to work we have to
>>>>> enforce x86-specific CFLAGS for it. What is the preferred way to do
>>>>> that?
>>>>>
>>>>
>>>> That's a question for Michal and the kbuild list.  Michal?
>>>
>>> (I was offline last week).
>>>
>>> The _preferred_ way probably is not to do it :), but otherwise you can
>>> set CFLAGS_hweight.o depending on CONFIG_X86(_32|_64), just like you do
>>> in arch/x86/lib/Makefile already.
>>
>> Wouldn't it be better if we had something like ARCH_CFLAGS_hweight.o
>> which gets set in the arch Makefile instead?
> 
> We could, but is it worth it if there is only one potential user so far?
> IMO just put the condition in lib/Makefile now, and if there turn out to
> be more cases like this, we can add support for ARCH_CFLAGS_foo.o then.

It wouldn't work actually, because such a variable would then apply to all
hweight.o targets in the tree. But another way would be:

arch/x86/Kconfig
config ARCH_HWEIGHT_CFLAGS
        string
        default "..." if X86_32
        default "..." if X86_64

lib/Makefile
CFLAGS_hweight.o = $(CONFIG_ARCH_HWEIGHT_CFLAGS)

Michal

^ permalink raw reply	[flat|nested] 81+ messages in thread

* Re: [PATCH 2/5] bitops: compile time optimization for hweight_long(CONSTANT)
  2010-02-17 17:39                                                           ` Michal Marek
@ 2010-02-18  6:19                                                             ` Borislav Petkov
  2010-02-19 14:22                                                               ` [PATCH] x86: Add optimized popcnt variants Borislav Petkov
  0 siblings, 1 reply; 81+ messages in thread
From: Borislav Petkov @ 2010-02-18  6:19 UTC (permalink / raw)
  To: Michal Marek
  Cc: Borislav Petkov, H. Peter Anvin, linux-kbuild, Peter Zijlstra,
	Andrew Morton, Wu Fengguang, LKML, Jamie Lokier, Roland Dreier,
	Al Viro, linux-fsdevel@vger.kernel.org, Ingo Molnar, Brian Gerst

On Wed, Feb 17, 2010 at 06:39:13PM +0100, Michal Marek wrote:
> It wouldn't work actually, because such variable would then apply to all
> hweight.o targets in the tree. But another way would be:
> 
> arch/x86/Kconfig
> config ARCH_HWEIGHT_CFLAGS
>         string
>         default "..." if X86_32
>         default "..." if X86_64
> 
> lib/Makefile
> CFLAGS_hweight.o = $(CONFIG_ARCH_HWEIGHT_CFLAGS)

Yep, this works, albeit with a small adjustment since
CONFIG_ARCH_HWEIGHT_CFLAGS is quoted in the Kconfig and the quotes
appear in the $(CC) call like this:

gcc -Wp,-MD,lib/.hweight.o.d  ... "-fcall-saved-ecx..."

which I fixed like this (idea reused from the make manual):

---
 source "init/Kconfig"
diff --git a/lib/Makefile b/lib/Makefile
index 3b0b4a6..e2ad17c 100644
--- a/lib/Makefile
+++ b/lib/Makefile
@@ -39,7 +39,10 @@ lib-$(CONFIG_RWSEM_XCHGADD_ALGORITHM) += rwsem.o
 lib-$(CONFIG_GENERIC_FIND_FIRST_BIT) += find_next_bit.o
 lib-$(CONFIG_GENERIC_FIND_NEXT_BIT) += find_next_bit.o
 obj-$(CONFIG_GENERIC_FIND_LAST_BIT) += find_last_bit.o
+
+CFLAGS_hweight.o = $(subst $(quote),,$(CONFIG_ARCH_HWEIGHT_CFLAGS))
 obj-$(CONFIG_GENERIC_HWEIGHT) += hweight.o
+
 obj-$(CONFIG_LOCK_KERNEL) += kernel_lock.o
 obj-$(CONFIG_DEBUG_PREEMPT) += smp_processor_id.o
 obj-$(CONFIG_DEBUG_LIST) += list_debug.o
diff --git a/scripts/Makefile.lib b/scripts/Makefile.lib
index f9bdf26..cbcd654 100644
--- a/scripts/Makefile.lib
+++ b/scripts/Makefile.lib
@@ -245,3 +245,7 @@ quiet_cmd_lzo = LZO    $@
 cmd_lzo = (cat $(filter-out FORCE,$^) | \
 	lzop -9 && $(call size_append, $(filter-out FORCE,$^))) > $@ || \
 	(rm -f $@ ; false)
+
+# misc stuff
+# ---------------------------------------------------------------------------
+quote:="



I'm open for better suggestions though.

-- 
Regards/Gruss,
    Boris.

^ permalink raw reply related	[flat|nested] 81+ messages in thread

* Re: [PATCH 2/5] bitops: compile time optimization for hweight_long(CONSTANT)
  2010-02-17 13:57                                                     ` Michal Marek
  2010-02-17 17:20                                                       ` Borislav Petkov
@ 2010-02-18 10:51                                                       ` Peter Zijlstra
  2010-02-18 11:51                                                         ` Borislav Petkov
  1 sibling, 1 reply; 81+ messages in thread
From: Peter Zijlstra @ 2010-02-18 10:51 UTC (permalink / raw)
  To: Michal Marek
  Cc: H. Peter Anvin, Borislav Petkov, linux-kbuild, Borislav Petkov,
	Andrew Morton, Wu Fengguang, LKML, Jamie Lokier, Roland Dreier,
	Al Viro, linux-fsdevel@vger.kernel.org, Ingo Molnar, Brian Gerst

On Wed, 2010-02-17 at 14:57 +0100, Michal Marek wrote:
> On 12.2.2010 20:05, H. Peter Anvin wrote:
> > On 02/12/2010 09:47 AM, Borislav Petkov wrote:
> >>
> >> However, this is generic code and for the above to work we have to
> >> enforce x86-specific CFLAGS for it. What is the preferred way to do
> >> that?
> >>
> > 
> > That's a question for Michal and the kbuild list.  Michal?
> 
> (I was offline last week).
> 
> The _preferred_ way probably is not to do it :), but otherwise you can
> set CFLAGS_hweight.o depending on CONFIG_X86(_32|_64), just like you do
> in arch/x86/lib/Makefile already.

I guess one way to achieve that is to create an arch/x86/lib/hweight.c
that includes lib/hweight.c, give the x86 one special compile flags,
and not build the lib one.


^ permalink raw reply	[flat|nested] 81+ messages in thread

* Re: [PATCH 2/5] bitops: compile time optimization for hweight_long(CONSTANT)
  2010-02-18 10:51                                                       ` [PATCH 2/5] bitops: compile time optimization for hweight_long(CONSTANT) Peter Zijlstra
@ 2010-02-18 11:51                                                         ` Borislav Petkov
  0 siblings, 0 replies; 81+ messages in thread
From: Borislav Petkov @ 2010-02-18 11:51 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Michal Marek, H. Peter Anvin, linux-kbuild, Borislav Petkov,
	Andrew Morton, Wu Fengguang, LKML, Jamie Lokier, Roland Dreier,
	Al Viro, linux-fsdevel@vger.kernel.org, Ingo Molnar, Brian Gerst

On Thu, Feb 18, 2010 at 11:51:50AM +0100, Peter Zijlstra wrote:
> I guess one way to achieve that is to create a arch/x86/lib/hweight.c
> that includes lib/hweight.c and give the x86 one special compile flags
> and not build the lib on.

That's what I thought initially too, but that won't fly because the
lib/hweight.c helpers have to be inlined into arch/x86/lib/hweight.c so
that gcc can take care of the clobbered registers. Otherwise, it's just a
"call __sw_hweightXX" that gets issued into asm.


-- 
Regards/Gruss,
Boris.

-
Advanced Micro Devices, Inc.
Operating Systems Research Center

^ permalink raw reply	[flat|nested] 81+ messages in thread

* [PATCH] x86: Add optimized popcnt variants
  2010-02-18  6:19                                                             ` Borislav Petkov
@ 2010-02-19 14:22                                                               ` Borislav Petkov
  2010-02-19 16:06                                                                 ` H. Peter Anvin
  0 siblings, 1 reply; 81+ messages in thread
From: Borislav Petkov @ 2010-02-19 14:22 UTC (permalink / raw)
  To: H. Peter Anvin
  Cc: Michal Marek, Borislav Petkov, linux-kbuild, Peter Zijlstra,
	Andrew Morton, Wu Fengguang, LKML, Jamie Lokier, Roland Dreier,
	Al Viro, linux-fsdevel@vger.kernel.org, Ingo Molnar, Brian Gerst

From: Borislav Petkov <borislav.petkov@amd.com>

Add support for the hardware version of the Hamming weight function,
popcnt, present in CPUs which advertise it under CPUID, Function
0x0000_0001_ECX[23]. On CPUs which don't support it, we fall back to the
default lib/hweight.c sw versions.

A synthetic benchmark comparing popcnt with __sw_hweight64 showed almost
a 3x speedup on an F10h machine.
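
(As a side note, not part of the patch: the feature test described above
can also be probed from user space. A minimal sketch, assuming a GCC-style
compiler on x86 and its <cpuid.h> helper:)

#include <stdio.h>
#include <cpuid.h>

int main(void)
{
	unsigned int eax, ebx, ecx, edx;

	/* CPUID leaf 0x00000001: POPCNT support is reported in ECX bit 23 */
	if (__get_cpuid(1, &eax, &ebx, &ecx, &edx) && (ecx & (1u << 23)))
		puts("popcnt supported");
	else
		puts("popcnt not supported");

	return 0;
}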

Signed-off-by: Borislav Petkov <borislav.petkov@amd.com>
---
 arch/x86/Kconfig                          |    5 ++
 arch/x86/include/asm/bitops.h             |    7 +++-
 arch/x86/lib/Makefile                     |    2 +-
 arch/x86/lib/hweight.c                    |   62 +++++++++++++++++++++++++++++
 include/asm-generic/bitops/arch_hweight.h |   22 ++++++++--
 lib/Makefile                              |    3 +
 lib/hweight.c                             |   20 +++++-----
 scripts/Makefile.lib                      |    4 ++
 8 files changed, 109 insertions(+), 16 deletions(-)
 create mode 100644 arch/x86/lib/hweight.c

diff --git a/arch/x86/Kconfig b/arch/x86/Kconfig
index eb40925..176950e 100644
--- a/arch/x86/Kconfig
+++ b/arch/x86/Kconfig
@@ -230,6 +230,11 @@ config X86_32_LAZY_GS
 	def_bool y
 	depends on X86_32 && !CC_STACKPROTECTOR
 
+config ARCH_HWEIGHT_CFLAGS
+	string
+	default "-fcall-saved-ecx -fcall-saved-edx" if X86_32
+	default "-fcall-saved-rsi -fcall-saved-rdx -fcall-saved-rcx -fcall-saved-r8 -fcall-saved-r9 -fcall-saved-r10 -fcall-saved-r11" if X86_64
+
 config KTIME_SCALAR
 	def_bool X86_32
 source "init/Kconfig"
diff --git a/arch/x86/include/asm/bitops.h b/arch/x86/include/asm/bitops.h
index 02b47a6..5ec3bd8 100644
--- a/arch/x86/include/asm/bitops.h
+++ b/arch/x86/include/asm/bitops.h
@@ -444,7 +444,12 @@ static inline int fls(int x)
 
 #define ARCH_HAS_FAST_MULTIPLIER 1
 
-#include <asm-generic/bitops/hweight.h>
+extern unsigned int __arch_hweight32(unsigned int w);
+extern unsigned int __arch_hweight16(unsigned int w);
+extern unsigned int __arch_hweight8(unsigned int w);
+extern unsigned long __arch_hweight64(__u64 w);
+
+#include <asm-generic/bitops/const_hweight.h>
 
 #endif /* __KERNEL__ */
 
diff --git a/arch/x86/lib/Makefile b/arch/x86/lib/Makefile
index cffd754..e811bbd 100644
--- a/arch/x86/lib/Makefile
+++ b/arch/x86/lib/Makefile
@@ -22,7 +22,7 @@ lib-y += usercopy_$(BITS).o getuser.o putuser.o
 lib-y += memcpy_$(BITS).o
 lib-$(CONFIG_KPROBES) += insn.o inat.o
 
-obj-y += msr.o msr-reg.o msr-reg-export.o
+obj-y += msr.o msr-reg.o msr-reg-export.o hweight.o
 
 ifeq ($(CONFIG_X86_32),y)
         obj-y += atomic64_32.o
diff --git a/arch/x86/lib/hweight.c b/arch/x86/lib/hweight.c
new file mode 100644
index 0000000..54d3cb0
--- /dev/null
+++ b/arch/x86/lib/hweight.c
@@ -0,0 +1,62 @@
+#include <linux/kernel.h>
+#include <linux/module.h>
+#include <linux/bitops.h>
+
+#ifdef CONFIG_64BIT
+/* popcnt %rdi, %rax */
+#define POPCNT ".byte 0xf3\n\t.byte 0x48\n\t.byte 0x0f\n\t.byte 0xb8\n\t.byte 0xc7"
+#define REG_IN "D"
+#define REG_OUT "a"
+#else
+/* popcnt %eax, %eax */
+#define POPCNT ".byte 0xf3\n\t.byte 0x0f\n\t.byte 0xb8\n\t.byte 0xc0"
+#define REG_IN "a"
+#define REG_OUT "a"
+#endif
+
+/*
+ * __sw_hweightXX are called from within the alternatives below
+ * and callee-clobbered registers need to be taken care of. See
+ * ARCH_HWEIGHT_CFLAGS in <arch/x86/Kconfig> for the respective
+ * compiler switches.
+ */
+unsigned int __arch_hweight32(unsigned int w)
+{
+	unsigned int res = 0;
+
+	asm (ALTERNATIVE("call __sw_hweight32", POPCNT, X86_FEATURE_POPCNT)
+		     : "="REG_OUT (res)
+		     : REG_IN (w));
+
+	return res;
+}
+EXPORT_SYMBOL(__arch_hweight32);
+
+unsigned int __arch_hweight16(unsigned int w)
+{
+	return __arch_hweight32(w & 0xffff);
+}
+EXPORT_SYMBOL(__arch_hweight16);
+
+unsigned int __arch_hweight8(unsigned int w)
+{
+	return __arch_hweight32(w & 0xff);
+}
+EXPORT_SYMBOL(__arch_hweight8);
+
+unsigned long __arch_hweight64(__u64 w)
+{
+	unsigned long res = 0;
+
+#ifdef CONFIG_X86_32
+	return  __arch_hweight32((u32)w) +
+		__arch_hweight32((u32)(w >> 32));
+#else
+	asm (ALTERNATIVE("call __sw_hweight64", POPCNT, X86_FEATURE_POPCNT)
+		     : "="REG_OUT (res)
+		     : REG_IN (w));
+#endif /* CONFIG_X86_32 */
+
+	return res;
+}
+EXPORT_SYMBOL(__arch_hweight64);
diff --git a/include/asm-generic/bitops/arch_hweight.h b/include/asm-generic/bitops/arch_hweight.h
index 3a7be84..1c82306 100644
--- a/include/asm-generic/bitops/arch_hweight.h
+++ b/include/asm-generic/bitops/arch_hweight.h
@@ -3,9 +3,23 @@
 
 #include <asm/types.h>
 
-extern unsigned int __arch_hweight32(unsigned int w);
-extern unsigned int __arch_hweight16(unsigned int w);
-extern unsigned int __arch_hweight8(unsigned int w);
-extern unsigned long __arch_hweight64(__u64 w);
+unsigned int __arch_hweight32(unsigned int w)
+{
+	return __sw_hweight32(w);
+}
 
+unsigned int __arch_hweight16(unsigned int w)
+{
+	return __sw_hweight16(w);
+}
+
+unsigned int __arch_hweight8(unsigned int w)
+{
+	return __sw_hweight8(w);
+}
+
+unsigned long __arch_hweight64(__u64 w)
+{
+	return __sw_hweight64(w);
+}
 #endif /* _ASM_GENERIC_BITOPS_HWEIGHT_H_ */
diff --git a/lib/Makefile b/lib/Makefile
index 3b0b4a6..e2ad17c 100644
--- a/lib/Makefile
+++ b/lib/Makefile
@@ -39,7 +39,10 @@ lib-$(CONFIG_RWSEM_XCHGADD_ALGORITHM) += rwsem.o
 lib-$(CONFIG_GENERIC_FIND_FIRST_BIT) += find_next_bit.o
 lib-$(CONFIG_GENERIC_FIND_NEXT_BIT) += find_next_bit.o
 obj-$(CONFIG_GENERIC_FIND_LAST_BIT) += find_last_bit.o
+
+CFLAGS_hweight.o = $(subst $(quote),,$(CONFIG_ARCH_HWEIGHT_CFLAGS))
 obj-$(CONFIG_GENERIC_HWEIGHT) += hweight.o
+
 obj-$(CONFIG_LOCK_KERNEL) += kernel_lock.o
 obj-$(CONFIG_DEBUG_PREEMPT) += smp_processor_id.o
 obj-$(CONFIG_DEBUG_LIST) += list_debug.o
diff --git a/lib/hweight.c b/lib/hweight.c
index 9ff86df..f9ce440 100644
--- a/lib/hweight.c
+++ b/lib/hweight.c
@@ -9,7 +9,7 @@
  * The Hamming Weight of a number is the total number of bits set in it.
  */
 
-unsigned int __arch_hweight32(unsigned int w)
+unsigned int __sw_hweight32(unsigned int w)
 {
 	unsigned int res = w - ((w >> 1) & 0x55555555);
 	res = (res & 0x33333333) + ((res >> 2) & 0x33333333);
@@ -17,30 +17,30 @@ unsigned int __arch_hweight32(unsigned int w)
 	res = res + (res >> 8);
 	return (res + (res >> 16)) & 0x000000FF;
 }
-EXPORT_SYMBOL(__arch_hweight32);
+EXPORT_SYMBOL(__sw_hweight32);
 
-unsigned int __arch_hweight16(unsigned int w)
+unsigned int __sw_hweight16(unsigned int w)
 {
 	unsigned int res = w - ((w >> 1) & 0x5555);
 	res = (res & 0x3333) + ((res >> 2) & 0x3333);
 	res = (res + (res >> 4)) & 0x0F0F;
 	return (res + (res >> 8)) & 0x00FF;
 }
-EXPORT_SYMBOL(__arch_hweight16);
+EXPORT_SYMBOL(__sw_hweight16);
 
-unsigned int __arch_hweight8(unsigned int w)
+unsigned int __sw_hweight8(unsigned int w)
 {
 	unsigned int res = w - ((w >> 1) & 0x55);
 	res = (res & 0x33) + ((res >> 2) & 0x33);
 	return (res + (res >> 4)) & 0x0F;
 }
-EXPORT_SYMBOL(__arch_hweight8);
+EXPORT_SYMBOL(__sw_hweight8);
 
-unsigned long __arch_hweight64(__u64 w)
+unsigned long __sw_hweight64(__u64 w)
 {
 #if BITS_PER_LONG == 32
-	return __arch_hweight32((unsigned int)(w >> 32)) +
-	       __arch_hweight32((unsigned int)w);
+	return __sw_hweight32((unsigned int)(w >> 32)) +
+	       __sw_hweight32((unsigned int)w);
 #elif BITS_PER_LONG == 64
 #ifdef ARCH_HAS_FAST_MULTIPLIER
 	w -= (w >> 1) & 0x5555555555555555ul;
@@ -57,4 +57,4 @@ unsigned long __arch_hweight64(__u64 w)
 #endif
 #endif
 }
-EXPORT_SYMBOL(__arch_hweight64);
+EXPORT_SYMBOL(__sw_hweight64);
diff --git a/scripts/Makefile.lib b/scripts/Makefile.lib
index f9bdf26..cbcd654 100644
--- a/scripts/Makefile.lib
+++ b/scripts/Makefile.lib
@@ -245,3 +245,7 @@ quiet_cmd_lzo = LZO    $@
 cmd_lzo = (cat $(filter-out FORCE,$^) | \
 	lzop -9 && $(call size_append, $(filter-out FORCE,$^))) > $@ || \
 	(rm -f $@ ; false)
+
+# misc stuff
+# ---------------------------------------------------------------------------
+quote:="
-- 
1.6.5.4

-- 
Regards/Gruss,
Boris.

-
Advanced Micro Devices, Inc.
Operating Systems Research Center

^ permalink raw reply related	[flat|nested] 81+ messages in thread

* Re: [PATCH] x86: Add optimized popcnt variants
  2010-02-19 14:22                                                               ` [PATCH] x86: Add optimized popcnt variants Borislav Petkov
@ 2010-02-19 16:06                                                                 ` H. Peter Anvin
  2010-02-19 16:45                                                                   ` Borislav Petkov
  0 siblings, 1 reply; 81+ messages in thread
From: H. Peter Anvin @ 2010-02-19 16:06 UTC (permalink / raw)
  To: Borislav Petkov
  Cc: Michal Marek, linux-kbuild, Peter Zijlstra, Andrew Morton,
	Wu Fengguang, LKML, Jamie Lokier, Roland Dreier, Al Viro,
	linux-fsdevel@vger.kernel.org, Ingo Molnar, Brian Gerst

On 02/19/2010 06:22 AM, Borislav Petkov wrote:
> --- /dev/null
> +++ b/arch/x86/lib/hweight.c
> @@ -0,0 +1,62 @@
> +#include <linux/kernel.h>
> +#include <linux/module.h>
> +#include <linux/bitops.h>
> +
> +#ifdef CONFIG_64BIT
> +/* popcnt %rdi, %rax */
> +#define POPCNT ".byte 0xf3\n\t.byte 0x48\n\t.byte 0x0f\n\t.byte 0xb8\n\t.byte 0xc7"
> +#define REG_IN "D"
> +#define REG_OUT "a"
> +#else
> +/* popcnt %eax, %eax */
> +#define POPCNT ".byte 0xf3\n\t.byte 0x0f\n\t.byte 0xb8\n\t.byte 0xc0"
> +#define REG_IN "a"
> +#define REG_OUT "a"
> +#endif
> +
> +/*
> + * __sw_hweightXX are called from within the alternatives below
> + * and callee-clobbered registers need to be taken care of. See
> + * ARCH_HWEIGHT_CFLAGS in <arch/x86/Kconfig> for the respective
> + * compiler switches.
> + */
> +unsigned int __arch_hweight32(unsigned int w)
> +{
> +	unsigned int res = 0;
> +
> +	asm (ALTERNATIVE("call __sw_hweight32", POPCNT, X86_FEATURE_POPCNT)
> +		     : "="REG_OUT (res)
> +		     : REG_IN (w));
> +
> +	return res;
> +}
> +EXPORT_SYMBOL(__arch_hweight32);
> +
> +unsigned int __arch_hweight16(unsigned int w)
> +{
> +	return __arch_hweight32(w & 0xffff);
> +}
> +EXPORT_SYMBOL(__arch_hweight16);
> +
> +unsigned int __arch_hweight8(unsigned int w)
> +{
> +	return __arch_hweight32(w & 0xff);
> +}
> +EXPORT_SYMBOL(__arch_hweight8);
> +
> +unsigned long __arch_hweight64(__u64 w)
> +{
> +	unsigned long res = 0;
> +
> +#ifdef CONFIG_X86_32
> +	return  __arch_hweight32((u32)w) +
> +		__arch_hweight32((u32)(w >> 32));
> +#else
> +	asm (ALTERNATIVE("call __sw_hweight64", POPCNT, X86_FEATURE_POPCNT)
> +		     : "="REG_OUT (res)
> +		     : REG_IN (w));
> +#endif /* CONFIG_X86_32 */
> +
> +	return res;
> +}

You're still not inlining these.  They should be: there is absolutely no
reason for code size to not inline them anymore.

> diff --git a/include/asm-generic/bitops/arch_hweight.h b/include/asm-generic/bitops/arch_hweight.h
> index 3a7be84..1c82306 100644
> --- a/include/asm-generic/bitops/arch_hweight.h
> +++ b/include/asm-generic/bitops/arch_hweight.h
> @@ -3,9 +3,23 @@
>  
>  #include <asm/types.h>
>  
> -extern unsigned int __arch_hweight32(unsigned int w);
> -extern unsigned int __arch_hweight16(unsigned int w);
> -extern unsigned int __arch_hweight8(unsigned int w);
> -extern unsigned long __arch_hweight64(__u64 w);
> +unsigned int __arch_hweight32(unsigned int w)
> +{
> +	return __sw_hweight32(w);
> +}
>  
> +unsigned int __arch_hweight16(unsigned int w)
> +{
> +	return __sw_hweight16(w);
> +}
> +
> +unsigned int __arch_hweight8(unsigned int w)
> +{
> +	return __sw_hweight8(w);
> +}
> +
> +unsigned long __arch_hweight64(__u64 w)
> +{
> +	return __sw_hweight64(w);
> +}
>  #endif /* _ASM_GENERIC_BITOPS_HWEIGHT_H_ */

and these are in a header file and *definitely* should be inlines.

	-hpa
-- 
H. Peter Anvin, Intel Open Source Technology Center
I work for Intel.  I don't speak on their behalf.


^ permalink raw reply	[flat|nested] 81+ messages in thread

* Re: [PATCH] x86: Add optimized popcnt variants
  2010-02-19 16:06                                                                 ` H. Peter Anvin
@ 2010-02-19 16:45                                                                   ` Borislav Petkov
  2010-02-19 16:53                                                                     ` H. Peter Anvin
  0 siblings, 1 reply; 81+ messages in thread
From: Borislav Petkov @ 2010-02-19 16:45 UTC (permalink / raw)
  To: H. Peter Anvin
  Cc: Michal Marek, linux-kbuild, Peter Zijlstra, Andrew Morton,
	Wu Fengguang, LKML, Jamie Lokier, Roland Dreier, Al Viro,
	linux-fsdevel@vger.kernel.org, Ingo Molnar, Brian Gerst

From: "H. Peter Anvin" <hpa@zytor.com>
Date: Fri, Feb 19, 2010 at 08:06:07AM -0800

<snip>

> > +unsigned long __arch_hweight64(__u64 w)
> > +{
> > +	unsigned long res = 0;
> > +
> > +#ifdef CONFIG_X86_32
> > +	return  __arch_hweight32((u32)w) +
> > +		__arch_hweight32((u32)(w >> 32));
> > +#else
> > +	asm (ALTERNATIVE("call __sw_hweight64", POPCNT, X86_FEATURE_POPCNT)
> > +		     : "="REG_OUT (res)
> > +		     : REG_IN (w));
> > +#endif /* CONFIG_X86_32 */
> > +
> > +	return res;
> > +}
> 
> You're still not inlining these.  They should be: there is absolutely no
> reason for code size to not inline them anymore.

Isn't it better to have only those 4 locations for apply_alternatives to
patch wrt popcnt, instead of sprinkling alternatives sections around
the kernel at every callsite of hweight and its users? Or is the aim to
optimize even that "call __arch_hweightXX" away?

> > +unsigned long __arch_hweight64(__u64 w)
> > +{
> > +	return __sw_hweight64(w);
> > +}
> >  #endif /* _ASM_GENERIC_BITOPS_HWEIGHT_H_ */
> 
> and these are in a header file and *definitely* should be inlines.

Yep, done.

-- 
Regards/Gruss,
Boris.

-
Advanced Micro Devices, Inc.
Operating Systems Research Center

^ permalink raw reply	[flat|nested] 81+ messages in thread

* Re: [PATCH] x86: Add optimized popcnt variants
  2010-02-19 16:45                                                                   ` Borislav Petkov
@ 2010-02-19 16:53                                                                     ` H. Peter Anvin
  2010-02-22 14:17                                                                       ` Borislav Petkov
  0 siblings, 1 reply; 81+ messages in thread
From: H. Peter Anvin @ 2010-02-19 16:53 UTC (permalink / raw)
  To: Borislav Petkov
  Cc: Michal Marek, linux-kbuild, Peter Zijlstra, Andrew Morton,
	Wu Fengguang, LKML, Jamie Lokier, Roland Dreier, Al Viro,
	linux-fsdevel@vger.kernel.org, Ingo Molnar, Brian Gerst

On 02/19/2010 08:45 AM, Borislav Petkov wrote:
>>
>> You're still not inlining these.  They should be: there is absolutely no
>> reason for code size to not inline them anymore.
> 
> Isn't it better to have only those 4 locations for apply_alternatives to
> patch wrt popcnt, instead of sprinkling alternatives sections around
> the kernel at every callsite of hweight and its users? Or is the aim to
> optimize even that "call __arch_hweightXX" away?
> 

That's the idea, yes.  We use inline alternatives in quite a few other
places.

	-hpa

-- 
H. Peter Anvin, Intel Open Source Technology Center
I work for Intel.  I don't speak on their behalf.


^ permalink raw reply	[flat|nested] 81+ messages in thread

* Re: [PATCH] x86: Add optimized popcnt variants
  2010-02-19 16:53                                                                     ` H. Peter Anvin
@ 2010-02-22 14:17                                                                       ` Borislav Petkov
  2010-02-22 17:21                                                                         ` H. Peter Anvin
  0 siblings, 1 reply; 81+ messages in thread
From: Borislav Petkov @ 2010-02-22 14:17 UTC (permalink / raw)
  To: H. Peter Anvin
  Cc: Borislav Petkov, Michal Marek, linux-kbuild, Peter Zijlstra,
	Andrew Morton, Wu Fengguang, LKML, Jamie Lokier, Roland Dreier,
	Al Viro, linux-fsdevel@vger.kernel.org, Ingo Molnar, Brian Gerst

From: "H. Peter Anvin" <hpa@zytor.com>
Date: Fri, Feb 19, 2010 at 08:53:32AM -0800

> That's the idea, yes.  We use inline alternatives in quite a few other
> places.

Ok, inlining results in circa 100+ replacements here both on 32- and
64-bit. Here we go:

--
From: Borislav Petkov <borislav.petkov@amd.com>
Date: Thu, 11 Feb 2010 00:48:31 +0100
Subject: [PATCH] x86: Add optimized popcnt variants

Add support for the hardware version of the Hamming weight function,
popcnt, present in CPUs which advertise it under CPUID, Function
0x0000_0001_ECX[23]. On CPUs which don't support it, we fall back to the
default lib/hweight.c sw versions.

A synthetic benchmark comparing popcnt with __sw_hweight64 showed almost
a 3x speedup on an F10h machine.

Signed-off-by: Borislav Petkov <borislav.petkov@amd.com>
---
 arch/x86/Kconfig                          |    5 ++
 arch/x86/include/asm/alternative.h        |    9 +++-
 arch/x86/include/asm/arch_hweight.h       |   59 +++++++++++++++++++++++++++++
 arch/x86/include/asm/bitops.h             |    4 +-
 include/asm-generic/bitops/arch_hweight.h |   22 +++++++++--
 lib/Makefile                              |    3 +
 lib/hweight.c                             |   20 +++++-----
 scripts/Makefile.lib                      |    4 ++
 8 files changed, 108 insertions(+), 18 deletions(-)
 create mode 100644 arch/x86/include/asm/arch_hweight.h

diff --git a/arch/x86/Kconfig b/arch/x86/Kconfig
index eb40925..176950e 100644
--- a/arch/x86/Kconfig
+++ b/arch/x86/Kconfig
@@ -230,6 +230,11 @@ config X86_32_LAZY_GS
 	def_bool y
 	depends on X86_32 && !CC_STACKPROTECTOR
 
+config ARCH_HWEIGHT_CFLAGS
+	string
+	default "-fcall-saved-ecx -fcall-saved-edx" if X86_32
+	default "-fcall-saved-rsi -fcall-saved-rdx -fcall-saved-rcx -fcall-saved-r8 -fcall-saved-r9 -fcall-saved-r10 -fcall-saved-r11" if X86_64
+
 config KTIME_SCALAR
 	def_bool X86_32
 source "init/Kconfig"
diff --git a/arch/x86/include/asm/alternative.h b/arch/x86/include/asm/alternative.h
index 69b74a7..0720c96 100644
--- a/arch/x86/include/asm/alternative.h
+++ b/arch/x86/include/asm/alternative.h
@@ -39,9 +39,6 @@
 #define LOCK_PREFIX ""
 #endif
 
-/* This must be included *after* the definition of LOCK_PREFIX */
-#include <asm/cpufeature.h>
-
 struct alt_instr {
 	u8 *instr;		/* original instruction */
 	u8 *replacement;
@@ -91,6 +88,12 @@ static inline void alternatives_smp_switch(int smp) {}
       ".previous"
 
 /*
+ * This must be included *after* the definition of ALTERNATIVE due to
+ * <asm/arch_hweight.h>
+ */
+#include <asm/cpufeature.h>
+
+/*
  * Alternative instructions for different CPU types or capabilities.
  *
  * This allows to use optimized instructions even on generic binary
diff --git a/arch/x86/include/asm/arch_hweight.h b/arch/x86/include/asm/arch_hweight.h
new file mode 100644
index 0000000..f79b733
--- /dev/null
+++ b/arch/x86/include/asm/arch_hweight.h
@@ -0,0 +1,59 @@
+#ifndef _ASM_X86_HWEIGHT_H
+#define _ASM_X86_HWEIGHT_H
+
+#ifdef CONFIG_64BIT
+/* popcnt %rdi, %rax */
+#define POPCNT ".byte 0xf3\n\t.byte 0x48\n\t.byte 0x0f\n\t.byte 0xb8\n\t.byte 0xc7"
+#define REG_IN "D"
+#define REG_OUT "a"
+#else
+/* popcnt %eax, %eax */
+#define POPCNT ".byte 0xf3\n\t.byte 0x0f\n\t.byte 0xb8\n\t.byte 0xc0"
+#define REG_IN "a"
+#define REG_OUT "a"
+#endif
+
+/*
+ * __sw_hweightXX are called from within the alternatives below
+ * and callee-clobbered registers need to be taken care of. See
+ * ARCH_HWEIGHT_CFLAGS in <arch/x86/Kconfig> for the respective
+ * compiler switches.
+ */
+static inline unsigned int __arch_hweight32(unsigned int w)
+{
+	unsigned int res = 0;
+
+	asm (ALTERNATIVE("call __sw_hweight32", POPCNT, X86_FEATURE_POPCNT)
+		     : "="REG_OUT (res)
+		     : REG_IN (w));
+
+	return res;
+}
+
+static inline unsigned int __arch_hweight16(unsigned int w)
+{
+	return __arch_hweight32(w & 0xffff);
+}
+
+static inline unsigned int __arch_hweight8(unsigned int w)
+{
+	return __arch_hweight32(w & 0xff);
+}
+
+static inline unsigned long __arch_hweight64(__u64 w)
+{
+	unsigned long res = 0;
+
+#ifdef CONFIG_X86_32
+	return  __arch_hweight32((u32)w) +
+		__arch_hweight32((u32)(w >> 32));
+#else
+	asm (ALTERNATIVE("call __sw_hweight64", POPCNT, X86_FEATURE_POPCNT)
+		     : "="REG_OUT (res)
+		     : REG_IN (w));
+#endif /* CONFIG_X86_32 */
+
+	return res;
+}
+
+#endif
diff --git a/arch/x86/include/asm/bitops.h b/arch/x86/include/asm/bitops.h
index 02b47a6..545776e 100644
--- a/arch/x86/include/asm/bitops.h
+++ b/arch/x86/include/asm/bitops.h
@@ -444,7 +444,9 @@ static inline int fls(int x)
 
 #define ARCH_HAS_FAST_MULTIPLIER 1
 
-#include <asm-generic/bitops/hweight.h>
+#include <asm/arch_hweight.h>
+
+#include <asm-generic/bitops/const_hweight.h>
 
 #endif /* __KERNEL__ */
 
diff --git a/include/asm-generic/bitops/arch_hweight.h b/include/asm-generic/bitops/arch_hweight.h
index 3a7be84..9a81c1e 100644
--- a/include/asm-generic/bitops/arch_hweight.h
+++ b/include/asm-generic/bitops/arch_hweight.h
@@ -3,9 +3,23 @@
 
 #include <asm/types.h>
 
-extern unsigned int __arch_hweight32(unsigned int w);
-extern unsigned int __arch_hweight16(unsigned int w);
-extern unsigned int __arch_hweight8(unsigned int w);
-extern unsigned long __arch_hweight64(__u64 w);
+inline unsigned int __arch_hweight32(unsigned int w)
+{
+	return __sw_hweight32(w);
+}
 
+inline unsigned int __arch_hweight16(unsigned int w)
+{
+	return __sw_hweight16(w);
+}
+
+inline unsigned int __arch_hweight8(unsigned int w)
+{
+	return __sw_hweight8(w);
+}
+
+inline unsigned long __arch_hweight64(__u64 w)
+{
+	return __sw_hweight64(w);
+}
 #endif /* _ASM_GENERIC_BITOPS_HWEIGHT_H_ */
diff --git a/lib/Makefile b/lib/Makefile
index 3b0b4a6..e2ad17c 100644
--- a/lib/Makefile
+++ b/lib/Makefile
@@ -39,7 +39,10 @@ lib-$(CONFIG_RWSEM_XCHGADD_ALGORITHM) += rwsem.o
 lib-$(CONFIG_GENERIC_FIND_FIRST_BIT) += find_next_bit.o
 lib-$(CONFIG_GENERIC_FIND_NEXT_BIT) += find_next_bit.o
 obj-$(CONFIG_GENERIC_FIND_LAST_BIT) += find_last_bit.o
+
+CFLAGS_hweight.o = $(subst $(quote),,$(CONFIG_ARCH_HWEIGHT_CFLAGS))
 obj-$(CONFIG_GENERIC_HWEIGHT) += hweight.o
+
 obj-$(CONFIG_LOCK_KERNEL) += kernel_lock.o
 obj-$(CONFIG_DEBUG_PREEMPT) += smp_processor_id.o
 obj-$(CONFIG_DEBUG_LIST) += list_debug.o
diff --git a/lib/hweight.c b/lib/hweight.c
index 9ff86df..f9ce440 100644
--- a/lib/hweight.c
+++ b/lib/hweight.c
@@ -9,7 +9,7 @@
  * The Hamming Weight of a number is the total number of bits set in it.
  */
 
-unsigned int __arch_hweight32(unsigned int w)
+unsigned int __sw_hweight32(unsigned int w)
 {
 	unsigned int res = w - ((w >> 1) & 0x55555555);
 	res = (res & 0x33333333) + ((res >> 2) & 0x33333333);
@@ -17,30 +17,30 @@ unsigned int __arch_hweight32(unsigned int w)
 	res = res + (res >> 8);
 	return (res + (res >> 16)) & 0x000000FF;
 }
-EXPORT_SYMBOL(__arch_hweight32);
+EXPORT_SYMBOL(__sw_hweight32);
 
-unsigned int __arch_hweight16(unsigned int w)
+unsigned int __sw_hweight16(unsigned int w)
 {
 	unsigned int res = w - ((w >> 1) & 0x5555);
 	res = (res & 0x3333) + ((res >> 2) & 0x3333);
 	res = (res + (res >> 4)) & 0x0F0F;
 	return (res + (res >> 8)) & 0x00FF;
 }
-EXPORT_SYMBOL(__arch_hweight16);
+EXPORT_SYMBOL(__sw_hweight16);
 
-unsigned int __arch_hweight8(unsigned int w)
+unsigned int __sw_hweight8(unsigned int w)
 {
 	unsigned int res = w - ((w >> 1) & 0x55);
 	res = (res & 0x33) + ((res >> 2) & 0x33);
 	return (res + (res >> 4)) & 0x0F;
 }
-EXPORT_SYMBOL(__arch_hweight8);
+EXPORT_SYMBOL(__sw_hweight8);
 
-unsigned long __arch_hweight64(__u64 w)
+unsigned long __sw_hweight64(__u64 w)
 {
 #if BITS_PER_LONG == 32
-	return __arch_hweight32((unsigned int)(w >> 32)) +
-	       __arch_hweight32((unsigned int)w);
+	return __sw_hweight32((unsigned int)(w >> 32)) +
+	       __sw_hweight32((unsigned int)w);
 #elif BITS_PER_LONG == 64
 #ifdef ARCH_HAS_FAST_MULTIPLIER
 	w -= (w >> 1) & 0x5555555555555555ul;
@@ -57,4 +57,4 @@ unsigned long __arch_hweight64(__u64 w)
 #endif
 #endif
 }
-EXPORT_SYMBOL(__arch_hweight64);
+EXPORT_SYMBOL(__sw_hweight64);
diff --git a/scripts/Makefile.lib b/scripts/Makefile.lib
index f9bdf26..cbcd654 100644
--- a/scripts/Makefile.lib
+++ b/scripts/Makefile.lib
@@ -245,3 +245,7 @@ quiet_cmd_lzo = LZO    $@
 cmd_lzo = (cat $(filter-out FORCE,$^) | \
 	lzop -9 && $(call size_append, $(filter-out FORCE,$^))) > $@ || \
 	(rm -f $@ ; false)
+
+# misc stuff
+# ---------------------------------------------------------------------------
+quote:="
-- 
1.6.4.2


-- 
Regards/Gruss,
Boris.

-
Advanced Micro Devices, Inc.
Operating Systems Research Center

^ permalink raw reply related	[flat|nested] 81+ messages in thread

* Re: [PATCH] x86: Add optimized popcnt variants
  2010-02-22 14:17                                                                       ` Borislav Petkov
@ 2010-02-22 17:21                                                                         ` H. Peter Anvin
  2010-02-22 18:49                                                                           ` Borislav Petkov
  0 siblings, 1 reply; 81+ messages in thread
From: H. Peter Anvin @ 2010-02-22 17:21 UTC (permalink / raw)
  To: Borislav Petkov
  Cc: Michal Marek, linux-kbuild, Peter Zijlstra, Andrew Morton,
	Wu Fengguang, LKML, Jamie Lokier, Roland Dreier, Al Viro,
	linux-fsdevel@vger.kernel.org, Ingo Molnar, Brian Gerst

On 02/22/2010 06:17 AM, Borislav Petkov wrote:
>  
> +config ARCH_HWEIGHT_CFLAGS
> +	string
> +	default "-fcall-saved-rsi -fcall-saved-rdx -fcall-saved-rcx -fcall-saved-r8 -fcall-saved-r9 -fcall-saved-r10 -fcall-saved-r11" if X86_64
> +

[...]

> +
> +#ifdef CONFIG_64BIT
> +/* popcnt %rdi, %rax */
> +#define POPCNT ".byte 0xf3\n\t.byte 0x48\n\t.byte 0x0f\n\t.byte 0xb8\n\t.byte 0xc7"
> +#define REG_IN "D"
> +#define REG_OUT "a"
> +#else

Just a note: this still means rdi is clobbered on x86-64, which is
probably fine, but needs to be recorded as such.  Since gcc doesn't
support clobbers for registers used as operands (sigh), you have to
create a dummy output and assign it a "=D" constraint.

I don't know if gcc would handle -fcall-saved-rdi here... and if so, how
reliably.
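
(The general shape of that idiom, sketched on a hypothetical wrapper; a
later revision in this thread adds exactly this. The only point here is
the extra "=D" output that marks %rdi as not surviving the asm:)

/*
 * Sketch: w goes in via %rdi, the result comes back in %rax, and the
 * throwaway "dummy" output tied to %rdi tells gcc that %rdi's value is
 * destroyed.  Assumes, as in the patch, that the callee is built with
 * the -fcall-saved-* switches for the remaining caller-saved registers.
 */
static inline unsigned long hweight64_sketch(unsigned long w)
{
	unsigned long res, dummy;

	asm ("call __sw_hweight64"
	     : "=a" (res), "=D" (dummy)
	     : "D" (w));

	return res;
}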

	-hpa

-- 
H. Peter Anvin, Intel Open Source Technology Center
I work for Intel.  I don't speak on their behalf.


^ permalink raw reply	[flat|nested] 81+ messages in thread

* Re: [PATCH] x86: Add optimized popcnt variants
  2010-02-22 17:21                                                                         ` H. Peter Anvin
@ 2010-02-22 18:49                                                                           ` Borislav Petkov
  2010-02-22 19:55                                                                             ` H. Peter Anvin
  0 siblings, 1 reply; 81+ messages in thread
From: Borislav Petkov @ 2010-02-22 18:49 UTC (permalink / raw)
  To: H. Peter Anvin
  Cc: Borislav Petkov, Michal Marek, linux-kbuild, Peter Zijlstra,
	Andrew Morton, Wu Fengguang, LKML, Jamie Lokier, Roland Dreier,
	Al Viro, linux-fsdevel@vger.kernel.org, Ingo Molnar, Brian Gerst

From: "H. Peter Anvin" <hpa@zytor.com>
Date: Mon, Feb 22, 2010 at 09:21:05AM -0800

> On 02/22/2010 06:17 AM, Borislav Petkov wrote:
> >  
> > +config ARCH_HWEIGHT_CFLAGS
> > +	string
> > +	default "-fcall-saved-rsi -fcall-saved-rdx -fcall-saved-rcx -fcall-saved-r8 -fcall-saved-r9 -fcall-saved-r10 -fcall-saved-r11" if X86_64
> > +
> 
> [...]
> 
> > +
> > +#ifdef CONFIG_64BIT
> > +/* popcnt %rdi, %rax */
> > +#define POPCNT ".byte 0xf3\n\t.byte 0x48\n\t.byte 0x0f\n\t.byte 0xb8\n\t.byte 0xc7"
> > +#define REG_IN "D"
> > +#define REG_OUT "a"
> > +#else
> 
> Just a note: this still means rdi is clobbered on x86-64, which is
> probably fine, but needs to be recorded as such.  Since gcc doesn't
> support clobbers for registers used as operands (sigh), you have to
> create a dummy output and assign it a "=D" constraint.
> 
> I don't know if gcc would handle -fcall-saved-rdi here... and if so, how
> reliably.

Ok, from looking at kernel/sched.s output it looks like it saves rdi
content over the alternative where needed. I'll do some more testing
just to make sure.

--
From: Borislav Petkov <borislav.petkov@amd.com>
Date: Thu, 11 Feb 2010 00:48:31 +0100
Subject: [PATCH] x86: Add optimized popcnt variants

Add support for the hardware version of the Hamming weight function,
popcnt, present in CPUs which advertise it under CPUID, Function
0x0000_0001_ECX[23]. On CPUs which don't support it, we fall back to the
default lib/hweight.c sw versions.

A synthetic benchmark comparing popcnt with __sw_hweight64 showed almost
a 3x speedup on an F10h machine.

Signed-off-by: Borislav Petkov <borislav.petkov@amd.com>
---
 arch/x86/Kconfig                          |    5 ++
 arch/x86/include/asm/alternative.h        |    9 +++-
 arch/x86/include/asm/arch_hweight.h       |   59 +++++++++++++++++++++++++++++
 arch/x86/include/asm/bitops.h             |    4 +-
 include/asm-generic/bitops/arch_hweight.h |   22 +++++++++--
 lib/Makefile                              |    3 +
 lib/hweight.c                             |   20 +++++-----
 scripts/Makefile.lib                      |    4 ++
 8 files changed, 108 insertions(+), 18 deletions(-)
 create mode 100644 arch/x86/include/asm/arch_hweight.h

diff --git a/arch/x86/Kconfig b/arch/x86/Kconfig
index eb40925..176950e 100644
--- a/arch/x86/Kconfig
+++ b/arch/x86/Kconfig
@@ -230,6 +230,11 @@ config X86_32_LAZY_GS
 	def_bool y
 	depends on X86_32 && !CC_STACKPROTECTOR
 
+config ARCH_HWEIGHT_CFLAGS
+	string
+	default "-fcall-saved-ecx -fcall-saved-edx" if X86_32
+	default "-fcall-saved-rsi -fcall-saved-rdx -fcall-saved-rcx -fcall-saved-r8 -fcall-saved-r9 -fcall-saved-r10 -fcall-saved-r11" if X86_64
+
 config KTIME_SCALAR
 	def_bool X86_32
 source "init/Kconfig"
diff --git a/arch/x86/include/asm/alternative.h b/arch/x86/include/asm/alternative.h
index 69b74a7..0720c96 100644
--- a/arch/x86/include/asm/alternative.h
+++ b/arch/x86/include/asm/alternative.h
@@ -39,9 +39,6 @@
 #define LOCK_PREFIX ""
 #endif
 
-/* This must be included *after* the definition of LOCK_PREFIX */
-#include <asm/cpufeature.h>
-
 struct alt_instr {
 	u8 *instr;		/* original instruction */
 	u8 *replacement;
@@ -91,6 +88,12 @@ static inline void alternatives_smp_switch(int smp) {}
       ".previous"
 
 /*
+ * This must be included *after* the definition of ALTERNATIVE due to
+ * <asm/arch_hweight.h>
+ */
+#include <asm/cpufeature.h>
+
+/*
  * Alternative instructions for different CPU types or capabilities.
  *
  * This allows to use optimized instructions even on generic binary
diff --git a/arch/x86/include/asm/arch_hweight.h b/arch/x86/include/asm/arch_hweight.h
new file mode 100644
index 0000000..cc3a188
--- /dev/null
+++ b/arch/x86/include/asm/arch_hweight.h
@@ -0,0 +1,59 @@
+#ifndef _ASM_X86_HWEIGHT_H
+#define _ASM_X86_HWEIGHT_H
+
+#ifdef CONFIG_64BIT
+/* popcnt %rdi, %rax */
+#define POPCNT ".byte 0xf3\n\t.byte 0x48\n\t.byte 0x0f\n\t.byte 0xb8\n\t.byte 0xc7"
+#define REG_IN "D"
+#define REG_OUT "a"
+#else
+/* popcnt %eax, %eax */
+#define POPCNT ".byte 0xf3\n\t.byte 0x0f\n\t.byte 0xb8\n\t.byte 0xc0"
+#define REG_IN "a"
+#define REG_OUT "a"
+#endif
+
+/*
+ * __sw_hweightXX are called from within the alternatives below
+ * and callee-clobbered registers need to be taken care of. See
+ * ARCH_HWEIGHT_CFLAGS in <arch/x86/Kconfig> for the respective
+ * compiler switches.
+ */
+static inline unsigned int __arch_hweight32(unsigned int w)
+{
+	unsigned int res = 0;
+
+	asm (ALTERNATIVE("call __sw_hweight32", POPCNT, X86_FEATURE_POPCNT)
+		     : "="REG_OUT (res)
+		     : REG_IN (w));
+
+	return res;
+}
+
+static inline unsigned int __arch_hweight16(unsigned int w)
+{
+	return __arch_hweight32(w & 0xffff);
+}
+
+static inline unsigned int __arch_hweight8(unsigned int w)
+{
+	return __arch_hweight32(w & 0xff);
+}
+
+static inline unsigned long __arch_hweight64(__u64 w)
+{
+	unsigned long res = 0, dummy;
+
+#ifdef CONFIG_X86_32
+	return  __arch_hweight32((u32)w) +
+		__arch_hweight32((u32)(w >> 32));
+#else
+	asm (ALTERNATIVE("call __sw_hweight64", POPCNT, X86_FEATURE_POPCNT)
+		     : "="REG_OUT (res), "="REG_IN (dummy)
+		     : REG_IN (w));
+#endif /* CONFIG_X86_32 */
+
+	return res;
+}
+
+#endif
diff --git a/arch/x86/include/asm/bitops.h b/arch/x86/include/asm/bitops.h
index 02b47a6..545776e 100644
--- a/arch/x86/include/asm/bitops.h
+++ b/arch/x86/include/asm/bitops.h
@@ -444,7 +444,9 @@ static inline int fls(int x)
 
 #define ARCH_HAS_FAST_MULTIPLIER 1
 
-#include <asm-generic/bitops/hweight.h>
+#include <asm/arch_hweight.h>
+
+#include <asm-generic/bitops/const_hweight.h>
 
 #endif /* __KERNEL__ */
 
diff --git a/include/asm-generic/bitops/arch_hweight.h b/include/asm-generic/bitops/arch_hweight.h
index 3a7be84..9a81c1e 100644
--- a/include/asm-generic/bitops/arch_hweight.h
+++ b/include/asm-generic/bitops/arch_hweight.h
@@ -3,9 +3,23 @@
 
 #include <asm/types.h>
 
-extern unsigned int __arch_hweight32(unsigned int w);
-extern unsigned int __arch_hweight16(unsigned int w);
-extern unsigned int __arch_hweight8(unsigned int w);
-extern unsigned long __arch_hweight64(__u64 w);
+inline unsigned int __arch_hweight32(unsigned int w)
+{
+	return __sw_hweight32(w);
+}
 
+inline unsigned int __arch_hweight16(unsigned int w)
+{
+	return __sw_hweight16(w);
+}
+
+inline unsigned int __arch_hweight8(unsigned int w)
+{
+	return __sw_hweight8(w);
+}
+
+inline unsigned long __arch_hweight64(__u64 w)
+{
+	return __sw_hweight64(w);
+}
 #endif /* _ASM_GENERIC_BITOPS_HWEIGHT_H_ */
diff --git a/lib/Makefile b/lib/Makefile
index 3b0b4a6..e2ad17c 100644
--- a/lib/Makefile
+++ b/lib/Makefile
@@ -39,7 +39,10 @@ lib-$(CONFIG_RWSEM_XCHGADD_ALGORITHM) += rwsem.o
 lib-$(CONFIG_GENERIC_FIND_FIRST_BIT) += find_next_bit.o
 lib-$(CONFIG_GENERIC_FIND_NEXT_BIT) += find_next_bit.o
 obj-$(CONFIG_GENERIC_FIND_LAST_BIT) += find_last_bit.o
+
+CFLAGS_hweight.o = $(subst $(quote),,$(CONFIG_ARCH_HWEIGHT_CFLAGS))
 obj-$(CONFIG_GENERIC_HWEIGHT) += hweight.o
+
 obj-$(CONFIG_LOCK_KERNEL) += kernel_lock.o
 obj-$(CONFIG_DEBUG_PREEMPT) += smp_processor_id.o
 obj-$(CONFIG_DEBUG_LIST) += list_debug.o
diff --git a/lib/hweight.c b/lib/hweight.c
index 9ff86df..f9ce440 100644
--- a/lib/hweight.c
+++ b/lib/hweight.c
@@ -9,7 +9,7 @@
  * The Hamming Weight of a number is the total number of bits set in it.
  */
 
-unsigned int __arch_hweight32(unsigned int w)
+unsigned int __sw_hweight32(unsigned int w)
 {
 	unsigned int res = w - ((w >> 1) & 0x55555555);
 	res = (res & 0x33333333) + ((res >> 2) & 0x33333333);
@@ -17,30 +17,30 @@ unsigned int __arch_hweight32(unsigned int w)
 	res = res + (res >> 8);
 	return (res + (res >> 16)) & 0x000000FF;
 }
-EXPORT_SYMBOL(__arch_hweight32);
+EXPORT_SYMBOL(__sw_hweight32);
 
-unsigned int __arch_hweight16(unsigned int w)
+unsigned int __sw_hweight16(unsigned int w)
 {
 	unsigned int res = w - ((w >> 1) & 0x5555);
 	res = (res & 0x3333) + ((res >> 2) & 0x3333);
 	res = (res + (res >> 4)) & 0x0F0F;
 	return (res + (res >> 8)) & 0x00FF;
 }
-EXPORT_SYMBOL(__arch_hweight16);
+EXPORT_SYMBOL(__sw_hweight16);
 
-unsigned int __arch_hweight8(unsigned int w)
+unsigned int __sw_hweight8(unsigned int w)
 {
 	unsigned int res = w - ((w >> 1) & 0x55);
 	res = (res & 0x33) + ((res >> 2) & 0x33);
 	return (res + (res >> 4)) & 0x0F;
 }
-EXPORT_SYMBOL(__arch_hweight8);
+EXPORT_SYMBOL(__sw_hweight8);
 
-unsigned long __arch_hweight64(__u64 w)
+unsigned long __sw_hweight64(__u64 w)
 {
 #if BITS_PER_LONG == 32
-	return __arch_hweight32((unsigned int)(w >> 32)) +
-	       __arch_hweight32((unsigned int)w);
+	return __sw_hweight32((unsigned int)(w >> 32)) +
+	       __sw_hweight32((unsigned int)w);
 #elif BITS_PER_LONG == 64
 #ifdef ARCH_HAS_FAST_MULTIPLIER
 	w -= (w >> 1) & 0x5555555555555555ul;
@@ -57,4 +57,4 @@ unsigned long __arch_hweight64(__u64 w)
 #endif
 #endif
 }
-EXPORT_SYMBOL(__arch_hweight64);
+EXPORT_SYMBOL(__sw_hweight64);
diff --git a/scripts/Makefile.lib b/scripts/Makefile.lib
index f9bdf26..cbcd654 100644
--- a/scripts/Makefile.lib
+++ b/scripts/Makefile.lib
@@ -245,3 +245,7 @@ quiet_cmd_lzo = LZO    $@
 cmd_lzo = (cat $(filter-out FORCE,$^) | \
 	lzop -9 && $(call size_append, $(filter-out FORCE,$^))) > $@ || \
 	(rm -f $@ ; false)
+
+# misc stuff
+# ---------------------------------------------------------------------------
+quote:="
-- 
1.6.4.2


-- 
Regards/Gruss,
Boris.

--
Advanced Micro Devices, Inc.
Operating Systems Research Center

^ permalink raw reply related	[flat|nested] 81+ messages in thread

* Re: [PATCH] x86: Add optimized popcnt variants
  2010-02-22 18:49                                                                           ` Borislav Petkov
@ 2010-02-22 19:55                                                                             ` H. Peter Anvin
  2010-02-23  6:37                                                                               ` Borislav Petkov
  2010-02-23 15:58                                                                               ` Borislav Petkov
  0 siblings, 2 replies; 81+ messages in thread
From: H. Peter Anvin @ 2010-02-22 19:55 UTC (permalink / raw)
  To: Borislav Petkov
  Cc: Michal Marek, linux-kbuild, Peter Zijlstra, Andrew Morton,
	Wu Fengguang, LKML, Jamie Lokier, Roland Dreier, Al Viro,
	linux-fsdevel@vger.kernel.org, Ingo Molnar, Brian Gerst

On 02/22/2010 10:49 AM, Borislav Petkov wrote:
>>
>> Just a note: this still means rdi is clobbered on x86-64, which is
>> probably fine, but needs to be recorded as such.  Since gcc doesn't
>> support clobbers for registers used as operands (sigh), you have to
>> create a dummy output and assign it a "=D" constraint.
>>
>> I don't know if gcc would handle -fcall-saved-rdi here... and if so, how
>> reliably.
> 
> Ok, from looking at kernel/sched.s output it looks like it saves rdi
> content over the alternative where needed. I'll do some more testing
> just to make sure.
> 

No, you can't rely on behavioral observation.  A different version of
gcc could behave differently.  We need to make sure we tell gcc what the
requirements actually are, as opposed to thinking we can just fix them.

+#define POPCNT ".byte 0xf3\n\t.byte 0x48\n\t.byte 0x0f\n\t.byte
0xb8\n\t.byte 0xc7"

BTW, this can be written:

#define POPCNT ".byte 0xf3,0x48,0x0f,0xb8,0xc7"

	-hpa

^ permalink raw reply	[flat|nested] 81+ messages in thread

* Re: [PATCH] x86: Add optimized popcnt variants
  2010-02-22 19:55                                                                             ` H. Peter Anvin
@ 2010-02-23  6:37                                                                               ` Borislav Petkov
  2010-02-23 15:58                                                                               ` Borislav Petkov
  1 sibling, 0 replies; 81+ messages in thread
From: Borislav Petkov @ 2010-02-23  6:37 UTC (permalink / raw)
  To: H. Peter Anvin
  Cc: Borislav Petkov, Michal Marek, linux-kbuild, Peter Zijlstra,
	Andrew Morton, Wu Fengguang, LKML, Jamie Lokier, Roland Dreier,
	Al Viro, linux-fsdevel@vger.kernel.org, Ingo Molnar, Brian Gerst

From: "H. Peter Anvin" <hpa@zytor.com>
Date: Mon, Feb 22, 2010 at 11:55:48AM -0800

> On 02/22/2010 10:49 AM, Borislav Petkov wrote:
> >>
> >> Just a note: this still means rdi is clobbered on x86-64, which is
> >> probably fine, but needs to be recorded as such.  Since gcc doesn't
> >> support clobbers for registers used as operands (sigh), you have to
> >> create a dummy output and assign it a "=D" constraint.
> >>
> >> I don't know if gcc would handle -fcall-saved-rdi here... and if so, how
> >> reliably.
> > 
> > Ok, from looking at kernel/sched.s output it looks like it saves rdi
> > content over the alternative where needed. I'll do some more testing
> > just to make sure.
> > 
> 
> No, you can't rely on behavioral observation.  A different version of
> gcc could behave differently.  We need to make sure we tell gcc what the
> requirements actually are, as opposed to thinking we can just fix them.

Ok, I've added the dummy "=D" constraint since it sounded like the more
stable thing to do WRT gcc versions. BTW, I left the machine overnight
and it is still alive cheerfully building randconfigs.

I'll try the -fcall-saved-rdi thing also later today.

> +#define POPCNT ".byte 0xf3\n\t.byte 0x48\n\t.byte 0x0f\n\t.byte
> 0xb8\n\t.byte 0xc7"
> 
> BTW, this can be written:
> 
> #define POPCNT ".byte 0xf3,0x48,0x0f,0xb8,0xc7"

done, updated version coming up.

-- 
Regards/Gruss,
    Boris.

^ permalink raw reply	[flat|nested] 81+ messages in thread

* Re: [PATCH] x86: Add optimized popcnt variants
  2010-02-22 19:55                                                                             ` H. Peter Anvin
  2010-02-23  6:37                                                                               ` Borislav Petkov
@ 2010-02-23 15:58                                                                               ` Borislav Petkov
  2010-02-23 17:34                                                                                 ` H. Peter Anvin
  1 sibling, 1 reply; 81+ messages in thread
From: Borislav Petkov @ 2010-02-23 15:58 UTC (permalink / raw)
  To: H. Peter Anvin
  Cc: Borislav Petkov, Michal Marek, linux-kbuild, Peter Zijlstra,
	Andrew Morton, Wu Fengguang, LKML, Jamie Lokier, Roland Dreier,
	Al Viro, linux-fsdevel@vger.kernel.org, Ingo Molnar, Brian Gerst

From: "H. Peter Anvin" <hpa@zytor.com>
Date: Mon, Feb 22, 2010 at 11:55:48AM -0800

> On 02/22/2010 10:49 AM, Borislav Petkov wrote:
> >>
> >> Just a note: this still means rdi is clobbered on x86-64, which is
> >> probably fine, but needs to be recorded as such.  Since gcc doesn't
> >> support clobbers for registers used as operands (sigh), you have to
> >> create a dummy output and assign it a "=D" constraint.
> >>
> >> I don't know if gcc would handle -fcall-saved-rdi here... and if so, how
> >> reliably.

Hmm, we cannot do that with the current design since __arch_hweight64
is being inlined into every callsite and AFAICT we would have to build
every callsite with "-fcall-saved-rdi" which is clearly too much. The
explicit "=D" dummy constraint is straightforward, instead.

> > Ok, from looking at kernel/sched.s output it looks like it saves rdi
> > content over the alternative where needed. I'll do some more testing
> > just to make sure.
> > 
> 
> No, you can't rely on behavioral observation.  A different version of
> gcc could behave differently.  We need to make sure we tell gcc what the
> requirements actually are, as opposed to thinking we can just fix them.

Just to make clear: those observations were referring to the version
_with_ the dummy "=D" constraint - I was simply verifying the asm
output and wasn't relying on behavioral observation.

> +#define POPCNT ".byte 0xf3\n\t.byte 0x48\n\t.byte 0x0f\n\t.byte
> 0xb8\n\t.byte 0xc7"
> 
> BTW, this can be written:
> 
> #define POPCNT ".byte 0xf3,0x48,0x0f,0xb8,0xc7"

Done, here's the latest version, it boots fine and testing is ongoing:

--
From: Borislav Petkov <borislav.petkov@amd.com>
Date: Thu, 11 Feb 2010 00:48:31 +0100
Subject: [PATCH] x86: Add optimized popcnt variants

Add support for the hardware version of the Hamming weight function,
popcnt, present in CPUs which advertise it under CPUID, Function
0x0000_0001_ECX[23]. On CPUs which don't support it, we fall back to the
default lib/hweight.c sw versions.

A synthetic benchmark comparing popcnt with __sw_hweight64 showed almost
a 3x speedup on an F10h machine.

Signed-off-by: Borislav Petkov <borislav.petkov@amd.com>
---
 arch/x86/Kconfig                          |    5 ++
 arch/x86/include/asm/alternative.h        |    9 +++-
 arch/x86/include/asm/arch_hweight.h       |   61 +++++++++++++++++++++++++++++
 arch/x86/include/asm/bitops.h             |    4 +-
 include/asm-generic/bitops/arch_hweight.h |   22 ++++++++--
 lib/Makefile                              |    3 +
 lib/hweight.c                             |   20 +++++-----
 scripts/Makefile.lib                      |    4 ++
 8 files changed, 110 insertions(+), 18 deletions(-)
 create mode 100644 arch/x86/include/asm/arch_hweight.h

diff --git a/arch/x86/Kconfig b/arch/x86/Kconfig
index eb40925..176950e 100644
--- a/arch/x86/Kconfig
+++ b/arch/x86/Kconfig
@@ -230,6 +230,11 @@ config X86_32_LAZY_GS
 	def_bool y
 	depends on X86_32 && !CC_STACKPROTECTOR
 
+config ARCH_HWEIGHT_CFLAGS
+	string
+	default "-fcall-saved-ecx -fcall-saved-edx" if X86_32
+	default "-fcall-saved-rsi -fcall-saved-rdx -fcall-saved-rcx -fcall-saved-r8 -fcall-saved-r9 -fcall-saved-r10 -fcall-saved-r11" if X86_64
+
 config KTIME_SCALAR
 	def_bool X86_32
 source "init/Kconfig"
diff --git a/arch/x86/include/asm/alternative.h b/arch/x86/include/asm/alternative.h
index 69b74a7..0720c96 100644
--- a/arch/x86/include/asm/alternative.h
+++ b/arch/x86/include/asm/alternative.h
@@ -39,9 +39,6 @@
 #define LOCK_PREFIX ""
 #endif
 
-/* This must be included *after* the definition of LOCK_PREFIX */
-#include <asm/cpufeature.h>
-
 struct alt_instr {
 	u8 *instr;		/* original instruction */
 	u8 *replacement;
@@ -91,6 +88,12 @@ static inline void alternatives_smp_switch(int smp) {}
       ".previous"
 
 /*
+ * This must be included *after* the definition of ALTERNATIVE due to
+ * <asm/arch_hweight.h>
+ */
+#include <asm/cpufeature.h>
+
+/*
  * Alternative instructions for different CPU types or capabilities.
  *
  * This allows to use optimized instructions even on generic binary
diff --git a/arch/x86/include/asm/arch_hweight.h b/arch/x86/include/asm/arch_hweight.h
new file mode 100644
index 0000000..eaefadc
--- /dev/null
+++ b/arch/x86/include/asm/arch_hweight.h
@@ -0,0 +1,61 @@
+#ifndef _ASM_X86_HWEIGHT_H
+#define _ASM_X86_HWEIGHT_H
+
+#ifdef CONFIG_64BIT
+/* popcnt %rdi, %rax */
+#define POPCNT ".byte 0xf3,0x48,0x0f,0xb8,0xc7"
+#define REG_IN "D"
+#define REG_OUT "a"
+#else
+/* popcnt %eax, %eax */
+#define POPCNT ".byte 0xf3,0x0f,0xb8,0xc0"
+#define REG_IN "a"
+#define REG_OUT "a"
+#endif
+
+/*
+ * __sw_hweightXX are called from within the alternatives below
+ * and callee-clobbered registers need to be taken care of. See
+ * ARCH_HWEIGHT_CFLAGS in <arch/x86/Kconfig> for the respective
+ * compiler switches.
+ */
+static inline unsigned int __arch_hweight32(unsigned int w)
+{
+	unsigned int res = 0;
+
+	asm (ALTERNATIVE("call __sw_hweight32", POPCNT, X86_FEATURE_POPCNT)
+		     : "="REG_OUT (res)
+		     : REG_IN (w));
+
+	return res;
+}
+
+static inline unsigned int __arch_hweight16(unsigned int w)
+{
+	return __arch_hweight32(w & 0xffff);
+}
+
+static inline unsigned int __arch_hweight8(unsigned int w)
+{
+	return __arch_hweight32(w & 0xff);
+}
+
+static inline unsigned long __arch_hweight64(__u64 w)
+{
+	unsigned long res = 0;
+	/* tell gcc that %rdi is clobbered as an input operand */
+	unsigned long dummy;
+
+#ifdef CONFIG_X86_32
+	return  __arch_hweight32((u32)w) +
+		__arch_hweight32((u32)(w >> 32));
+#else
+	asm (ALTERNATIVE("call __sw_hweight64", POPCNT, X86_FEATURE_POPCNT)
+		     : "="REG_OUT (res), "="REG_IN (dummy)
+		     : REG_IN (w));
+#endif /* CONFIG_X86_32 */
+
+	return res;
+}
+
+#endif
diff --git a/arch/x86/include/asm/bitops.h b/arch/x86/include/asm/bitops.h
index 02b47a6..545776e 100644
--- a/arch/x86/include/asm/bitops.h
+++ b/arch/x86/include/asm/bitops.h
@@ -444,7 +444,9 @@ static inline int fls(int x)
 
 #define ARCH_HAS_FAST_MULTIPLIER 1
 
-#include <asm-generic/bitops/hweight.h>
+#include <asm/arch_hweight.h>
+
+#include <asm-generic/bitops/const_hweight.h>
 
 #endif /* __KERNEL__ */
 
diff --git a/include/asm-generic/bitops/arch_hweight.h b/include/asm-generic/bitops/arch_hweight.h
index 3a7be84..9a81c1e 100644
--- a/include/asm-generic/bitops/arch_hweight.h
+++ b/include/asm-generic/bitops/arch_hweight.h
@@ -3,9 +3,23 @@
 
 #include <asm/types.h>
 
-extern unsigned int __arch_hweight32(unsigned int w);
-extern unsigned int __arch_hweight16(unsigned int w);
-extern unsigned int __arch_hweight8(unsigned int w);
-extern unsigned long __arch_hweight64(__u64 w);
+inline unsigned int __arch_hweight32(unsigned int w)
+{
+	return __sw_hweight32(w);
+}
 
+inline unsigned int __arch_hweight16(unsigned int w)
+{
+	return __sw_hweight16(w);
+}
+
+inline unsigned int __arch_hweight8(unsigned int w)
+{
+	return __sw_hweight8(w);
+}
+
+inline unsigned long __arch_hweight64(__u64 w)
+{
+	return __sw_hweight64(w);
+}
 #endif /* _ASM_GENERIC_BITOPS_HWEIGHT_H_ */
diff --git a/lib/Makefile b/lib/Makefile
index 3b0b4a6..e2ad17c 100644
--- a/lib/Makefile
+++ b/lib/Makefile
@@ -39,7 +39,10 @@ lib-$(CONFIG_RWSEM_XCHGADD_ALGORITHM) += rwsem.o
 lib-$(CONFIG_GENERIC_FIND_FIRST_BIT) += find_next_bit.o
 lib-$(CONFIG_GENERIC_FIND_NEXT_BIT) += find_next_bit.o
 obj-$(CONFIG_GENERIC_FIND_LAST_BIT) += find_last_bit.o
+
+CFLAGS_hweight.o = $(subst $(quote),,$(CONFIG_ARCH_HWEIGHT_CFLAGS))
 obj-$(CONFIG_GENERIC_HWEIGHT) += hweight.o
+
 obj-$(CONFIG_LOCK_KERNEL) += kernel_lock.o
 obj-$(CONFIG_DEBUG_PREEMPT) += smp_processor_id.o
 obj-$(CONFIG_DEBUG_LIST) += list_debug.o
diff --git a/lib/hweight.c b/lib/hweight.c
index 9ff86df..f9ce440 100644
--- a/lib/hweight.c
+++ b/lib/hweight.c
@@ -9,7 +9,7 @@
  * The Hamming Weight of a number is the total number of bits set in it.
  */
 
-unsigned int __arch_hweight32(unsigned int w)
+unsigned int __sw_hweight32(unsigned int w)
 {
 	unsigned int res = w - ((w >> 1) & 0x55555555);
 	res = (res & 0x33333333) + ((res >> 2) & 0x33333333);
@@ -17,30 +17,30 @@ unsigned int __arch_hweight32(unsigned int w)
 	res = res + (res >> 8);
 	return (res + (res >> 16)) & 0x000000FF;
 }
-EXPORT_SYMBOL(__arch_hweight32);
+EXPORT_SYMBOL(__sw_hweight32);
 
-unsigned int __arch_hweight16(unsigned int w)
+unsigned int __sw_hweight16(unsigned int w)
 {
 	unsigned int res = w - ((w >> 1) & 0x5555);
 	res = (res & 0x3333) + ((res >> 2) & 0x3333);
 	res = (res + (res >> 4)) & 0x0F0F;
 	return (res + (res >> 8)) & 0x00FF;
 }
-EXPORT_SYMBOL(__arch_hweight16);
+EXPORT_SYMBOL(__sw_hweight16);
 
-unsigned int __arch_hweight8(unsigned int w)
+unsigned int __sw_hweight8(unsigned int w)
 {
 	unsigned int res = w - ((w >> 1) & 0x55);
 	res = (res & 0x33) + ((res >> 2) & 0x33);
 	return (res + (res >> 4)) & 0x0F;
 }
-EXPORT_SYMBOL(__arch_hweight8);
+EXPORT_SYMBOL(__sw_hweight8);
 
-unsigned long __arch_hweight64(__u64 w)
+unsigned long __sw_hweight64(__u64 w)
 {
 #if BITS_PER_LONG == 32
-	return __arch_hweight32((unsigned int)(w >> 32)) +
-	       __arch_hweight32((unsigned int)w);
+	return __sw_hweight32((unsigned int)(w >> 32)) +
+	       __sw_hweight32((unsigned int)w);
 #elif BITS_PER_LONG == 64
 #ifdef ARCH_HAS_FAST_MULTIPLIER
 	w -= (w >> 1) & 0x5555555555555555ul;
@@ -57,4 +57,4 @@ unsigned long __arch_hweight64(__u64 w)
 #endif
 #endif
 }
-EXPORT_SYMBOL(__arch_hweight64);
+EXPORT_SYMBOL(__sw_hweight64);
diff --git a/scripts/Makefile.lib b/scripts/Makefile.lib
index f9bdf26..cbcd654 100644
--- a/scripts/Makefile.lib
+++ b/scripts/Makefile.lib
@@ -245,3 +245,7 @@ quiet_cmd_lzo = LZO    $@
 cmd_lzo = (cat $(filter-out FORCE,$^) | \
 	lzop -9 && $(call size_append, $(filter-out FORCE,$^))) > $@ || \
 	(rm -f $@ ; false)
+
+# misc stuff
+# ---------------------------------------------------------------------------
+quote:="
-- 
1.6.4.2


-- 
Regards/Gruss,
Boris.

--
Advanced Micro Devices, Inc.
Operating Systems Research Center
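
[Editor's note: a userspace sketch of the selection that the ALTERNATIVE()
in the patch performs once, at boot time, by patching the call site. Here
the choice is made with an ordinary branch instead; the function names are
made up, and the CPUID bit tested is the one named in the changelog
(leaf 1, ECX[23]).]

#include <stdio.h>
#include <cpuid.h>

/* pure-C fallback, same SWAR idea as __sw_hweight32() in lib/hweight.c */
static unsigned int sw_hweight32(unsigned int w)
{
	w = w - ((w >> 1) & 0x55555555);
	w = (w & 0x33333333) + ((w >> 2) & 0x33333333);
	w = (w + (w >> 4)) & 0x0f0f0f0f;
	return (w * 0x01010101) >> 24;
}

/* hardware popcnt; only reached after the CPUID check below */
static unsigned int hw_hweight32(unsigned int w)
{
	unsigned int res;

	asm ("popcnt %1, %0" : "=r" (res) : "r" (w));
	return res;
}

static int cpu_has_popcnt(void)
{
	unsigned int eax, ebx, ecx, edx;

	if (!__get_cpuid(1, &eax, &ebx, &ecx, &edx))
		return 0;
	return (ecx >> 23) & 1;			/* CPUID.01H:ECX[23] */
}

int main(void)
{
	unsigned int w = 0xdeadbeef;

	printf("hweight32(%#x) = %u\n", w,
	       cpu_has_popcnt() ? hw_hweight32(w) : sw_hweight32(w));
	return 0;
}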

^ permalink raw reply related	[flat|nested] 81+ messages in thread

* Re: [PATCH] x86: Add optimized popcnt variants
  2010-02-23 15:58                                                                               ` Borislav Petkov
@ 2010-02-23 17:34                                                                                 ` H. Peter Anvin
  2010-02-23 17:54                                                                                   ` Borislav Petkov
  0 siblings, 1 reply; 81+ messages in thread
From: H. Peter Anvin @ 2010-02-23 17:34 UTC (permalink / raw)
  To: Borislav Petkov
  Cc: Michal Marek, linux-kbuild, Peter Zijlstra, Andrew Morton,
	Wu Fengguang, LKML, Jamie Lokier, Roland Dreier, Al Viro,
	linux-fsdevel@vger.kernel.org, Ingo Molnar, Brian Gerst

On 02/23/2010 07:58 AM, Borislav Petkov wrote:
>
> Hmm, we cannot do that with the current design since __arch_hweight64
> is being inlined into every callsite and AFAICT we would have to build
> every callsite with "-fcall-saved-rdi" which is clearly too much. The
> explicit "=D" dummy constraint is straightforward, instead.
>

Uh... the -fcall-saved-rdi would go with all the other ones.  Assuming 
it can actually work and that gcc doesn't choke on an inbound argument 
being saved.

	-hpa

^ permalink raw reply	[flat|nested] 81+ messages in thread

* Re: [PATCH] x86: Add optimized popcnt variants
  2010-02-23 17:34                                                                                 ` H. Peter Anvin
@ 2010-02-23 17:54                                                                                   ` Borislav Petkov
  2010-02-23 18:17                                                                                     ` H. Peter Anvin
  0 siblings, 1 reply; 81+ messages in thread
From: Borislav Petkov @ 2010-02-23 17:54 UTC (permalink / raw)
  To: H. Peter Anvin
  Cc: Borislav Petkov, Michal Marek, linux-kbuild, Peter Zijlstra,
	Andrew Morton, Wu Fengguang, LKML, Jamie Lokier, Roland Dreier,
	Al Viro, linux-fsdevel@vger.kernel.org, Ingo Molnar, Brian Gerst

From: "H. Peter Anvin" <hpa@zytor.com>
Date: Tue, Feb 23, 2010 at 09:34:04AM -0800

> On 02/23/2010 07:58 AM, Borislav Petkov wrote:
> >
> >Hmm, we cannot do that with the current design since __arch_hweight64
> >is being inlined into every callsite and AFAICT we would have to build
> >every callsite with "-fcall-saved-rdi" which is clearly too much. The
> >explicit "=D" dummy constraint is straightforward, instead.
> >
> 
> Uh... the -fcall-saved-rdi would go with all the other ones.
> Assuming it can actually work and that gcc doesn't choke on an
> inbound argument being saved.

Right, doh. Ok, just added it and it builds fine with a gcc (Gentoo
4.4.1 p1.0) 4.4.1. If you have suspicion that some older gcc versions
might choke on it, I could leave the "=D" dummy constraint in?

BTW, the current version screams

/usr/src/linux-2.6/arch/x86/include/asm/arch_hweight.h: In function ‘__arch_hweight64’:
/usr/src/linux-2.6/arch/x86/include/asm/arch_hweight.h:47: warning: unused variable ‘dummy’

on x86-32. I'll send a fixed version in a second.

-- 
Regards/Gruss,
Boris.

-
Advanced Micro Devices, Inc.
Operating Systems Research Center
--
To unsubscribe from this list: send the line "unsubscribe linux-kbuild" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
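
[Editor's note: one obvious way to quiet that warning while keeping the v1
dummy-output approach is sketched below, using the macros defined in the
patch above; the v2 patch further down instead drops the dummy altogether
once -fcall-saved-rdi is added.]

static inline unsigned long __arch_hweight64(__u64 w)
{
#ifdef CONFIG_X86_32
	return  __arch_hweight32((u32)w) +
		__arch_hweight32((u32)(w >> 32));
#else
	unsigned long res = 0;
	/* dummy output telling gcc that %rdi is clobbered; 64-bit only */
	unsigned long dummy;

	asm (ALTERNATIVE("call __sw_hweight64", POPCNT, X86_FEATURE_POPCNT)
		     : "="REG_OUT (res), "="REG_IN (dummy)
		     : REG_IN (w));

	return res;
#endif /* CONFIG_X86_32 */
}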

^ permalink raw reply	[flat|nested] 81+ messages in thread

* Re: [PATCH] x86: Add optimized popcnt variants
  2010-02-23 17:54                                                                                   ` Borislav Petkov
@ 2010-02-23 18:17                                                                                     ` H. Peter Anvin
  2010-02-23 19:06                                                                                       ` Borislav Petkov
  0 siblings, 1 reply; 81+ messages in thread
From: H. Peter Anvin @ 2010-02-23 18:17 UTC (permalink / raw)
  To: Borislav Petkov
  Cc: Michal Marek, linux-kbuild, Peter Zijlstra, Andrew Morton,
	Wu Fengguang, LKML, Jamie Lokier, Roland Dreier, Al Viro,
	linux-fsdevel@vger.kernel.org, Ingo Molnar, Brian Gerst

On 02/23/2010 09:54 AM, Borislav Petkov wrote:
> 
> Right, doh. Ok, just added it and it builds fine with a gcc (Gentoo
> 4.4.1 p1.0) 4.4.1. If you have suspicion that some older gcc versions
> might choke on it, I could leave the "=D" dummy constraint in?
> 

I can try it with gcc 3.4 here.  -fcall-saved-rdi is cleaner, if it works.

	-hpa

-- 
H. Peter Anvin, Intel Open Source Technology Center
I work for Intel.  I don't speak on their behalf.


^ permalink raw reply	[flat|nested] 81+ messages in thread

* Re: [PATCH] x86: Add optimized popcnt variants
  2010-02-23 18:17                                                                                     ` H. Peter Anvin
@ 2010-02-23 19:06                                                                                       ` Borislav Petkov
  2010-02-26  5:27                                                                                         ` H. Peter Anvin
  0 siblings, 1 reply; 81+ messages in thread
From: Borislav Petkov @ 2010-02-23 19:06 UTC (permalink / raw)
  To: H. Peter Anvin
  Cc: Borislav Petkov, Michal Marek, linux-kbuild, Peter Zijlstra,
	Andrew Morton, Wu Fengguang, LKML, Jamie Lokier, Roland Dreier,
	Al Viro, linux-fsdevel@vger.kernel.org, Ingo Molnar, Brian Gerst

From: "H. Peter Anvin" <hpa@zytor.com>
Date: Tue, Feb 23, 2010 at 10:17:39AM -0800

> On 02/23/2010 09:54 AM, Borislav Petkov wrote:
> > 
> > Right, doh. Ok, just added it and it builds fine with a gcc (Gentoo
> > 4.4.1 p1.0) 4.4.1. If you have suspicion that some older gcc versions
> > might choke on it, I could leave the "=D" dummy constraint in?
> > 
> 
> I can try it with gcc 3.4 here.  -fcall-saved-rdi is cleaner, if it works.

Ok, here you go.

--
From: Borislav Petkov <borislav.petkov@amd.com>
Date: Thu, 11 Feb 2010 00:48:31 +0100
Subject: [PATCH] x86: Add optimized popcnt variants

Add support for the hardware version of the Hamming weight function,
popcnt, present in CPUs which advertise it under CPUID, Function
0x0000_0001_ECX[23]. On CPUs which don't support it, we fall back to the
default lib/hweight.c sw versions.

A synthetic benchmark comparing popcnt with __sw_hweight64 showed almost
a 3x speedup on an F10h machine.

Signed-off-by: Borislav Petkov <borislav.petkov@amd.com>
---
 arch/x86/Kconfig                          |    5 ++
 arch/x86/include/asm/alternative.h        |    9 +++-
 arch/x86/include/asm/arch_hweight.h       |   59 +++++++++++++++++++++++++++++
 arch/x86/include/asm/bitops.h             |    4 +-
 include/asm-generic/bitops/arch_hweight.h |   22 +++++++++--
 lib/Makefile                              |    3 +
 lib/hweight.c                             |   20 +++++-----
 scripts/Makefile.lib                      |    4 ++
 8 files changed, 108 insertions(+), 18 deletions(-)
 create mode 100644 arch/x86/include/asm/arch_hweight.h

diff --git a/arch/x86/Kconfig b/arch/x86/Kconfig
index eb40925..4673dc5 100644
--- a/arch/x86/Kconfig
+++ b/arch/x86/Kconfig
@@ -230,6 +230,11 @@ config X86_32_LAZY_GS
 	def_bool y
 	depends on X86_32 && !CC_STACKPROTECTOR
 
+config ARCH_HWEIGHT_CFLAGS
+	string
+	default "-fcall-saved-ecx -fcall-saved-edx" if X86_32
+	default "-fcall-saved-rdi -fcall-saved-rsi -fcall-saved-rdx -fcall-saved-rcx -fcall-saved-r8 -fcall-saved-r9 -fcall-saved-r10 -fcall-saved-r11" if X86_64
+
 config KTIME_SCALAR
 	def_bool X86_32
 source "init/Kconfig"
diff --git a/arch/x86/include/asm/alternative.h b/arch/x86/include/asm/alternative.h
index 69b74a7..0720c96 100644
--- a/arch/x86/include/asm/alternative.h
+++ b/arch/x86/include/asm/alternative.h
@@ -39,9 +39,6 @@
 #define LOCK_PREFIX ""
 #endif
 
-/* This must be included *after* the definition of LOCK_PREFIX */
-#include <asm/cpufeature.h>
-
 struct alt_instr {
 	u8 *instr;		/* original instruction */
 	u8 *replacement;
@@ -91,6 +88,12 @@ static inline void alternatives_smp_switch(int smp) {}
       ".previous"
 
 /*
+ * This must be included *after* the definition of ALTERNATIVE due to
+ * <asm/arch_hweight.h>
+ */
+#include <asm/cpufeature.h>
+
+/*
  * Alternative instructions for different CPU types or capabilities.
  *
  * This allows to use optimized instructions even on generic binary
diff --git a/arch/x86/include/asm/arch_hweight.h b/arch/x86/include/asm/arch_hweight.h
new file mode 100644
index 0000000..d1fc3c2
--- /dev/null
+++ b/arch/x86/include/asm/arch_hweight.h
@@ -0,0 +1,59 @@
+#ifndef _ASM_X86_HWEIGHT_H
+#define _ASM_X86_HWEIGHT_H
+
+#ifdef CONFIG_64BIT
+/* popcnt %rdi, %rax */
+#define POPCNT ".byte 0xf3,0x48,0x0f,0xb8,0xc7"
+#define REG_IN "D"
+#define REG_OUT "a"
+#else
+/* popcnt %eax, %eax */
+#define POPCNT ".byte 0xf3,0x0f,0xb8,0xc0"
+#define REG_IN "a"
+#define REG_OUT "a"
+#endif
+
+/*
+ * __sw_hweightXX are called from within the alternatives below
+ * and callee-clobbered registers need to be taken care of. See
+ * ARCH_HWEIGHT_CFLAGS in <arch/x86/Kconfig> for the respective
+ * compiler switches.
+ */
+static inline unsigned int __arch_hweight32(unsigned int w)
+{
+	unsigned int res = 0;
+
+	asm (ALTERNATIVE("call __sw_hweight32", POPCNT, X86_FEATURE_POPCNT)
+		     : "="REG_OUT (res)
+		     : REG_IN (w));
+
+	return res;
+}
+
+static inline unsigned int __arch_hweight16(unsigned int w)
+{
+	return __arch_hweight32(w & 0xffff);
+}
+
+static inline unsigned int __arch_hweight8(unsigned int w)
+{
+	return __arch_hweight32(w & 0xff);
+}
+
+static inline unsigned long __arch_hweight64(__u64 w)
+{
+	unsigned long res = 0;
+
+#ifdef CONFIG_X86_32
+	return  __arch_hweight32((u32)w) +
+		__arch_hweight32((u32)(w >> 32));
+#else
+	asm (ALTERNATIVE("call __sw_hweight64", POPCNT, X86_FEATURE_POPCNT)
+		     : "="REG_OUT (res)
+		     : REG_IN (w));
+#endif /* CONFIG_X86_32 */
+
+	return res;
+}
+
+#endif
diff --git a/arch/x86/include/asm/bitops.h b/arch/x86/include/asm/bitops.h
index 02b47a6..545776e 100644
--- a/arch/x86/include/asm/bitops.h
+++ b/arch/x86/include/asm/bitops.h
@@ -444,7 +444,9 @@ static inline int fls(int x)
 
 #define ARCH_HAS_FAST_MULTIPLIER 1
 
-#include <asm-generic/bitops/hweight.h>
+#include <asm/arch_hweight.h>
+
+#include <asm-generic/bitops/const_hweight.h>
 
 #endif /* __KERNEL__ */
 
diff --git a/include/asm-generic/bitops/arch_hweight.h b/include/asm-generic/bitops/arch_hweight.h
index 3a7be84..9a81c1e 100644
--- a/include/asm-generic/bitops/arch_hweight.h
+++ b/include/asm-generic/bitops/arch_hweight.h
@@ -3,9 +3,23 @@
 
 #include <asm/types.h>
 
-extern unsigned int __arch_hweight32(unsigned int w);
-extern unsigned int __arch_hweight16(unsigned int w);
-extern unsigned int __arch_hweight8(unsigned int w);
-extern unsigned long __arch_hweight64(__u64 w);
+inline unsigned int __arch_hweight32(unsigned int w)
+{
+	return __sw_hweight32(w);
+}
 
+inline unsigned int __arch_hweight16(unsigned int w)
+{
+	return __sw_hweight16(w);
+}
+
+inline unsigned int __arch_hweight8(unsigned int w)
+{
+	return __sw_hweight8(w);
+}
+
+inline unsigned long __arch_hweight64(__u64 w)
+{
+	return __sw_hweight64(w);
+}
 #endif /* _ASM_GENERIC_BITOPS_HWEIGHT_H_ */
diff --git a/lib/Makefile b/lib/Makefile
index 3b0b4a6..e2ad17c 100644
--- a/lib/Makefile
+++ b/lib/Makefile
@@ -39,7 +39,10 @@ lib-$(CONFIG_RWSEM_XCHGADD_ALGORITHM) += rwsem.o
 lib-$(CONFIG_GENERIC_FIND_FIRST_BIT) += find_next_bit.o
 lib-$(CONFIG_GENERIC_FIND_NEXT_BIT) += find_next_bit.o
 obj-$(CONFIG_GENERIC_FIND_LAST_BIT) += find_last_bit.o
+
+CFLAGS_hweight.o = $(subst $(quote),,$(CONFIG_ARCH_HWEIGHT_CFLAGS))
 obj-$(CONFIG_GENERIC_HWEIGHT) += hweight.o
+
 obj-$(CONFIG_LOCK_KERNEL) += kernel_lock.o
 obj-$(CONFIG_DEBUG_PREEMPT) += smp_processor_id.o
 obj-$(CONFIG_DEBUG_LIST) += list_debug.o
diff --git a/lib/hweight.c b/lib/hweight.c
index 9ff86df..f9ce440 100644
--- a/lib/hweight.c
+++ b/lib/hweight.c
@@ -9,7 +9,7 @@
  * The Hamming Weight of a number is the total number of bits set in it.
  */
 
-unsigned int __arch_hweight32(unsigned int w)
+unsigned int __sw_hweight32(unsigned int w)
 {
 	unsigned int res = w - ((w >> 1) & 0x55555555);
 	res = (res & 0x33333333) + ((res >> 2) & 0x33333333);
@@ -17,30 +17,30 @@ unsigned int __arch_hweight32(unsigned int w)
 	res = res + (res >> 8);
 	return (res + (res >> 16)) & 0x000000FF;
 }
-EXPORT_SYMBOL(__arch_hweight32);
+EXPORT_SYMBOL(__sw_hweight32);
 
-unsigned int __arch_hweight16(unsigned int w)
+unsigned int __sw_hweight16(unsigned int w)
 {
 	unsigned int res = w - ((w >> 1) & 0x5555);
 	res = (res & 0x3333) + ((res >> 2) & 0x3333);
 	res = (res + (res >> 4)) & 0x0F0F;
 	return (res + (res >> 8)) & 0x00FF;
 }
-EXPORT_SYMBOL(__arch_hweight16);
+EXPORT_SYMBOL(__sw_hweight16);
 
-unsigned int __arch_hweight8(unsigned int w)
+unsigned int __sw_hweight8(unsigned int w)
 {
 	unsigned int res = w - ((w >> 1) & 0x55);
 	res = (res & 0x33) + ((res >> 2) & 0x33);
 	return (res + (res >> 4)) & 0x0F;
 }
-EXPORT_SYMBOL(__arch_hweight8);
+EXPORT_SYMBOL(__sw_hweight8);
 
-unsigned long __arch_hweight64(__u64 w)
+unsigned long __sw_hweight64(__u64 w)
 {
 #if BITS_PER_LONG == 32
-	return __arch_hweight32((unsigned int)(w >> 32)) +
-	       __arch_hweight32((unsigned int)w);
+	return __sw_hweight32((unsigned int)(w >> 32)) +
+	       __sw_hweight32((unsigned int)w);
 #elif BITS_PER_LONG == 64
 #ifdef ARCH_HAS_FAST_MULTIPLIER
 	w -= (w >> 1) & 0x5555555555555555ul;
@@ -57,4 +57,4 @@ unsigned long __arch_hweight64(__u64 w)
 #endif
 #endif
 }
-EXPORT_SYMBOL(__arch_hweight64);
+EXPORT_SYMBOL(__sw_hweight64);
diff --git a/scripts/Makefile.lib b/scripts/Makefile.lib
index f9bdf26..cbcd654 100644
--- a/scripts/Makefile.lib
+++ b/scripts/Makefile.lib
@@ -245,3 +245,7 @@ quiet_cmd_lzo = LZO    $@
 cmd_lzo = (cat $(filter-out FORCE,$^) | \
 	lzop -9 && $(call size_append, $(filter-out FORCE,$^))) > $@ || \
 	(rm -f $@ ; false)
+
+# misc stuff
+# ---------------------------------------------------------------------------
+quote:="
-- 
1.6.4.2


-- 
Regards/Gruss,
Boris.

-
Advanced Micro Devices, Inc.
Operating Systems Research Center
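
[Editor's note: a stand-alone worked example of the SWAR steps used by
__sw_hweight8() in lib/hweight.c above; the trace in the comment is the
editor's, not part of the patch.]

#include <assert.h>

static unsigned int sw_hweight8(unsigned int w)
{
	unsigned int res = w - ((w >> 1) & 0x55);	/* per-pair bit counts   */
	res = (res & 0x33) + ((res >> 2) & 0x33);	/* per-nibble bit counts */
	return (res + (res >> 4)) & 0x0F;		/* whole-byte bit count  */
}

int main(void)
{
	/*
	 * w = 0xb7 = 1011 0111, six bits set:
	 *   0xb7 - ((0xb7 >> 1) & 0x55)          = 0xb7 - 0x51 = 0x66  (pairs  01 10 01 10)
	 *   (0x66 & 0x33) + ((0x66 >> 2) & 0x33) = 0x22 + 0x11 = 0x33  (nibbles 3 and 3)
	 *   (0x33 + (0x33 >> 4)) & 0x0F          = 0x36 & 0x0F = 6
	 */
	assert(sw_hweight8(0xb7) == 6);
	return 0;
}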

^ permalink raw reply related	[flat|nested] 81+ messages in thread

* Re: [PATCH] x86: Add optimized popcnt variants
  2010-02-23 19:06                                                                                       ` Borislav Petkov
@ 2010-02-26  5:27                                                                                         ` H. Peter Anvin
  2010-02-26  7:47                                                                                           ` Borislav Petkov
  0 siblings, 1 reply; 81+ messages in thread
From: H. Peter Anvin @ 2010-02-26  5:27 UTC (permalink / raw)
  To: Borislav Petkov
  Cc: Michal Marek, linux-kbuild, Peter Zijlstra, Andrew Morton,
	Wu Fengguang, LKML, Jamie Lokier, Roland Dreier, Al Viro,
	linux-fsdevel@vger.kernel.org, Ingo Molnar, Brian Gerst

On 02/23/2010 11:06 AM, Borislav Petkov wrote:
> 
> Ok, here you go.
> 
> --
> From: Borislav Petkov <borislav.petkov@amd.com>
> Date: Thu, 11 Feb 2010 00:48:31 +0100
> Subject: [PATCH] x86: Add optimized popcnt variants
> 
> Add support for the hardware version of the Hamming weight function,
> popcnt, present in CPUs which advertise it under CPUID, Function
> 0x0000_0001_ECX[23]. On CPUs which don't support it, we fall back to the
> default lib/hweight.c sw versions.
> 
> A synthetic benchmark comparing popcnt with __sw_hweight64 showed almost
> a 3x speedup on an F10h machine.
> 
> Signed-off-by: Borislav Petkov <borislav.petkov@amd.com>

OK, this patch looks pretty good now, but I'm completely lost as to what
the baseline of this patch is supposed to be.

	-hpa

-- 
H. Peter Anvin, Intel Open Source Technology Center
I work for Intel.  I don't speak on their behalf.


^ permalink raw reply	[flat|nested] 81+ messages in thread

* Re: [PATCH] x86: Add optimized popcnt variants
  2010-02-26  5:27                                                                                         ` H. Peter Anvin
@ 2010-02-26  7:47                                                                                           ` Borislav Petkov
  2010-02-26 17:48                                                                                             ` H. Peter Anvin
  0 siblings, 1 reply; 81+ messages in thread
From: Borislav Petkov @ 2010-02-26  7:47 UTC (permalink / raw)
  To: H. Peter Anvin
  Cc: Borislav Petkov, Michal Marek, linux-kbuild, Peter Zijlstra,
	Andrew Morton, Wu Fengguang, LKML, Jamie Lokier, Roland Dreier,
	Al Viro, linux-fsdevel@vger.kernel.org, Ingo Molnar, Brian Gerst

From: "H. Peter Anvin" <hpa@zytor.com>
Date: Thu, Feb 25, 2010 at 09:27:25PM -0800

> OK, this patch looks pretty good now, but I'm completely lost as to
> what the baseline of this patch is supposed to be.

Yeah, this is based on PeterZ's http://lkml.org/lkml/2010/2/4/119

But I'm not sure which tree has it...

-- 
Regards/Gruss,
Boris.

-
Advanced Micro Devices, Inc.
Operating Systems Research Center

^ permalink raw reply	[flat|nested] 81+ messages in thread

* Re: [PATCH] x86: Add optimized popcnt variants
  2010-02-26  7:47                                                                                           ` Borislav Petkov
@ 2010-02-26 17:48                                                                                             ` H. Peter Anvin
  2010-02-27  8:28                                                                                               ` Borislav Petkov
  0 siblings, 1 reply; 81+ messages in thread
From: H. Peter Anvin @ 2010-02-26 17:48 UTC (permalink / raw)
  To: Borislav Petkov
  Cc: Michal Marek, linux-kbuild, Peter Zijlstra, Andrew Morton,
	Wu Fengguang, LKML, Jamie Lokier, Roland Dreier, Al Viro,
	linux-fsdevel@vger.kernel.org, Ingo Molnar, Brian Gerst,
	Andrew Morton

On 02/25/2010 11:47 PM, Borislav Petkov wrote:
> From: "H. Peter Anvin" <hpa@zytor.com>
> Date: Thu, Feb 25, 2010 at 09:27:25PM -0800
> 
>> OK, this patch looks pretty good now, but I'm completely lost as to
>> what the baseline of this patch is supposed to be.
> 
> Yeah, this is based on PeterZ's http://lkml.org/lkml/2010/2/4/119
> 
> But I'm not sure which tree has it...
> 

Looks like -mm, which really means that either Andrew has to take your
patch, too, or we have to wait until that is upstream before we can merge
your patch.

I'm a little nervous about just acking the patch and telling Andrew to
test it, because I don't know what the fallout would look like.  I'm
particularly concerned about gcc version dependencies.

I guess, on the other hand, if it ends up not getting merged until .35
it's not a huge deal either.

	-hpa

-- 
H. Peter Anvin, Intel Open Source Technology Center
I work for Intel.  I don't speak on their behalf.


^ permalink raw reply	[flat|nested] 81+ messages in thread

* Re: [PATCH] x86: Add optimized popcnt variants
  2010-02-26 17:48                                                                                             ` H. Peter Anvin
@ 2010-02-27  8:28                                                                                               ` Borislav Petkov
  2010-02-27 20:00                                                                                                 ` H. Peter Anvin
  0 siblings, 1 reply; 81+ messages in thread
From: Borislav Petkov @ 2010-02-27  8:28 UTC (permalink / raw)
  To: H. Peter Anvin
  Cc: Borislav Petkov, Michal Marek, linux-kbuild, Peter Zijlstra,
	Andrew Morton, Wu Fengguang, LKML, Jamie Lokier, Roland Dreier,
	Al Viro, linux-fsdevel@vger.kernel.org, Ingo Molnar, Brian Gerst

From: "H. Peter Anvin" <hpa@zytor.com>
Date: Fri, Feb 26, 2010 at 09:48:46AM -0800

> Looks like -mm, which really means that either Andrew has to take your
> patch, too, or we have to wait until that is upstream before we can merge
> your patch.
> 
> I'm a little nervous about just acking the patch and telling Andrew to
> test it, because I don't know what the fallout would look like.  I'm
> particularly concerned about gcc version dependencies.

I have the same concern. Actually, I'd be much more at ease if it saw a
bit of wider testing without hitting mainline just yet. I'll try to give
it some more testing with the machines and toolchains I can get my hands
on next week.

> I guess, on the other hand, if it ends up not getting merged until .35
> it's not a huge deal either.

Yeah, let's give it another round of testing and queue it for .35 -
AFAIR Ingo runs also a wide testing effort so it spending another cycle
in -tip and being hammered on by us could give us a bit more certainty.

Thanks.

-- 
Regards/Gruss,
Boris.

-
Advanced Micro Devices, Inc.
Operating Systems Research Center

^ permalink raw reply	[flat|nested] 81+ messages in thread

* Re: [PATCH] x86: Add optimized popcnt variants
  2010-02-27  8:28                                                                                               ` Borislav Petkov
@ 2010-02-27 20:00                                                                                                 ` H. Peter Anvin
  2010-03-09 15:36                                                                                                   ` Borislav Petkov
                                                                                                                     ` (3 more replies)
  0 siblings, 4 replies; 81+ messages in thread
From: H. Peter Anvin @ 2010-02-27 20:00 UTC (permalink / raw)
  To: Borislav Petkov
  Cc: Michal Marek, linux-kbuild, Peter Zijlstra, Andrew Morton,
	Wu Fengguang, LKML, Jamie Lokier, Roland Dreier, Al Viro,
	linux-fsdevel@vger.kernel.org, Ingo Molnar, Brian Gerst

On 02/27/2010 12:28 AM, Borislav Petkov wrote:
> 
>> I guess, on the other hand, if it ends up not getting merged until .35
>> it's not a huge deal either.
> 
> Yeah, let's give it another round of testing and queue it for .35 -
> AFAIR Ingo runs also a wide testing effort so it spending another cycle
> in -tip and being hammered on by us could give us a bit more certainty.
> 

Yes, if we can get into -tip then we'll get more test coverage, so I'll
queue it up for .35 as soon as the merge window closes.  Please remind
me if I forget.

	-hpa

-- 
H. Peter Anvin, Intel Open Source Technology Center
I work for Intel.  I don't speak on their behalf.


^ permalink raw reply	[flat|nested] 81+ messages in thread

* Re: [PATCH] x86: Add optimized popcnt variants
  2010-02-27 20:00                                                                                                 ` H. Peter Anvin
@ 2010-03-09 15:36                                                                                                   ` Borislav Petkov
  2010-03-09 15:50                                                                                                     ` Peter Zijlstra
  2010-03-18 11:17                                                                                                   ` Borislav Petkov
                                                                                                                     ` (2 subsequent siblings)
  3 siblings, 1 reply; 81+ messages in thread
From: Borislav Petkov @ 2010-03-09 15:36 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: H. Peter Anvin, Michal Marek, linux-kbuild, Andrew Morton,
	Wu Fengguang, LKML, Jamie Lokier, Roland Dreier, Al Viro,
	linux-fsdevel@vger.kernel.org, Ingo Molnar, Brian Gerst

> On 02/27/2010 12:28 AM, Borislav Petkov wrote:
> > Yeah, let's give it another round of testing and queue it for .35 -
> > AFAIR Ingo runs also a wide testing effort so it spending another cycle
> > in -tip and being hammered on by us could give us a bit more certainty.
> > 
> 
> Yes, if we can get into -tip then we'll get more test coverage, so I'll
> queue it up for .35 as soon as the merge window closes.  Please remind
> me if I forget.

Hi Peter,

I see that you've added the HWEIGHT-capitalized interfaces for
compile-time constants with fce877e3. Which means, the bits in
<include/asm-generic/bitops/const_hweight.h> from your original patch at
http://lkml.org/lkml/2010/2/4/119 need changing (or have changed already
but I've missed them).

IOW, where can I get the current version of that last patch so that I can
base my __arch_hweightXX stuff on top of it for testing?

Thanks.

-- 
Regards/Gruss,
Boris.

-
Advanced Micro Devices, Inc.
Operating Systems Research Center

^ permalink raw reply	[flat|nested] 81+ messages in thread

* Re: [PATCH] x86: Add optimized popcnt variants
  2010-03-09 15:36                                                                                                   ` Borislav Petkov
@ 2010-03-09 15:50                                                                                                     ` Peter Zijlstra
  2010-03-09 16:23                                                                                                       ` Borislav Petkov
  0 siblings, 1 reply; 81+ messages in thread
From: Peter Zijlstra @ 2010-03-09 15:50 UTC (permalink / raw)
  To: Borislav Petkov
  Cc: H. Peter Anvin, Michal Marek, linux-kbuild, Andrew Morton,
	Wu Fengguang, LKML, Jamie Lokier, Roland Dreier, Al Viro,
	linux-fsdevel@vger.kernel.org, Ingo Molnar, Brian Gerst

On Tue, 2010-03-09 at 16:36 +0100, Borislav Petkov wrote:
> > On 02/27/2010 12:28 AM, Borislav Petkov wrote:
> > > Yeah, let's give it another round of testing and queue it for .35 -
> > > AFAIR Ingo runs also a wide testing effort so it spending another cycle
> > > in -tip and being hammered on by us could give us a bit more certainty.
> > > 
> > 
> > Yes, if we can get into -tip then we'll get more test coverage, so I'll
> > queue it up for .35 as soon as the merge window closes.  Please remind
> > me if I forget.
> 
> Hi Peter,
> 
> I see that you've added the HWEIGHT-capitalized interfaces for
> compile-time constants with fce877e3. Which means, the bits in
> <include/asm-generic/bitops/const_hweight.h> from your original patch at
> http://lkml.org/lkml/2010/2/4/119 need changing (or have changed already
> but I've missed them).
> 
> IOW, where can I get the current version of that last patch so that I can
> base my __arch_hweightXX stuff on top of it for testing?

Should all be fine as it is, that patch
( http://lkml.org/lkml/2010/2/4/119 ) is against a kernel with fce877e3
in, I've just checked and it still applies to tip/master as of this
writing (although it grew a single 2 line offset for 1 hunk).




^ permalink raw reply	[flat|nested] 81+ messages in thread

* Re: [PATCH] x86: Add optimized popcnt variants
  2010-03-09 15:50                                                                                                     ` Peter Zijlstra
@ 2010-03-09 16:23                                                                                                       ` Borislav Petkov
  2010-03-09 16:32                                                                                                         ` Peter Zijlstra
  0 siblings, 1 reply; 81+ messages in thread
From: Borislav Petkov @ 2010-03-09 16:23 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: H. Peter Anvin, Michal Marek, linux-kbuild, Andrew Morton,
	Wu Fengguang, LKML, Jamie Lokier, Roland Dreier, Al Viro,
	linux-fsdevel@vger.kernel.org, Ingo Molnar, Brian Gerst

From: Peter Zijlstra <peterz@infradead.org>
Date: Tue, Mar 09, 2010 at 04:50:40PM +0100

> Should all be fine as it is, that patch
> ( http://lkml.org/lkml/2010/2/4/119 ) is against a kernel with fce877e3
> in, I've just checked and it still applies to tip/master as of this
> writing (although it grew a single 2 line offset for 1 hunk).

Well, this way, I'm getting

...
In file included from include/linux/kernel.h:15,
                 from /home/linux-2.6/arch/x86/include/asm/percpu.h:45,
                 from /home/linux-2.6/arch/x86/include/asm/current.h:5,
                 from /home/linux-2.6/arch/x86/include/asm/processor.h:15,
                 from /home/linux-2.6/arch/x86/include/asm/atomic.h:6,
                 from include/linux/crypto.h:20,
                 from arch/x86/kernel/asm-offsets_64.c:8,
                 from arch/x86/kernel/asm-offsets.c:4:
include/linux/bitops.h:52:1: warning: "HWEIGHT8" redefined
...

due to the fact that we have multiple definitions of HWEIGHT*:

The one batch is in <include/linux/bitops.h>, introduced by fce877e3.

The other is in <include/asm-generic/bitops/const_hweight.h>, which
is pulled into <include/linux/bitops.h> through "#include
<asm/bitops.h>", which, in turn, includes <asm/arch_hweight.h> and
<include/asm-generic/bitops/const_hweight.h>.

The obvious resolution is to remove the HWEIGHT* batch from
<include/asm-generic/bitops/const_hweight.h> since they're functionally
identical to the ones in <include/linux/bitops.h>, no?

-- 
Regards/Gruss,
Boris.

-
Advanced Micro Devices, Inc.
Operating Systems Research Center
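
[Editor's note: a minimal stand-alone illustration of the clash described
above. The macro bodies are shortened stand-ins, not the real definitions;
the point is that the preprocessor only tolerates redefining a macro when
the replacement text is identical, and these two are not.]

/* as in include/linux/bitops.h after fce877e3 (bits expanded inline) */
#define HWEIGHT8(w)	((!!((w) & 1)) + (!!((w) & 2)))

/* as pulled in again via asm/bitops.h -> const_hweight.h */
#define HWEIGHT8(w)	(__const_hweight8(w))	/* warning: "HWEIGHT8" redefined */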

^ permalink raw reply	[flat|nested] 81+ messages in thread

* Re: [PATCH] x86: Add optimized popcnt variants
  2010-03-09 16:23                                                                                                       ` Borislav Petkov
@ 2010-03-09 16:32                                                                                                         ` Peter Zijlstra
  2010-03-09 17:32                                                                                                           ` Borislav Petkov
  0 siblings, 1 reply; 81+ messages in thread
From: Peter Zijlstra @ 2010-03-09 16:32 UTC (permalink / raw)
  To: Borislav Petkov
  Cc: H. Peter Anvin, Michal Marek, linux-kbuild, Andrew Morton,
	Wu Fengguang, LKML, Jamie Lokier, Roland Dreier, Al Viro,
	linux-fsdevel@vger.kernel.org, Ingo Molnar, Brian Gerst

On Tue, 2010-03-09 at 17:23 +0100, Borislav Petkov wrote:
> From: Peter Zijlstra <peterz@infradead.org>
> Date: Tue, Mar 09, 2010 at 04:50:40PM +0100
> 
> > Should all be fine as it is, that patch
> > ( http://lkml.org/lkml/2010/2/4/119 ) is against a kernel with fce877e3
> > in, I've just checked and it still applies to tip/master as of this
> > writing (although it grew a single 2 line offset for 1 hunk).
> 
> Well, this way, I'm getting
> 
> ...
> In file included from include/linux/kernel.h:15,
>                  from /home/linux-2.6/arch/x86/include/asm/percpu.h:45,
>                  from /home/linux-2.6/arch/x86/include/asm/current.h:5,
>                  from /home/linux-2.6/arch/x86/include/asm/processor.h:15,
>                  from /home/linux-2.6/arch/x86/include/asm/atomic.h:6,
>                  from include/linux/crypto.h:20,
>                  from arch/x86/kernel/asm-offsets_64.c:8,
>                  from arch/x86/kernel/asm-offsets.c:4:
> include/linux/bitops.h:52:1: warning: "HWEIGHT8" redefined
> ...
> 
> due to the fact that we have multiple definitions of HWEIGHT*:
> 
> The one batch is in <include/linux/bitops.h> introduced by fce877e3.
> 
> The other is in <include/asm-generic/bitops/const_hweight.h> which
> is pulled in into <include/linux/bitops.h> through "#include
> <asm/bitops.h>", which, in turn, <includes asm/arch_hweight.h> and
> <include/asm-generic/bitops/const_hweight.h>.
> 
> The obvious resolution is to remove the HWEIGHT* batch from
> <include/asm-generic/bitops/const_hweight.h> since they're functionally
> identical with the ones in <include/linux/bitops.h>, no?

I thought the patch did that, see this hunk (straight from
http://lkml.org/lkml/2010/2/4/119 ):


---
Index: linux-2.6/include/linux/bitops.h
===================================================================
--- linux-2.6.orig/include/linux/bitops.h
+++ linux-2.6/include/linux/bitops.h
@@ -45,31 +45,6 @@ static inline unsigned long hweight_long
 	return sizeof(w) == 4 ? hweight32(w) : hweight64(w);
 }
 
-/*
- * Clearly slow versions of the hweightN() functions, their benefit is
- * of course compile time evaluation of constant arguments.
- */
-#define HWEIGHT8(w)					\
-      (	BUILD_BUG_ON_ZERO(!__builtin_constant_p(w)) +	\
-	(!!((w) & (1ULL << 0))) +			\
-	(!!((w) & (1ULL << 1))) +			\
-	(!!((w) & (1ULL << 2))) +			\
-	(!!((w) & (1ULL << 3))) +			\
-	(!!((w) & (1ULL << 4))) +			\
-	(!!((w) & (1ULL << 5))) +			\
-	(!!((w) & (1ULL << 6))) +			\
-	(!!((w) & (1ULL << 7)))	)
-
-#define HWEIGHT16(w) (HWEIGHT8(w)  + HWEIGHT8((w) >> 8))
-#define HWEIGHT32(w) (HWEIGHT16(w) + HWEIGHT16((w) >> 16))
-#define HWEIGHT64(w) (HWEIGHT32(w) + HWEIGHT32((w) >> 32))
-
-/*
- * Type invariant version that simply casts things to the
- * largest type.
- */
-#define HWEIGHT(w)   HWEIGHT64((u64)(w))
-
 /**
  * rol32 - rotate a 32-bit value left
  * @word: value to rotate


^ permalink raw reply	[flat|nested] 81+ messages in thread

* Re: [PATCH] x86: Add optimized popcnt variants
  2010-03-09 16:32                                                                                                         ` Peter Zijlstra
@ 2010-03-09 17:32                                                                                                           ` Borislav Petkov
  2010-03-09 17:37                                                                                                             ` Peter Zijlstra
  0 siblings, 1 reply; 81+ messages in thread
From: Borislav Petkov @ 2010-03-09 17:32 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: H. Peter Anvin, Michal Marek, linux-kbuild, Andrew Morton,
	Wu Fengguang, LKML, Jamie Lokier, Roland Dreier, Al Viro,
	linux-fsdevel@vger.kernel.org, Ingo Molnar, Brian Gerst

From: Peter Zijlstra <peterz@infradead.org>
Date: Tue, Mar 09, 2010 at 05:32:49PM +0100

> I thought the patch did that, see this hunk (straight from
> http://lkml.org/lkml/2010/2/4/119 ):

Bollocks, I seem to have lost that hunk while applying the patch by
hand; sorry for the noise.

By the way, I can't seem to find your patch in Andrew's tree - is it
still going through his tree?

-- 
Regards/Gruss,
Boris.

-
Advanced Micro Devices, Inc.
Operating Systems Research Center

^ permalink raw reply	[flat|nested] 81+ messages in thread

* Re: [PATCH] x86: Add optimized popcnt variants
  2010-03-09 17:32                                                                                                           ` Borislav Petkov
@ 2010-03-09 17:37                                                                                                             ` Peter Zijlstra
  0 siblings, 0 replies; 81+ messages in thread
From: Peter Zijlstra @ 2010-03-09 17:37 UTC (permalink / raw)
  To: Borislav Petkov
  Cc: H. Peter Anvin, Michal Marek, linux-kbuild, Andrew Morton,
	Wu Fengguang, LKML, Jamie Lokier, Roland Dreier, Al Viro,
	linux-fsdevel@vger.kernel.org, Ingo Molnar, Brian Gerst

On Tue, 2010-03-09 at 18:32 +0100, Borislav Petkov wrote:
> From: Peter Zijlstra <peterz@infradead.org>
> Date: Tue, Mar 09, 2010 at 05:32:49PM +0100
> 
> > I thought the patch did that, see this hunk (straight from
> > http://lkml.org/lkml/2010/2/4/119 ):
> 
> Bollocks, I seem to have lost that hunk while applying the patch by
> foot, sorry for the noise.
> 
> By the way, I can't seem to find your patch in Andrew's tree, is it
> still going through his tree?

I hope so. Andrew, do you need me to resend, or do you still have a copy?




^ permalink raw reply	[flat|nested] 81+ messages in thread

* Re: [PATCH] x86: Add optimized popcnt variants
  2010-02-27 20:00                                                                                                 ` H. Peter Anvin
  2010-03-09 15:36                                                                                                   ` Borislav Petkov
@ 2010-03-18 11:17                                                                                                   ` Borislav Petkov
  2010-03-18 11:19                                                                                                   ` [PATCH 1/2] bitops: Optimize hweight() by making use of compile-time evaluation Borislav Petkov
  2010-03-18 11:20                                                                                                   ` [PATCH 2/2] x86: Add optimized popcnt variants Borislav Petkov
  3 siblings, 0 replies; 81+ messages in thread
From: Borislav Petkov @ 2010-03-18 11:17 UTC (permalink / raw)
  To: H. Peter Anvin
  Cc: Michal Marek, linux-kbuild, Peter Zijlstra, Andrew Morton,
	Wu Fengguang, LKML, Jamie Lokier, Roland Dreier, Al Viro,
	linux-fsdevel@vger.kernel.org, Ingo Molnar, Brian Gerst

From: "H. Peter Anvin" <hpa@zytor.com>
Date: Sat, Feb 27, 2010 at 12:00:26PM -0800

> On 02/27/2010 12:28 AM, Borislav Petkov wrote:
> > 
> >> I guess, on the other hand, if it ends up not getting merged until .35
> >> it's not a huge deal either.
> > 
> > Yeah, let's give it another round of testing and queue it for .35 -
> > AFAIR Ingo runs also a wide testing effort so it spending another cycle
> > in -tip and being hammered on by us could give us a bit more certainty.
> > 
> 
> Yes, if we can get into -tip then we'll get more test coverage, so I'll
> queue it up for .35 as soon as the merge window closes.  Please remind
> me if I forget.

Ok, I've been pretty busy lately and this got pushed back on the todo
list. I finally got around to doing some build-testing with a bunch of
compilers, and the -fcall-saved-* flags seem to be accepted. I haven't
stared at the asm output yet, though:


command:
make CC=gcc-<version> HOSTCC=gcc-<version> -j4

compile stats (64bit only):

not ok:
- gcc-3.3 (GCC) 3.3.5 (Debian 1:3.3.5-13):	OOM KILLER goes off, gcc-3.3 leak maybe

ok:
- gcc-3.4 (GCC) 3.4.4 20050314 (prerelease) (Debian 3.4.3-13sarge1)
- gcc-4.1 (GCC) 4.1.3 20080704 (prerelease) (Debian 4.1.2-27)
- gcc-4.3 (Debian 4.3.4-6) 4.3.4
- gcc (Debian 4.4.2-6) 4.4.2
- gcc (Debian 4.4.3-3) 4.4.3

- gcc-3.4.6 (GCC) 3.4.6 (Gentoo 3.4.6-r2 p1.6, ssp-3.4.6-1.0, pie-8.7.10)
- gcc-4.1.2 (GCC) 4.1.2 (Gentoo 4.1.2 p1.3)

I'm attaching the versions of the patches I'm using. The first one by
PeterZ touches a bunch of arches and Andrew hasn't picked it up yet, so
the question of how to get the second (popcnt) patch wider testing in
some tree is still unresolved. Suggestions, ideas?

Thanks.

-- 
Regards/Gruss,
Boris.

--
Advanced Micro Devices, Inc.
Operating Systems Research Center

^ permalink raw reply	[flat|nested] 81+ messages in thread

* [PATCH 1/2] bitops: Optimize hweight() by making use of compile-time evaluation
  2010-02-27 20:00                                                                                                 ` H. Peter Anvin
  2010-03-09 15:36                                                                                                   ` Borislav Petkov
  2010-03-18 11:17                                                                                                   ` Borislav Petkov
@ 2010-03-18 11:19                                                                                                   ` Borislav Petkov
  2010-03-18 11:20                                                                                                   ` [PATCH 2/2] x86: Add optimized popcnt variants Borislav Petkov
  3 siblings, 0 replies; 81+ messages in thread
From: Borislav Petkov @ 2010-03-18 11:19 UTC (permalink / raw)
  To: H. Peter Anvin
  Cc: Michal Marek, linux-kbuild, Peter Zijlstra, Andrew Morton,
	Wu Fengguang, LKML, Jamie Lokier, Roland Dreier, Al Viro,
	linux-fsdevel@vger.kernel.org, Ingo Molnar, Brian Gerst

From: Peter Zijlstra <a.p.zijlstra@chello.nl>
Date: Mon, 1 Feb 2010 15:03:07 +0100
Subject: [PATCH 1/2] bitops: Optimize hweight() by making use of compile-time evaluation

Rename the existing runtime hweight() implementations to
__arch_hweight(), rename the compile-time versions to __const_hweight()
and then have hweight() pick between them.

Suggested-by: H. Peter Anvin <hpa@zytor.com>
Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
Acked-by: H. Peter Anvin <hpa@zytor.com>
LKML-Reference: <1265028224.24455.154.camel@laptop>
---
 arch/alpha/include/asm/bitops.h            |   18 ++++++-----
 arch/ia64/include/asm/bitops.h             |   11 ++++---
 arch/sparc/include/asm/bitops_64.h         |   11 ++++---
 include/asm-generic/bitops/arch_hweight.h  |   11 +++++++
 include/asm-generic/bitops/const_hweight.h |   42 ++++++++++++++++++++++++++++
 include/asm-generic/bitops/hweight.h       |    8 +----
 include/linux/bitops.h                     |   25 ----------------
 lib/hweight.c                              |   19 ++++++------
 8 files changed, 87 insertions(+), 58 deletions(-)
 create mode 100644 include/asm-generic/bitops/arch_hweight.h
 create mode 100644 include/asm-generic/bitops/const_hweight.h

diff --git a/arch/alpha/include/asm/bitops.h b/arch/alpha/include/asm/bitops.h
index 15f3ae2..296da1d 100644
--- a/arch/alpha/include/asm/bitops.h
+++ b/arch/alpha/include/asm/bitops.h
@@ -405,29 +405,31 @@ static inline int fls(int x)
 
 #if defined(CONFIG_ALPHA_EV6) && defined(CONFIG_ALPHA_EV67)
 /* Whee.  EV67 can calculate it directly.  */
-static inline unsigned long hweight64(unsigned long w)
+static inline unsigned long __arch_hweight64(unsigned long w)
 {
 	return __kernel_ctpop(w);
 }
 
-static inline unsigned int hweight32(unsigned int w)
+static inline unsigned int __arch_hweight32(unsigned int w)
 {
-	return hweight64(w);
+	return __arch_hweight64(w);
 }
 
-static inline unsigned int hweight16(unsigned int w)
+static inline unsigned int __arch_hweight16(unsigned int w)
 {
-	return hweight64(w & 0xffff);
+	return __arch_hweight64(w & 0xffff);
 }
 
-static inline unsigned int hweight8(unsigned int w)
+static inline unsigned int __arch_hweight8(unsigned int w)
 {
-	return hweight64(w & 0xff);
+	return __arch_hweight64(w & 0xff);
 }
 #else
-#include <asm-generic/bitops/hweight.h>
+#include <asm-generic/bitops/arch_hweight.h>
 #endif
 
+#include <asm-generic/bitops/const_hweight.h>
+
 #endif /* __KERNEL__ */
 
 #include <asm-generic/bitops/find.h>
diff --git a/arch/ia64/include/asm/bitops.h b/arch/ia64/include/asm/bitops.h
index 6ebc229..9da3df6 100644
--- a/arch/ia64/include/asm/bitops.h
+++ b/arch/ia64/include/asm/bitops.h
@@ -437,17 +437,18 @@ __fls (unsigned long x)
  * hweightN: returns the hamming weight (i.e. the number
  * of bits set) of a N-bit word
  */
-static __inline__ unsigned long
-hweight64 (unsigned long x)
+static __inline__ unsigned long __arch_hweight64(unsigned long x)
 {
 	unsigned long result;
 	result = ia64_popcnt(x);
 	return result;
 }
 
-#define hweight32(x)	(unsigned int) hweight64((x) & 0xfffffffful)
-#define hweight16(x)	(unsigned int) hweight64((x) & 0xfffful)
-#define hweight8(x)	(unsigned int) hweight64((x) & 0xfful)
+#define __arch_hweight32(x) ((unsigned int) __arch_hweight64((x) & 0xfffffffful))
+#define __arch_hweight16(x) ((unsigned int) __arch_hweight64((x) & 0xfffful))
+#define __arch_hweight8(x)  ((unsigned int) __arch_hweight64((x) & 0xfful))
+
+#include <asm-generic/bitops/const_hweight.h>
 
 #endif /* __KERNEL__ */
 
diff --git a/arch/sparc/include/asm/bitops_64.h b/arch/sparc/include/asm/bitops_64.h
index e72ac9c..766121a 100644
--- a/arch/sparc/include/asm/bitops_64.h
+++ b/arch/sparc/include/asm/bitops_64.h
@@ -44,7 +44,7 @@ extern void change_bit(unsigned long nr, volatile unsigned long *addr);
 
 #ifdef ULTRA_HAS_POPULATION_COUNT
 
-static inline unsigned int hweight64(unsigned long w)
+static inline unsigned int __arch_hweight64(unsigned long w)
 {
 	unsigned int res;
 
@@ -52,7 +52,7 @@ static inline unsigned int hweight64(unsigned long w)
 	return res;
 }
 
-static inline unsigned int hweight32(unsigned int w)
+static inline unsigned int __arch_hweight32(unsigned int w)
 {
 	unsigned int res;
 
@@ -60,7 +60,7 @@ static inline unsigned int hweight32(unsigned int w)
 	return res;
 }
 
-static inline unsigned int hweight16(unsigned int w)
+static inline unsigned int __arch_hweight16(unsigned int w)
 {
 	unsigned int res;
 
@@ -68,7 +68,7 @@ static inline unsigned int hweight16(unsigned int w)
 	return res;
 }
 
-static inline unsigned int hweight8(unsigned int w)
+static inline unsigned int __arch_hweight8(unsigned int w)
 {
 	unsigned int res;
 
@@ -78,9 +78,10 @@ static inline unsigned int hweight8(unsigned int w)
 
 #else
 
-#include <asm-generic/bitops/hweight.h>
+#include <asm-generic/bitops/arch_hweight.h>
 
 #endif
+#include <asm-generic/bitops/const_hweight.h>
 #include <asm-generic/bitops/lock.h>
 #endif /* __KERNEL__ */
 
diff --git a/include/asm-generic/bitops/arch_hweight.h b/include/asm-generic/bitops/arch_hweight.h
new file mode 100644
index 0000000..3a7be84
--- /dev/null
+++ b/include/asm-generic/bitops/arch_hweight.h
@@ -0,0 +1,11 @@
+#ifndef _ASM_GENERIC_BITOPS_ARCH_HWEIGHT_H_
+#define _ASM_GENERIC_BITOPS_ARCH_HWEIGHT_H_
+
+#include <asm/types.h>
+
+extern unsigned int __arch_hweight32(unsigned int w);
+extern unsigned int __arch_hweight16(unsigned int w);
+extern unsigned int __arch_hweight8(unsigned int w);
+extern unsigned long __arch_hweight64(__u64 w);
+
+#endif /* _ASM_GENERIC_BITOPS_ARCH_HWEIGHT_H_ */
diff --git a/include/asm-generic/bitops/const_hweight.h b/include/asm-generic/bitops/const_hweight.h
new file mode 100644
index 0000000..fa2a50b
--- /dev/null
+++ b/include/asm-generic/bitops/const_hweight.h
@@ -0,0 +1,42 @@
+#ifndef _ASM_GENERIC_BITOPS_CONST_HWEIGHT_H_
+#define _ASM_GENERIC_BITOPS_CONST_HWEIGHT_H_
+
+/*
+ * Compile time versions of __arch_hweightN()
+ */
+#define __const_hweight8(w)		\
+      (	(!!((w) & (1ULL << 0))) +	\
+	(!!((w) & (1ULL << 1))) +	\
+	(!!((w) & (1ULL << 2))) +	\
+	(!!((w) & (1ULL << 3))) +	\
+	(!!((w) & (1ULL << 4))) +	\
+	(!!((w) & (1ULL << 5))) +	\
+	(!!((w) & (1ULL << 6))) +	\
+	(!!((w) & (1ULL << 7)))	)
+
+#define __const_hweight16(w) (__const_hweight8(w)  + __const_hweight8((w)  >> 8 ))
+#define __const_hweight32(w) (__const_hweight16(w) + __const_hweight16((w) >> 16))
+#define __const_hweight64(w) (__const_hweight32(w) + __const_hweight32((w) >> 32))
+
+/*
+ * Generic interface.
+ */
+#define hweight8(w)  (__builtin_constant_p(w) ? __const_hweight8(w)  : __arch_hweight8(w))
+#define hweight16(w) (__builtin_constant_p(w) ? __const_hweight16(w) : __arch_hweight16(w))
+#define hweight32(w) (__builtin_constant_p(w) ? __const_hweight32(w) : __arch_hweight32(w))
+#define hweight64(w) (__builtin_constant_p(w) ? __const_hweight64(w) : __arch_hweight64(w))
+
+/*
+ * Interface for known constant arguments
+ */
+#define HWEIGHT8(w)  (BUILD_BUG_ON_ZERO(!__builtin_constant_p(w)) + __const_hweight8(w))
+#define HWEIGHT16(w) (BUILD_BUG_ON_ZERO(!__builtin_constant_p(w)) + __const_hweight16(w))
+#define HWEIGHT32(w) (BUILD_BUG_ON_ZERO(!__builtin_constant_p(w)) + __const_hweight32(w))
+#define HWEIGHT64(w) (BUILD_BUG_ON_ZERO(!__builtin_constant_p(w)) + __const_hweight64(w))
+
+/*
+ * Type invariant interface to the compile time constant hweight functions.
+ */
+#define HWEIGHT(w)   HWEIGHT64((u64)(w))
+
+#endif /* _ASM_GENERIC_BITOPS_CONST_HWEIGHT_H_ */
diff --git a/include/asm-generic/bitops/hweight.h b/include/asm-generic/bitops/hweight.h
index fbbc383..a94d651 100644
--- a/include/asm-generic/bitops/hweight.h
+++ b/include/asm-generic/bitops/hweight.h
@@ -1,11 +1,7 @@
 #ifndef _ASM_GENERIC_BITOPS_HWEIGHT_H_
 #define _ASM_GENERIC_BITOPS_HWEIGHT_H_
 
-#include <asm/types.h>
-
-extern unsigned int hweight32(unsigned int w);
-extern unsigned int hweight16(unsigned int w);
-extern unsigned int hweight8(unsigned int w);
-extern unsigned long hweight64(__u64 w);
+#include <asm-generic/bitops/arch_hweight.h>
+#include <asm-generic/bitops/const_hweight.h>
 
 #endif /* _ASM_GENERIC_BITOPS_HWEIGHT_H_ */
diff --git a/include/linux/bitops.h b/include/linux/bitops.h
index b793898..c55d5bc 100644
--- a/include/linux/bitops.h
+++ b/include/linux/bitops.h
@@ -47,31 +47,6 @@ static inline unsigned long hweight_long(unsigned long w)
 	return sizeof(w) == 4 ? hweight32(w) : hweight64(w);
 }
 
-/*
- * Clearly slow versions of the hweightN() functions, their benefit is
- * of course compile time evaluation of constant arguments.
- */
-#define HWEIGHT8(w)					\
-      (	BUILD_BUG_ON_ZERO(!__builtin_constant_p(w)) +	\
-	(!!((w) & (1ULL << 0))) +			\
-	(!!((w) & (1ULL << 1))) +			\
-	(!!((w) & (1ULL << 2))) +			\
-	(!!((w) & (1ULL << 3))) +			\
-	(!!((w) & (1ULL << 4))) +			\
-	(!!((w) & (1ULL << 5))) +			\
-	(!!((w) & (1ULL << 6))) +			\
-	(!!((w) & (1ULL << 7)))	)
-
-#define HWEIGHT16(w) (HWEIGHT8(w)  + HWEIGHT8((w) >> 8))
-#define HWEIGHT32(w) (HWEIGHT16(w) + HWEIGHT16((w) >> 16))
-#define HWEIGHT64(w) (HWEIGHT32(w) + HWEIGHT32((w) >> 32))
-
-/*
- * Type invariant version that simply casts things to the
- * largest type.
- */
-#define HWEIGHT(w)   HWEIGHT64((u64)(w))
-
 /**
  * rol32 - rotate a 32-bit value left
  * @word: value to rotate
diff --git a/lib/hweight.c b/lib/hweight.c
index 63ee4eb..a6927e7 100644
--- a/lib/hweight.c
+++ b/lib/hweight.c
@@ -9,7 +9,7 @@
  * The Hamming Weight of a number is the total number of bits set in it.
  */
 
-unsigned int hweight32(unsigned int w)
+unsigned int __arch_hweight32(unsigned int w)
 {
 #ifdef ARCH_HAS_FAST_MULTIPLIER
 	w -= (w >> 1) & 0x55555555;
@@ -24,29 +24,30 @@ unsigned int hweight32(unsigned int w)
 	return (res + (res >> 16)) & 0x000000FF;
 #endif
 }
-EXPORT_SYMBOL(hweight32);
+EXPORT_SYMBOL(__arch_hweight32);
 
-unsigned int hweight16(unsigned int w)
+unsigned int __arch_hweight16(unsigned int w)
 {
 	unsigned int res = w - ((w >> 1) & 0x5555);
 	res = (res & 0x3333) + ((res >> 2) & 0x3333);
 	res = (res + (res >> 4)) & 0x0F0F;
 	return (res + (res >> 8)) & 0x00FF;
 }
-EXPORT_SYMBOL(hweight16);
+EXPORT_SYMBOL(__arch_hweight16);
 
-unsigned int hweight8(unsigned int w)
+unsigned int __arch_hweight8(unsigned int w)
 {
 	unsigned int res = w - ((w >> 1) & 0x55);
 	res = (res & 0x33) + ((res >> 2) & 0x33);
 	return (res + (res >> 4)) & 0x0F;
 }
-EXPORT_SYMBOL(hweight8);
+EXPORT_SYMBOL(__arch_hweight8);
 
-unsigned long hweight64(__u64 w)
+unsigned long __arch_hweight64(__u64 w)
 {
 #if BITS_PER_LONG == 32
-	return hweight32((unsigned int)(w >> 32)) + hweight32((unsigned int)w);
+	return __arch_hweight32((unsigned int)(w >> 32)) +
+	       __arch_hweight32((unsigned int)w);
 #elif BITS_PER_LONG == 64
 #ifdef ARCH_HAS_FAST_MULTIPLIER
 	w -= (w >> 1) & 0x5555555555555555ul;
@@ -63,4 +64,4 @@ unsigned long hweight64(__u64 w)
 #endif
 #endif
 }
-EXPORT_SYMBOL(hweight64);
+EXPORT_SYMBOL(__arch_hweight64);
-- 
1.7.0.2

-- 
Regards/Gruss,
Boris.

--
Advanced Micro Devices, Inc.
Operating Systems Research Center
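
To make the intended usage concrete, a minimal hypothetical caller (the names
MY_VALID_FLAGS, foo_stats and foo_nbits are invented for illustration and are
not part of the patch): with a constant argument the lowercase hweightN() and
hweight_long() now fold at compile time, while the uppercase HWEIGHTN()
variants additionally insist that the argument really is a compile-time
constant and therefore stay usable in constant-expression contexts such as
array sizes.

#define MY_VALID_FLAGS	0x0000001fUL		/* five flag bits, made up */

/* HWEIGHT32() is itself a constant expression, fine as an array size */
static unsigned long foo_stats[HWEIGHT32(MY_VALID_FLAGS)];

static unsigned int foo_nbits(unsigned long flags)
{
	/* constant argument: evaluated to 5 at compile time, no call emitted */
	if (hweight_long(MY_VALID_FLAGS) != 5)
		return 0;			/* dead branch, optimized away */

	/* variable argument: dispatches to __arch_hweight32()/__arch_hweight64() */
	return hweight_long(flags & MY_VALID_FLAGS);
}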

^ permalink raw reply related	[flat|nested] 81+ messages in thread

* [PATCH 2/2] x86: Add optimized popcnt variants
  2010-02-27 20:00                                                                                                 ` H. Peter Anvin
                                                                                                                     ` (2 preceding siblings ...)
  2010-03-18 11:19                                                                                                   ` [PATCH 1/2] bitops: Optimize hweight() by making use of compile-time evaluation Borislav Petkov
@ 2010-03-18 11:20                                                                                                   ` Borislav Petkov
  3 siblings, 0 replies; 81+ messages in thread
From: Borislav Petkov @ 2010-03-18 11:20 UTC (permalink / raw)
  To: H. Peter Anvin
  Cc: Michal Marek, linux-kbuild, Peter Zijlstra, Andrew Morton,
	Wu Fengguang, LKML, Jamie Lokier, Roland Dreier, Al Viro,
	linux-fsdevel@vger.kernel.org, Ingo Molnar, Brian Gerst

From: Borislav Petkov <borislav.petkov@amd.com>
Date: Fri, 5 Mar 2010 17:34:46 +0100
Subject: [PATCH 2/2] x86: Add optimized popcnt variants

Add support for the hardware version of the Hamming weight function,
popcnt, present in CPUs which advertise it under CPUID, Function
0x0000_0001_ECX[23]. On CPUs which don't support it, we fall back to the
default software versions in lib/hweight.c.

A synthetic benchmark comparing popcnt with __sw_hweight64 showed almost
a 3x speedup on an F10h machine.

Signed-off-by: Borislav Petkov <borislav.petkov@amd.com>
---
 arch/x86/Kconfig                          |    5 ++
 arch/x86/include/asm/alternative.h        |    9 +++-
 arch/x86/include/asm/arch_hweight.h       |   59 +++++++++++++++++++++++++++++
 arch/x86/include/asm/bitops.h             |    4 +-
 include/asm-generic/bitops/arch_hweight.h |   22 +++++++++--
 lib/Makefile                              |    3 +
 lib/hweight.c                             |   20 +++++-----
 scripts/Makefile.lib                      |    4 ++
 8 files changed, 108 insertions(+), 18 deletions(-)
 create mode 100644 arch/x86/include/asm/arch_hweight.h

diff --git a/arch/x86/Kconfig b/arch/x86/Kconfig
index 0eacb1f..89d8c54 100644
--- a/arch/x86/Kconfig
+++ b/arch/x86/Kconfig
@@ -238,6 +238,11 @@ config X86_32_LAZY_GS
 	def_bool y
 	depends on X86_32 && !CC_STACKPROTECTOR
 
+config ARCH_HWEIGHT_CFLAGS
+	string
+	default "-fcall-saved-ecx -fcall-saved-edx" if X86_32
+	default "-fcall-saved-rdi -fcall-saved-rsi -fcall-saved-rdx -fcall-saved-rcx -fcall-saved-r8 -fcall-saved-r9 -fcall-saved-r10 -fcall-saved-r11" if X86_64
+
 config KTIME_SCALAR
 	def_bool X86_32
 source "init/Kconfig"
diff --git a/arch/x86/include/asm/alternative.h b/arch/x86/include/asm/alternative.h
index b09ec55..67dae51 100644
--- a/arch/x86/include/asm/alternative.h
+++ b/arch/x86/include/asm/alternative.h
@@ -39,9 +39,6 @@
 #define LOCK_PREFIX ""
 #endif
 
-/* This must be included *after* the definition of LOCK_PREFIX */
-#include <asm/cpufeature.h>
-
 struct alt_instr {
 	u8 *instr;		/* original instruction */
 	u8 *replacement;
@@ -96,6 +93,12 @@ static inline int alternatives_text_reserved(void *start, void *end)
       ".previous"
 
 /*
+ * This must be included *after* the definition of ALTERNATIVE due to
+ * <asm/arch_hweight.h>
+ */
+#include <asm/cpufeature.h>
+
+/*
  * Alternative instructions for different CPU types or capabilities.
  *
  * This allows to use optimized instructions even on generic binary
diff --git a/arch/x86/include/asm/arch_hweight.h b/arch/x86/include/asm/arch_hweight.h
new file mode 100644
index 0000000..d1fc3c2
--- /dev/null
+++ b/arch/x86/include/asm/arch_hweight.h
@@ -0,0 +1,59 @@
+#ifndef _ASM_X86_HWEIGHT_H
+#define _ASM_X86_HWEIGHT_H
+
+#ifdef CONFIG_64BIT
+/* popcnt %rdi, %rax */
+#define POPCNT ".byte 0xf3,0x48,0x0f,0xb8,0xc7"
+#define REG_IN "D"
+#define REG_OUT "a"
+#else
+/* popcnt %eax, %eax */
+#define POPCNT ".byte 0xf3,0x0f,0xb8,0xc0"
+#define REG_IN "a"
+#define REG_OUT "a"
+#endif
+
+/*
+ * __sw_hweightXX are called from within the alternatives below
+ * and callee-clobbered registers need to be taken care of. See
+ * ARCH_HWEIGHT_CFLAGS in <arch/x86/Kconfig> for the respective
+ * compiler switches.
+ */
+static inline unsigned int __arch_hweight32(unsigned int w)
+{
+	unsigned int res = 0;
+
+	asm (ALTERNATIVE("call __sw_hweight32", POPCNT, X86_FEATURE_POPCNT)
+		     : "="REG_OUT (res)
+		     : REG_IN (w));
+
+	return res;
+}
+
+static inline unsigned int __arch_hweight16(unsigned int w)
+{
+	return __arch_hweight32(w & 0xffff);
+}
+
+static inline unsigned int __arch_hweight8(unsigned int w)
+{
+	return __arch_hweight32(w & 0xff);
+}
+
+static inline unsigned long __arch_hweight64(__u64 w)
+{
+	unsigned long res = 0;
+
+#ifdef CONFIG_X86_32
+	return  __arch_hweight32((u32)w) +
+		__arch_hweight32((u32)(w >> 32));
+#else
+	asm (ALTERNATIVE("call __sw_hweight64", POPCNT, X86_FEATURE_POPCNT)
+		     : "="REG_OUT (res)
+		     : REG_IN (w));
+#endif /* CONFIG_X86_32 */
+
+	return res;
+}
+
+#endif
diff --git a/arch/x86/include/asm/bitops.h b/arch/x86/include/asm/bitops.h
index 02b47a6..545776e 100644
--- a/arch/x86/include/asm/bitops.h
+++ b/arch/x86/include/asm/bitops.h
@@ -444,7 +444,9 @@ static inline int fls(int x)
 
 #define ARCH_HAS_FAST_MULTIPLIER 1
 
-#include <asm-generic/bitops/hweight.h>
+#include <asm/arch_hweight.h>
+
+#include <asm-generic/bitops/const_hweight.h>
 
 #endif /* __KERNEL__ */
 
diff --git a/include/asm-generic/bitops/arch_hweight.h b/include/asm-generic/bitops/arch_hweight.h
index 3a7be84..9a81c1e 100644
--- a/include/asm-generic/bitops/arch_hweight.h
+++ b/include/asm-generic/bitops/arch_hweight.h
@@ -3,9 +3,23 @@
 
 #include <asm/types.h>
 
-extern unsigned int __arch_hweight32(unsigned int w);
-extern unsigned int __arch_hweight16(unsigned int w);
-extern unsigned int __arch_hweight8(unsigned int w);
-extern unsigned long __arch_hweight64(__u64 w);
+static inline unsigned int __arch_hweight32(unsigned int w)
+{
+	return __sw_hweight32(w);
+}
 
+static inline unsigned int __arch_hweight16(unsigned int w)
+{
+	return __sw_hweight16(w);
+}
+
+static inline unsigned int __arch_hweight8(unsigned int w)
+{
+	return __sw_hweight8(w);
+}
+
+static inline unsigned long __arch_hweight64(__u64 w)
+{
+	return __sw_hweight64(w);
+}
 #endif /* _ASM_GENERIC_BITOPS_ARCH_HWEIGHT_H_ */
diff --git a/lib/Makefile b/lib/Makefile
index 2e152ae..abe63a8 100644
--- a/lib/Makefile
+++ b/lib/Makefile
@@ -39,7 +39,10 @@ lib-$(CONFIG_RWSEM_XCHGADD_ALGORITHM) += rwsem.o
 lib-$(CONFIG_GENERIC_FIND_FIRST_BIT) += find_next_bit.o
 lib-$(CONFIG_GENERIC_FIND_NEXT_BIT) += find_next_bit.o
 obj-$(CONFIG_GENERIC_FIND_LAST_BIT) += find_last_bit.o
+
+CFLAGS_hweight.o = $(subst $(quote),,$(CONFIG_ARCH_HWEIGHT_CFLAGS))
 obj-$(CONFIG_GENERIC_HWEIGHT) += hweight.o
+
 obj-$(CONFIG_LOCK_KERNEL) += kernel_lock.o
 obj-$(CONFIG_BTREE) += btree.o
 obj-$(CONFIG_DEBUG_PREEMPT) += smp_processor_id.o
diff --git a/lib/hweight.c b/lib/hweight.c
index a6927e7..3c79d50 100644
--- a/lib/hweight.c
+++ b/lib/hweight.c
@@ -9,7 +9,7 @@
  * The Hamming Weight of a number is the total number of bits set in it.
  */
 
-unsigned int __arch_hweight32(unsigned int w)
+unsigned int __sw_hweight32(unsigned int w)
 {
 #ifdef ARCH_HAS_FAST_MULTIPLIER
 	w -= (w >> 1) & 0x55555555;
@@ -24,30 +24,30 @@ unsigned int __arch_hweight32(unsigned int w)
 	return (res + (res >> 16)) & 0x000000FF;
 #endif
 }
-EXPORT_SYMBOL(__arch_hweight32);
+EXPORT_SYMBOL(__sw_hweight32);
 
-unsigned int __arch_hweight16(unsigned int w)
+unsigned int __sw_hweight16(unsigned int w)
 {
 	unsigned int res = w - ((w >> 1) & 0x5555);
 	res = (res & 0x3333) + ((res >> 2) & 0x3333);
 	res = (res + (res >> 4)) & 0x0F0F;
 	return (res + (res >> 8)) & 0x00FF;
 }
-EXPORT_SYMBOL(__arch_hweight16);
+EXPORT_SYMBOL(__sw_hweight16);
 
-unsigned int __arch_hweight8(unsigned int w)
+unsigned int __sw_hweight8(unsigned int w)
 {
 	unsigned int res = w - ((w >> 1) & 0x55);
 	res = (res & 0x33) + ((res >> 2) & 0x33);
 	return (res + (res >> 4)) & 0x0F;
 }
-EXPORT_SYMBOL(__arch_hweight8);
+EXPORT_SYMBOL(__sw_hweight8);
 
-unsigned long __arch_hweight64(__u64 w)
+unsigned long __sw_hweight64(__u64 w)
 {
 #if BITS_PER_LONG == 32
-	return __arch_hweight32((unsigned int)(w >> 32)) +
-	       __arch_hweight32((unsigned int)w);
+	return __sw_hweight32((unsigned int)(w >> 32)) +
+	       __sw_hweight32((unsigned int)w);
 #elif BITS_PER_LONG == 64
 #ifdef ARCH_HAS_FAST_MULTIPLIER
 	w -= (w >> 1) & 0x5555555555555555ul;
@@ -64,4 +64,4 @@ unsigned long __arch_hweight64(__u64 w)
 #endif
 #endif
 }
-EXPORT_SYMBOL(__arch_hweight64);
+EXPORT_SYMBOL(__sw_hweight64);
diff --git a/scripts/Makefile.lib b/scripts/Makefile.lib
index f9bdf26..cbcd654 100644
--- a/scripts/Makefile.lib
+++ b/scripts/Makefile.lib
@@ -245,3 +245,7 @@ quiet_cmd_lzo = LZO    $@
 cmd_lzo = (cat $(filter-out FORCE,$^) | \
 	lzop -9 && $(call size_append, $(filter-out FORCE,$^))) > $@ || \
 	(rm -f $@ ; false)
+
+# misc stuff
+# ---------------------------------------------------------------------------
+quote:="
-- 
1.7.0.2

-- 
Regards/Gruss,
Boris.

--
Advanced Micro Devices, Inc.
Operating Systems Research Center
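
For reference, the CPUID bit the alternative keys off of can be inspected
directly: function 0x0000_0001 reports POPCNT support in ECX bit 23, which the
kernel exposes as X86_FEATURE_POPCNT and which ALTERNATIVE() uses at boot to
patch the "call __sw_hweight*" sites into the popcnt instruction. A
stand-alone user-space sketch of the same check (hypothetical example,
assuming GCC's <cpuid.h>; not part of the patch):

#include <stdio.h>
#include <cpuid.h>		/* GCC helper for the cpuid instruction */

int main(void)
{
	unsigned int eax, ebx, ecx, edx;

	/* CPUID function 0x0000_0001: feature flags returned in ECX/EDX */
	if (!__get_cpuid(1, &eax, &ebx, &ecx, &edx))
		return 1;

	/* ECX bit 23 advertises popcnt */
	printf("popcnt %ssupported\n", (ecx & (1u << 23)) ? "" : "not ");
	return 0;
}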

^ permalink raw reply related	[flat|nested] 81+ messages in thread

end of thread, other threads:[~2010-03-18 11:20 UTC | newest]

Thread overview: 81+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2010-01-30  9:45 [PATCH 0/5] [RESEND] FMODE_NONOTIFY and FMODE_NEG_OFFSET bits Wu Fengguang
2010-01-30  9:45 ` [PATCH 1/5] fanotify: fix FMODE_NONOTIFY bit number Wu Fengguang
2010-02-01 20:44   ` Andrew Morton
2010-01-30  9:45 ` [PATCH 2/5] bitops: compile time optimization for hweight_long(CONSTANT) Wu Fengguang
2010-02-01 20:48   ` Andrew Morton
2010-02-03 13:39     ` Wu Fengguang
2010-02-03 15:08       ` Andrew Morton
2010-02-03 15:15         ` Peter Zijlstra
2010-02-03 15:42           ` Andrew Morton
2010-02-03 15:47             ` Peter Zijlstra
2010-02-03 17:11               ` H. Peter Anvin
2010-02-03 18:14             ` Borislav Petkov
2010-02-03 18:47               ` Peter Zijlstra
2010-02-03 19:49                 ` H. Peter Anvin
2010-02-04 15:10                   ` Borislav Petkov
2010-02-04 15:13                     ` Peter Zijlstra
2010-02-04 15:54                       ` Borislav Petkov
2010-02-04 16:04                         ` Peter Zijlstra
2010-02-05 12:11                           ` Borislav Petkov
2010-02-05 12:14                             ` Peter Zijlstra
2010-02-05 21:54                             ` H. Peter Anvin
2010-02-06  9:36                               ` Borislav Petkov
2010-02-07  1:55                                 ` H. Peter Anvin
2010-02-08  9:28                                   ` Borislav Petkov
2010-02-08  9:35                                     ` H. Peter Anvin
2010-02-08  9:59                                       ` Borislav Petkov
2010-02-11 17:24                                         ` Borislav Petkov
2010-02-11 17:33                                           ` H. Peter Anvin
2010-02-12 17:06                                             ` Borislav Petkov
2010-02-12 17:28                                               ` H. Peter Anvin
2010-02-12 17:47                                                 ` Borislav Petkov
2010-02-12 19:05                                                   ` H. Peter Anvin
2010-02-17 13:57                                                     ` Michal Marek
2010-02-17 17:20                                                       ` Borislav Petkov
2010-02-17 17:31                                                         ` Michal Marek
2010-02-17 17:34                                                           ` Borislav Petkov
2010-02-17 17:39                                                           ` Michal Marek
2010-02-18  6:19                                                             ` Borislav Petkov
2010-02-19 14:22                                                               ` [PATCH] x86: Add optimized popcnt variants Borislav Petkov
2010-02-19 16:06                                                                 ` H. Peter Anvin
2010-02-19 16:45                                                                   ` Borislav Petkov
2010-02-19 16:53                                                                     ` H. Peter Anvin
2010-02-22 14:17                                                                       ` Borislav Petkov
2010-02-22 17:21                                                                         ` H. Peter Anvin
2010-02-22 18:49                                                                           ` Borislav Petkov
2010-02-22 19:55                                                                             ` H. Peter Anvin
2010-02-23  6:37                                                                               ` Borislav Petkov
2010-02-23 15:58                                                                               ` Borislav Petkov
2010-02-23 17:34                                                                                 ` H. Peter Anvin
2010-02-23 17:54                                                                                   ` Borislav Petkov
2010-02-23 18:17                                                                                     ` H. Peter Anvin
2010-02-23 19:06                                                                                       ` Borislav Petkov
2010-02-26  5:27                                                                                         ` H. Peter Anvin
2010-02-26  7:47                                                                                           ` Borislav Petkov
2010-02-26 17:48                                                                                             ` H. Peter Anvin
2010-02-27  8:28                                                                                               ` Borislav Petkov
2010-02-27 20:00                                                                                                 ` H. Peter Anvin
2010-03-09 15:36                                                                                                   ` Borislav Petkov
2010-03-09 15:50                                                                                                     ` Peter Zijlstra
2010-03-09 16:23                                                                                                       ` Borislav Petkov
2010-03-09 16:32                                                                                                         ` Peter Zijlstra
2010-03-09 17:32                                                                                                           ` Borislav Petkov
2010-03-09 17:37                                                                                                             ` Peter Zijlstra
2010-03-18 11:17                                                                                                   ` Borislav Petkov
2010-03-18 11:19                                                                                                   ` [PATCH 1/2] bitops: Optimize hweight() by making use of compile-time evaluation Borislav Petkov
2010-03-18 11:20                                                                                                   ` [PATCH 2/2] x86: Add optimized popcnt variants Borislav Petkov
2010-02-18 10:51                                                       ` [PATCH 2/5] bitops: compile time optimization for hweight_long(CONSTANT) Peter Zijlstra
2010-02-18 11:51                                                         ` Borislav Petkov
2010-02-14 10:12                                           ` Peter Zijlstra
2010-02-14 11:24                                             ` Borislav Petkov
2010-02-14 12:23                                               ` Peter Zijlstra
2010-02-14 14:19                                                 ` Borislav Petkov
2010-02-14 18:36                                               ` H. Peter Anvin
2010-02-14 20:28                                                 ` Borislav Petkov
2010-02-14 22:13                                                   ` H. Peter Anvin
2010-02-04 15:16                     ` H. Peter Anvin
2010-02-04 15:39                     ` Brian Gerst
2010-02-03 17:10       ` H. Peter Anvin
2010-01-30  9:45 ` [PATCH 3/5] vfs: O_* bit numbers uniqueness check Wu Fengguang
2010-01-30  9:45 ` [PATCH 4/5] vfs: introduce FMODE_NEG_OFFSET for allowing negative f_pos Wu Fengguang
2010-01-30  9:45 ` [PATCH 5/5] devmem: dont allow seek to last page Wu Fengguang

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).