[thisops uV2 00/10] Upgrade of this_cpu

public inbox for linux-kernel@vger.kernel.org
 help / color / mirror / Atom feed

* [thisops uV2 00/10] Upgrade of this_cpu_ops V2
@ 2010-11-26 21:09 Christoph Lameter
  2010-11-26 21:09 ` [thisops uV2 01/10] percpucounter: Optimize __percpu_counter_add a bit through the use of this_cpu() options Christoph Lameter
                   ` (9 more replies)
  0 siblings, 10 replies; 36+ messages in thread
From: Christoph Lameter @ 2010-11-26 21:09 UTC (permalink / raw)
  To: akpm; +Cc: Pekka Enberg, linux-kernel, Eric Dumazet, Mathieu Desnoyers,
	Tejun Heo

A patchset that adds more this_cpu operations and in particular RMV operations
that can be used in various places to avoid address calculations and
memory accesses.

Also adds this_cpu_cmpxchg_double() which is a locally atomic version of
cmpxchg and uses that to demo how a lockless, preemptless fastpath for
memory allocation could work. Works good enough so that I can write this
email with that fastpath scheme.

V2 has several enhancements and bugfixes that were suggested.

^ permalink raw reply	[flat|nested] 36+ messages in thread

* [thisops uV2 01/10] percpucounter: Optimize __percpu_counter_add a bit through the use of this_cpu() options.
  2010-11-26 21:09 [thisops uV2 00/10] Upgrade of this_cpu_ops V2 Christoph Lameter
@ 2010-11-26 21:09 ` Christoph Lameter
  2010-11-27 14:42   ` Mathieu Desnoyers
  2010-11-26 21:09 ` [thisops uV2 02/10] vmstat: Optimize zone counter modifications through the use of this cpu operations Christoph Lameter
                   ` (8 subsequent siblings)
  9 siblings, 1 reply; 36+ messages in thread
From: Christoph Lameter @ 2010-11-26 21:09 UTC (permalink / raw)
  To: akpm; +Cc: Pekka Enberg, linux-kernel, Eric Dumazet, Mathieu Desnoyers,
	Tejun Heo

[-- Attachment #1: percpu_fixes --]
[-- Type: text/plain, Size: 4235 bytes --]

The this_cpu_* options can be used to optimize __percpu_counter_add a bit. Avoids
some address arithmetic and saves 12 bytes.

Before:


00000000000001d3 <__percpu_counter_add>:
 1d3:	55                   	push   %rbp
 1d4:	48 89 e5             	mov    %rsp,%rbp
 1d7:	41 55                	push   %r13
 1d9:	41 54                	push   %r12
 1db:	53                   	push   %rbx
 1dc:	48 89 fb             	mov    %rdi,%rbx
 1df:	48 83 ec 08          	sub    $0x8,%rsp
 1e3:	4c 8b 67 30          	mov    0x30(%rdi),%r12
 1e7:	65 4c 03 24 25 00 00 	add    %gs:0x0,%r12
 1ee:	00 00 
 1f0:	4d 63 2c 24          	movslq (%r12),%r13
 1f4:	48 63 c2             	movslq %edx,%rax
 1f7:	49 01 f5             	add    %rsi,%r13
 1fa:	49 39 c5             	cmp    %rax,%r13
 1fd:	7d 0a                	jge    209 <__percpu_counter_add+0x36>
 1ff:	f7 da                	neg    %edx
 201:	48 63 d2             	movslq %edx,%rdx
 204:	49 39 d5             	cmp    %rdx,%r13
 207:	7f 1e                	jg     227 <__percpu_counter_add+0x54>
 209:	48 89 df             	mov    %rbx,%rdi
 20c:	e8 00 00 00 00       	callq  211 <__percpu_counter_add+0x3e>
 211:	4c 01 6b 18          	add    %r13,0x18(%rbx)
 215:	48 89 df             	mov    %rbx,%rdi
 218:	41 c7 04 24 00 00 00 	movl   $0x0,(%r12)
 21f:	00 
 220:	e8 00 00 00 00       	callq  225 <__percpu_counter_add+0x52>
 225:	eb 04                	jmp    22b <__percpu_counter_add+0x58>
 227:	45 89 2c 24          	mov    %r13d,(%r12)
 22b:	5b                   	pop    %rbx
 22c:	5b                   	pop    %rbx
 22d:	41 5c                	pop    %r12
 22f:	41 5d                	pop    %r13
 231:	c9                   	leaveq 
 232:	c3                   	retq   


After:

00000000000001d3 <__percpu_counter_add>:
 1d3:	55                   	push   %rbp
 1d4:	48 63 ca             	movslq %edx,%rcx
 1d7:	48 89 e5             	mov    %rsp,%rbp
 1da:	41 54                	push   %r12
 1dc:	53                   	push   %rbx
 1dd:	48 89 fb             	mov    %rdi,%rbx
 1e0:	48 8b 47 30          	mov    0x30(%rdi),%rax
 1e4:	65 44 8b 20          	mov    %gs:(%rax),%r12d
 1e8:	4d 63 e4             	movslq %r12d,%r12
 1eb:	49 01 f4             	add    %rsi,%r12
 1ee:	49 39 cc             	cmp    %rcx,%r12
 1f1:	7d 0a                	jge    1fd <__percpu_counter_add+0x2a>
 1f3:	f7 da                	neg    %edx
 1f5:	48 63 d2             	movslq %edx,%rdx
 1f8:	49 39 d4             	cmp    %rdx,%r12
 1fb:	7f 21                	jg     21e <__percpu_counter_add+0x4b>
 1fd:	48 89 df             	mov    %rbx,%rdi
 200:	e8 00 00 00 00       	callq  205 <__percpu_counter_add+0x32>
 205:	4c 01 63 18          	add    %r12,0x18(%rbx)
 209:	48 8b 43 30          	mov    0x30(%rbx),%rax
 20d:	48 89 df             	mov    %rbx,%rdi
 210:	65 c7 00 00 00 00 00 	movl   $0x0,%gs:(%rax)
 217:	e8 00 00 00 00       	callq  21c <__percpu_counter_add+0x49>
 21c:	eb 04                	jmp    222 <__percpu_counter_add+0x4f>
 21e:	65 44 89 20          	mov    %r12d,%gs:(%rax)
 222:	5b                   	pop    %rbx
 223:	41 5c                	pop    %r12
 225:	c9                   	leaveq 
 226:	c3                   	retq   

Reviewed-by: Pekka Enberg <penberg@kernel.org>
Reviewed-by: Tejun Heo <tj@kernel.org>
Signed-off-by: Christoph Lameter <cl@linux.com>

---
 lib/percpu_counter.c |    8 +++-----
 1 file changed, 3 insertions(+), 5 deletions(-)

Index: linux-2.6/lib/percpu_counter.c
===================================================================
--- linux-2.6.orig/lib/percpu_counter.c	2010-11-03 12:25:37.000000000 -0500
+++ linux-2.6/lib/percpu_counter.c	2010-11-03 12:27:27.000000000 -0500
@@ -72,18 +72,16 @@ EXPORT_SYMBOL(percpu_counter_set);
 void __percpu_counter_add(struct percpu_counter *fbc, s64 amount, s32 batch)
 {
 	s64 count;
-	s32 *pcount;
 
 	preempt_disable();
-	pcount = this_cpu_ptr(fbc->counters);
-	count = *pcount + amount;
+	count = __this_cpu_read(*fbc->counters) + amount;
 	if (count >= batch || count <= -batch) {
 		spin_lock(&fbc->lock);
 		fbc->count += count;
-		*pcount = 0;
+		__this_cpu_write(*fbc->counters, 0);
 		spin_unlock(&fbc->lock);
 	} else {
-		*pcount = count;
+		__this_cpu_write(*fbc->counters, count);
 	}
 	preempt_enable();
 }


^ permalink raw reply	[flat|nested] 36+ messages in thread

* [thisops uV2 02/10] vmstat: Optimize zone counter modifications through the use of this cpu operations
  2010-11-26 21:09 [thisops uV2 00/10] Upgrade of this_cpu_ops V2 Christoph Lameter
  2010-11-26 21:09 ` [thisops uV2 01/10] percpucounter: Optimize __percpu_counter_add a bit through the use of this_cpu() options Christoph Lameter
@ 2010-11-26 21:09 ` Christoph Lameter
  2010-11-27  8:00   ` Pekka Enberg
  2010-11-27 14:49   ` Mathieu Desnoyers
  2010-11-26 21:09 ` [thisops uV2 03/10] percpu: Generic support for this_cpu_add,sub,dec,inc_return Christoph Lameter
                   ` (7 subsequent siblings)
  9 siblings, 2 replies; 36+ messages in thread
From: Christoph Lameter @ 2010-11-26 21:09 UTC (permalink / raw)
  To: akpm; +Cc: Pekka Enberg, linux-kernel, Eric Dumazet, Mathieu Desnoyers,
	Tejun Heo

[-- Attachment #1: vmstat_this_cpu --]
[-- Type: text/plain, Size: 3080 bytes --]

this cpu operations can be used to slightly optimize the function. The
changes will avoid some address calculations and replace them with the
use of the percpu segment register.

If one would have this_cpu_inc_return and this_cpu_dec_return then it
would be possible to optimize inc_zone_page_state and dec_zone_page_state even
more.

V1->V2:
	- Fix __dec_zone_state overflow handling
	- Use s8 variables for temporary storage.

Signed-off-by: Christoph Lameter <cl@linux.com>

---
 mm/vmstat.c |   56 ++++++++++++++++++++++++++++++++------------------------
 1 file changed, 32 insertions(+), 24 deletions(-)

Index: linux-2.6/mm/vmstat.c
===================================================================
--- linux-2.6.orig/mm/vmstat.c	2010-11-24 13:38:34.000000000 -0600
+++ linux-2.6/mm/vmstat.c	2010-11-24 15:03:08.000000000 -0600
@@ -167,18 +167,20 @@ static void refresh_zone_stat_thresholds
 void __mod_zone_page_state(struct zone *zone, enum zone_stat_item item,
 				int delta)
 {
-	struct per_cpu_pageset *pcp = this_cpu_ptr(zone->pageset);
-
-	s8 *p = pcp->vm_stat_diff + item;
+	struct per_cpu_pageset * __percpu pcp = zone->pageset;
+	s8 * __percpu p = pcp->vm_stat_diff + item;
 	long x;
+	long t;
+
+	x = delta + __this_cpu_read(*p);
 
-	x = delta + *p;
+	t = __this_cpu_read(pcp->stat_threshold);
 
-	if (unlikely(x > pcp->stat_threshold || x < -pcp->stat_threshold)) {
+	if (unlikely(x > t || x < -t)) {
 		zone_page_state_add(x, zone, item);
 		x = 0;
 	}
-	*p = x;
+	__this_cpu_write(*p, x);
 }
 EXPORT_SYMBOL(__mod_zone_page_state);
 
@@ -221,16 +223,19 @@ EXPORT_SYMBOL(mod_zone_page_state);
  */
 void __inc_zone_state(struct zone *zone, enum zone_stat_item item)
 {
-	struct per_cpu_pageset *pcp = this_cpu_ptr(zone->pageset);
-	s8 *p = pcp->vm_stat_diff + item;
-
-	(*p)++;
+	struct per_cpu_pageset * __percpu pcp = zone->pageset;
+	s8 * __percpu p = pcp->vm_stat_diff + item;
+	s8 v, t;
+
+	__this_cpu_inc(*p);
+
+	v = __this_cpu_read(*p);
+	t = __this_cpu_read(pcp->stat_threshold);
+	if (unlikely(v > t)) {
+		s8 overstep = t >> 1;
 
-	if (unlikely(*p > pcp->stat_threshold)) {
-		int overstep = pcp->stat_threshold / 2;
-
-		zone_page_state_add(*p + overstep, zone, item);
-		*p = -overstep;
+		zone_page_state_add(v + overstep, zone, item);
+		__this_cpu_write(*p, - overstep);
 	}
 }
 
@@ -242,16 +247,19 @@ EXPORT_SYMBOL(__inc_zone_page_state);
 
 void __dec_zone_state(struct zone *zone, enum zone_stat_item item)
 {
-	struct per_cpu_pageset *pcp = this_cpu_ptr(zone->pageset);
-	s8 *p = pcp->vm_stat_diff + item;
-
-	(*p)--;
-
-	if (unlikely(*p < - pcp->stat_threshold)) {
-		int overstep = pcp->stat_threshold / 2;
+	struct per_cpu_pageset * __percpu pcp = zone->pageset;
+	s8 * __percpu p = pcp->vm_stat_diff + item;
+	s8 v, t;
+
+	__this_cpu_dec(*p);
+
+	v = __this_cpu_read(*p);
+	t = __this_cpu_read(pcp->stat_threshold);
+	if (unlikely(v < - t)) {
+		s8 overstep = t >> 1;
 
-		zone_page_state_add(*p - overstep, zone, item);
-		*p = overstep;
+		zone_page_state_add(v - overstep, zone, item);
+		__this_cpu_write(*p, overstep);
 	}
 }
 


^ permalink raw reply	[flat|nested] 36+ messages in thread

* [thisops uV2 03/10] percpu: Generic support for this_cpu_add,sub,dec,inc_return
  2010-11-26 21:09 [thisops uV2 00/10] Upgrade of this_cpu_ops V2 Christoph Lameter
  2010-11-26 21:09 ` [thisops uV2 01/10] percpucounter: Optimize __percpu_counter_add a bit through the use of this_cpu() options Christoph Lameter
  2010-11-26 21:09 ` [thisops uV2 02/10] vmstat: Optimize zone counter modifications through the use of this cpu operations Christoph Lameter
@ 2010-11-26 21:09 ` Christoph Lameter
  2010-11-27 14:58   ` Mathieu Desnoyers
  2010-11-26 21:09 ` [thisops uV2 04/10] x86: Support " Christoph Lameter
                   ` (6 subsequent siblings)
  9 siblings, 1 reply; 36+ messages in thread
From: Christoph Lameter @ 2010-11-26 21:09 UTC (permalink / raw)
  To: akpm; +Cc: Pekka Enberg, linux-kernel, Eric Dumazet, Mathieu Desnoyers,
	Tejun Heo

[-- Attachment #1: this_cpu_add_dec_return --]
[-- Type: text/plain, Size: 3712 bytes --]

Introduce generic support for this_cpu_add_return etc.

The fallback is to realize these operations with __this_cpu_ops.

Reviewed-by: Tejun Heo <tj@kernel.org>
Signed-off-by: Christoph Lameter <cl@linux.com>

---
 include/linux/percpu.h |   70 +++++++++++++++++++++++++++++++++++++++++++++++++
 1 file changed, 70 insertions(+)

Index: linux-2.6/include/linux/percpu.h
===================================================================
--- linux-2.6.orig/include/linux/percpu.h	2010-11-23 17:29:46.000000000 -0600
+++ linux-2.6/include/linux/percpu.h	2010-11-23 17:31:14.000000000 -0600
@@ -240,6 +240,20 @@ extern void __bad_size_call_parameter(vo
 	pscr_ret__;							\
 })
 
+#define __pcpu_size_call_return2(stem, variable, ...)			\
+({	typeof(variable) pscr_ret__;					\
+	__verify_pcpu_ptr(&(variable));					\
+	switch(sizeof(variable)) {					\
+	case 1: pscr_ret__ = stem##1(variable, __VA_ARGS__);break;	\
+	case 2: pscr_ret__ = stem##2(variable, __VA_ARGS__);break;	\
+	case 4: pscr_ret__ = stem##4(variable, __VA_ARGS__);break;	\
+	case 8: pscr_ret__ = stem##8(variable, __VA_ARGS__);break;	\
+	default:							\
+		__bad_size_call_parameter();break;			\
+	}								\
+	pscr_ret__;							\
+})
+
 #define __pcpu_size_call(stem, variable, ...)				\
 do {									\
 	__verify_pcpu_ptr(&(variable));					\
@@ -529,6 +543,62 @@ do {									\
 # define __this_cpu_xor(pcp, val)	__pcpu_size_call(__this_cpu_xor_, (pcp), (val))
 #endif
 
+#define _this_cpu_generic_add_return(pcp, val)				\
+({	typeof(pcp) ret__;						\
+	preempt_disable();						\
+	__this_cpu_add((pcp), val);					\
+	ret__ = __this_cpu_read((pcp));					\
+	preempt_enable();						\
+	ret__;								\
+})
+
+#ifndef this_cpu_add_return
+# ifndef this_cpu_add_return_1
+#  define this_cpu_add_return_1(pcp, val)	_this_cpu_generic_add_return(pcp, val)
+# endif
+# ifndef this_cpu_add_return_2
+#  define this_cpu_add_return_2(pcp, val)	_this_cpu_generic_add_return(pcp, val)
+# endif
+# ifndef this_cpu_add_return_4
+#  define this_cpu_add_return_4(pcp, val)	_this_cpu_generic_add_return(pcp, val)
+# endif
+# ifndef this_cpu_add_return_8
+#  define this_cpu_add_return_8(pcp, val)	_this_cpu_generic_add_return(pcp, val)
+# endif
+# define this_cpu_add_return(pcp, val)	__pcpu_size_call_return2(this_cpu_add_return_, (pcp), val)
+#endif
+
+#define this_cpu_sub_return(pcp, val)	this_cpu_add_return(pcp, -(val))
+#define this_cpu_inc_return(pcp)	this_cpu_add_return(pcp, 1)
+#define this_cpu_dec_return(pcp)	this_cpu_add_return(pcp, -1)
+
+#define __this_cpu_generic_add_return(pcp, val)				\
+({	typeof(pcp) ret__;						\
+	__this_cpu_add((pcp), val);					\
+	ret__ = __this_cpu_read((pcp));					\
+	ret__;								\
+})
+
+#ifndef __this_cpu_add_return
+# ifndef __this_cpu_add_return_1
+#  define __this_cpu_add_return_1(pcp, val)	__this_cpu_generic_add_return(pcp, val)
+# endif
+# ifndef __this_cpu_add_return_2
+#  define __this_cpu_add_return_2(pcp, val)	__this_cpu_generic_add_return(pcp, val)
+# endif
+# ifndef __this_cpu_add_return_4
+#  define __this_cpu_add_return_4(pcp, val)	__this_cpu_generic_add_return(pcp, val)
+# endif
+# ifndef __this_cpu_add_return_8
+#  define __this_cpu_add_return_8(pcp, val)	__this_cpu_generic_add_return(pcp, val)
+# endif
+# define __this_cpu_add_return(pcp, val)	__pcpu_size_call_return2(this_cpu_add_return_, (pcp), val)
+#endif
+
+#define __this_cpu_sub_return(pcp, val)	this_cpu_add_return(pcp, -(val))
+#define __this_cpu_inc_return(pcp)	this_cpu_add_return(pcp, 1)
+#define __this_cpu_dec_return(pcp)	this_cpu_add_return(pcp, -1)
+
 /*
  * IRQ safe versions of the per cpu RMW operations. Note that these operations
  * are *not* safe against modification of the same variable from another


^ permalink raw reply	[flat|nested] 36+ messages in thread

* [thisops uV2 04/10] x86: Support for this_cpu_add,sub,dec,inc_return
  2010-11-26 21:09 [thisops uV2 00/10] Upgrade of this_cpu_ops V2 Christoph Lameter
                   ` (2 preceding siblings ...)
  2010-11-26 21:09 ` [thisops uV2 03/10] percpu: Generic support for this_cpu_add,sub,dec,inc_return Christoph Lameter
@ 2010-11-26 21:09 ` Christoph Lameter
  2010-11-27  8:06   ` Pekka Enberg
  2010-11-27 15:00   ` Mathieu Desnoyers
  2010-11-26 21:09 ` [thisops uV2 05/10] x86: Use this_cpu_inc_return for nmi counter Christoph Lameter
                   ` (5 subsequent siblings)
  9 siblings, 2 replies; 36+ messages in thread
From: Christoph Lameter @ 2010-11-26 21:09 UTC (permalink / raw)
  To: akpm; +Cc: Pekka Enberg, linux-kernel, Eric Dumazet, Mathieu Desnoyers,
	Tejun Heo

[-- Attachment #1: this_cpu_add_x86 --]
[-- Type: text/plain, Size: 2865 bytes --]

Supply an implementation for x86 in order to generate more efficient code.

Signed-off-by: Christoph Lameter <cl@linux.com>

---
 arch/x86/include/asm/percpu.h |   50 ++++++++++++++++++++++++++++++++++++++++++
 1 file changed, 50 insertions(+)

Index: linux-2.6/arch/x86/include/asm/percpu.h
===================================================================
--- linux-2.6.orig/arch/x86/include/asm/percpu.h	2010-11-23 16:35:19.000000000 -0600
+++ linux-2.6/arch/x86/include/asm/percpu.h	2010-11-23 16:36:05.000000000 -0600
@@ -177,6 +177,45 @@ do {									\
 	}								\
 } while (0)
 
+
+/*
+ * Add return operation
+ */
+#define percpu_add_return_op(var, val)					\
+({									\
+	typedef typeof(var) pao_T__;					\
+	typeof(var) pfo_ret__ = val;					\
+	if (0) {							\
+		pao_T__ pao_tmp__;					\
+		pao_tmp__ = (val);					\
+		(void)pao_tmp__;					\
+	}								\
+	switch (sizeof(var)) {						\
+	case 1:								\
+		asm("xaddb %0, "__percpu_arg(1)				\
+			    : "+q" (pfo_ret__), "+m" (var)		\
+			    : : "memory");				\
+		break;							\
+	case 2:								\
+		asm("xaddw %0, "__percpu_arg(1)				\
+			    : "+r" (pfo_ret__), "+m" (var)		\
+			    : : "memory");				\
+		break;							\
+	case 4:								\
+		asm("xaddl %0, "__percpu_arg(1)				\
+			    : "+r"(pfo_ret__), "+m" (var)		\
+			    : : "memory");				\
+		break;							\
+	case 8:								\
+		asm("xaddq %0, "__percpu_arg(1)				\
+			    : "+re" (pfo_ret__),  "+m" (var)		\
+			    : : "memory");				\
+		break;							\
+	default: __bad_percpu_size();					\
+	}								\
+	pfo_ret__ + (val);						\
+})
+
 #define percpu_from_op(op, var, constraint)		\
 ({							\
 	typeof(var) pfo_ret__;				\
@@ -300,6 +339,14 @@ do {									\
 #define irqsafe_cpu_xor_2(pcp, val)	percpu_to_op("xor", (pcp), val)
 #define irqsafe_cpu_xor_4(pcp, val)	percpu_to_op("xor", (pcp), val)
 
+#ifndef CONFIG_M386
+#define __this_cpu_add_return_1(pcp, val)	percpu_add_return_op((pcp), val)
+#define __this_cpu_add_return_2(pcp, val)	percpu_add_return_op((pcp), val)
+#define __this_cpu_add_return_4(pcp, val)	percpu_add_return_op((pcp), val)
+#define this_cpu_add_return_1(pcp, val)		percpu_add_return_op((pcp), val)
+#define this_cpu_add_return_2(pcp, val)		percpu_add_return_op((pcp), val)
+#define this_cpu_add_return_4(pcp, val)		percpu_add_return_op((pcp), val)
+#endif
 /*
  * Per cpu atomic 64 bit operations are only available under 64 bit.
  * 32 bit must fall back to generic operations.
@@ -324,6 +371,9 @@ do {									\
 #define irqsafe_cpu_or_8(pcp, val)	percpu_to_op("or", (pcp), val)
 #define irqsafe_cpu_xor_8(pcp, val)	percpu_to_op("xor", (pcp), val)
 
+#define __this_cpu_add_return_8(pcp, val)	percpu_add_return_op((pcp), val)
+#define this_cpu_add_return_8(pcp, val)	percpu_add_return_op((pcp), val)
+
 #endif
 
 /* This is not atomic against other CPUs -- CPU preemption needs to be off */


^ permalink raw reply	[flat|nested] 36+ messages in thread

* [thisops uV2 05/10] x86: Use this_cpu_inc_return for nmi counter
  2010-11-26 21:09 [thisops uV2 00/10] Upgrade of this_cpu_ops V2 Christoph Lameter
                   ` (3 preceding siblings ...)
  2010-11-26 21:09 ` [thisops uV2 04/10] x86: Support " Christoph Lameter
@ 2010-11-26 21:09 ` Christoph Lameter
  2010-11-27  8:07   ` Pekka Enberg
  2010-11-27 15:00   ` Mathieu Desnoyers
  2010-11-26 21:09 ` [thisops uV2 06/10] vmstat: Use this_cpu_inc_return for vm statistics Christoph Lameter
                   ` (4 subsequent siblings)
  9 siblings, 2 replies; 36+ messages in thread
From: Christoph Lameter @ 2010-11-26 21:09 UTC (permalink / raw)
  To: akpm; +Cc: Pekka Enberg, linux-kernel, Eric Dumazet, Mathieu Desnoyers,
	Tejun Heo

[-- Attachment #1: this_cpu_add_nmi --]
[-- Type: text/plain, Size: 898 bytes --]

this_cpu_inc_return() saves us a memory access there.

Reviewed-by: Tejun Heo <tj@kernel.org>
Signed-off-by: Christoph Lameter <cl@linux.com>

---
 arch/x86/kernel/apic/nmi.c |    3 +--
 1 file changed, 1 insertion(+), 2 deletions(-)

Index: linux-2.6/arch/x86/kernel/apic/nmi.c
===================================================================
--- linux-2.6.orig/arch/x86/kernel/apic/nmi.c	2010-11-23 16:35:19.000000000 -0600
+++ linux-2.6/arch/x86/kernel/apic/nmi.c	2010-11-23 16:38:29.000000000 -0600
@@ -432,8 +432,7 @@ nmi_watchdog_tick(struct pt_regs *regs,
 		 * Ayiee, looks like this CPU is stuck ...
 		 * wait a few IRQs (5 seconds) before doing the oops ...
 		 */
-		__this_cpu_inc(alert_counter);
-		if (__this_cpu_read(alert_counter) == 5 * nmi_hz)
+		if (__this_cpu_inc_return(alert_counter) == 5 * nmi_hz)
 			/*
 			 * die_nmi will return ONLY if NOTIFY_STOP happens..
 			 */


^ permalink raw reply	[flat|nested] 36+ messages in thread

* [thisops uV2 06/10] vmstat: Use this_cpu_inc_return for vm statistics
  2010-11-26 21:09 [thisops uV2 00/10] Upgrade of this_cpu_ops V2 Christoph Lameter
                   ` (4 preceding siblings ...)
  2010-11-26 21:09 ` [thisops uV2 05/10] x86: Use this_cpu_inc_return for nmi counter Christoph Lameter
@ 2010-11-26 21:09 ` Christoph Lameter
  2010-11-27  8:09   ` Pekka Enberg
  2010-11-26 21:09 ` [thisops uV2 07/10] highmem: Use this_cpu_xx_return() operations Christoph Lameter
                   ` (3 subsequent siblings)
  9 siblings, 1 reply; 36+ messages in thread
From: Christoph Lameter @ 2010-11-26 21:09 UTC (permalink / raw)
  To: akpm; +Cc: Pekka Enberg, linux-kernel, Eric Dumazet, Mathieu Desnoyers,
	Tejun Heo

[-- Attachment #1: this_cpu_add_vmstat --]
[-- Type: text/plain, Size: 1977 bytes --]

this_cpu_inc_return() saves us a memory access there. Code
size does not change.

V1->V2:
	- Fixed the location of the __per_cpu pointer attributes
	- Sparse checked

Signed-off-by: Christoph Lameter <cl@linux.com>

---
 mm/vmstat.c |   20 ++++++++------------
 1 file changed, 8 insertions(+), 12 deletions(-)

Index: linux-2.6/mm/vmstat.c
===================================================================
--- linux-2.6.orig/mm/vmstat.c	2010-11-26 10:35:45.000000000 -0600
+++ linux-2.6/mm/vmstat.c	2010-11-26 10:45:49.000000000 -0600
@@ -167,8 +167,8 @@ static void refresh_zone_stat_thresholds
 void __mod_zone_page_state(struct zone *zone, enum zone_stat_item item,
 				int delta)
 {
-	struct per_cpu_pageset * __percpu pcp = zone->pageset;
-	s8 * __percpu p = pcp->vm_stat_diff + item;
+	struct per_cpu_pageset __percpu *pcp = zone->pageset;
+	s8 __percpu *p = pcp->vm_stat_diff + item;
 	long x;
 	long t;
 
@@ -223,13 +223,11 @@ EXPORT_SYMBOL(mod_zone_page_state);
  */
 void __inc_zone_state(struct zone *zone, enum zone_stat_item item)
 {
-	struct per_cpu_pageset * __percpu pcp = zone->pageset;
-	s8 * __percpu p = pcp->vm_stat_diff + item;
+	struct per_cpu_pageset __percpu *pcp = zone->pageset;
+	s8 __percpu *p = pcp->vm_stat_diff + item;
 	s8 v, t;
 
-	__this_cpu_inc(*p);
-
-	v = __this_cpu_read(*p);
+	v = __this_cpu_inc_return(*p);
 	t = __this_cpu_read(pcp->stat_threshold);
 	if (unlikely(v > t)) {
 		s8 overstep = t >> 1;
@@ -247,13 +245,11 @@ EXPORT_SYMBOL(__inc_zone_page_state);
 
 void __dec_zone_state(struct zone *zone, enum zone_stat_item item)
 {
-	struct per_cpu_pageset * __percpu pcp = zone->pageset;
-	s8 * __percpu p = pcp->vm_stat_diff + item;
+	struct per_cpu_pageset __percpu *pcp = zone->pageset;
+	s8 __percpu *p = pcp->vm_stat_diff + item;
 	s8 v, t;
 
-	__this_cpu_dec(*p);
-
-	v = __this_cpu_read(*p);
+	v = __this_cpu_dec_return(*p);
 	t = __this_cpu_read(pcp->stat_threshold);
 	if (unlikely(v < - t)) {
 		s8 overstep = t >> 1;


^ permalink raw reply	[flat|nested] 36+ messages in thread

* [thisops uV2 07/10] highmem: Use this_cpu_xx_return() operations
  2010-11-26 21:09 [thisops uV2 00/10] Upgrade of this_cpu_ops V2 Christoph Lameter
                   ` (5 preceding siblings ...)
  2010-11-26 21:09 ` [thisops uV2 06/10] vmstat: Use this_cpu_inc_return for vm statistics Christoph Lameter
@ 2010-11-26 21:09 ` Christoph Lameter
  2010-11-26 21:09 ` [thisops uV2 08/10] percpu: generic this_cpu_cmpxchg() and this_cpu_cmpxchg_double support Christoph Lameter
                   ` (2 subsequent siblings)
  9 siblings, 0 replies; 36+ messages in thread
From: Christoph Lameter @ 2010-11-26 21:09 UTC (permalink / raw)
  To: akpm; +Cc: Pekka Enberg, linux-kernel, Eric Dumazet, Mathieu Desnoyers,
	Tejun Heo

[-- Attachment #1: highmem --]
[-- Type: text/plain, Size: 1304 bytes --]

Use this_cpu operations to optimize access primitives for highmem.

The main effect is the avoidance of address calculations through the
use of a segment prefix.

Signed-off-by: Christoph Lameter <cl@linux.com>

---
 include/linux/highmem.h |    7 ++++---
 1 file changed, 4 insertions(+), 3 deletions(-)

Index: linux-2.6/include/linux/highmem.h
===================================================================
--- linux-2.6.orig/include/linux/highmem.h	2010-11-22 14:43:40.000000000 -0600
+++ linux-2.6/include/linux/highmem.h	2010-11-22 14:45:02.000000000 -0600
@@ -81,7 +81,8 @@ DECLARE_PER_CPU(int, __kmap_atomic_idx);
 
 static inline int kmap_atomic_idx_push(void)
 {
-	int idx = __get_cpu_var(__kmap_atomic_idx)++;
+	int idx = __this_cpu_inc_return(__kmap_atomic_idx) - 1;
+
 #ifdef CONFIG_DEBUG_HIGHMEM
 	WARN_ON_ONCE(in_irq() && !irqs_disabled());
 	BUG_ON(idx > KM_TYPE_NR);
@@ -91,12 +92,12 @@ static inline int kmap_atomic_idx_push(v
 
 static inline int kmap_atomic_idx(void)
 {
-	return __get_cpu_var(__kmap_atomic_idx) - 1;
+	return __this_cpu_read(__kmap_atomic_idx) - 1;
 }
 
 static inline int kmap_atomic_idx_pop(void)
 {
-	int idx = --__get_cpu_var(__kmap_atomic_idx);
+	int idx = __this_cpu_dec_return(__kmap_atomic_idx);
 #ifdef CONFIG_DEBUG_HIGHMEM
 	BUG_ON(idx < 0);
 #endif


^ permalink raw reply	[flat|nested] 36+ messages in thread

* [thisops uV2 08/10] percpu: generic this_cpu_cmpxchg() and this_cpu_cmpxchg_double support
  2010-11-26 21:09 [thisops uV2 00/10] Upgrade of this_cpu_ops V2 Christoph Lameter
                   ` (6 preceding siblings ...)
  2010-11-26 21:09 ` [thisops uV2 07/10] highmem: Use this_cpu_xx_return() operations Christoph Lameter
@ 2010-11-26 21:09 ` Christoph Lameter
  2010-11-26 21:09 ` [thisops uV2 09/10] x86: this_cpu_cmpxchg and this_cpu_cmpxchg_double operations Christoph Lameter
  2010-11-26 21:09 ` [thisops uV2 10/10] slub: Lockless fastpaths Christoph Lameter
  9 siblings, 0 replies; 36+ messages in thread
From: Christoph Lameter @ 2010-11-26 21:09 UTC (permalink / raw)
  To: akpm; +Cc: Pekka Enberg, linux-kernel, Eric Dumazet, Mathieu Desnoyers,
	Tejun Heo

[-- Attachment #1: this_cpu_cmpxchg --]
[-- Type: text/plain, Size: 8222 bytes --]

Provide arch code to create the (local atomic) instructions.

Signed-off-by: Christoph Lameter <cl@linux.com>

---
 include/linux/percpu.h |  179 ++++++++++++++++++++++++++++++++++++++++++++++++-
 1 file changed, 178 insertions(+), 1 deletion(-)

Index: linux-2.6/include/linux/percpu.h
===================================================================
--- linux-2.6.orig/include/linux/percpu.h	2010-11-24 10:37:57.000000000 -0600
+++ linux-2.6/include/linux/percpu.h	2010-11-24 10:38:11.000000000 -0600
@@ -254,6 +254,21 @@ extern void __bad_size_call_parameter(vo
 	pscr_ret__;							\
 })
 
+/* Special handling for cmpxchg_double */
+#define __pcpu_size_call_return_int(stem, pcp, ...)			\
+({	int pscr_ret__;							\
+	__verify_pcpu_ptr(pcp);						\
+	switch(sizeof(*pcp)) {						\
+	case 1: pscr_ret__ = stem##1(pcp, __VA_ARGS__);break;		\
+	case 2: pscr_ret__ = stem##2(pcp, __VA_ARGS__);break;		\
+	case 4: pscr_ret__ = stem##4(pcp, __VA_ARGS__);break;		\
+	case 8: pscr_ret__ = stem##8(pcp, __VA_ARGS__);break;		\
+	default:							\
+		__bad_size_call_parameter();break;			\
+	}								\
+	pscr_ret__;							\
+})
+
 #define __pcpu_size_call(stem, variable, ...)				\
 do {									\
 	__verify_pcpu_ptr(&(variable));					\
@@ -317,6 +332,134 @@ do {									\
 # define this_cpu_read(pcp)	__pcpu_size_call_return(this_cpu_read_, (pcp))
 #endif
 
+#define _this_cpu_generic_cmpxchg(pcp, oval, nval)			\
+({	typeof(pcp) ret__;						\
+	preempt_disable();						\
+	ret__ = __this_cpu_read(pcp);					\
+	if (ret__ == (oval))						\
+		__this_cpu_write(pcp, nval);				\
+	preempt_enable();						\
+	ret__;								\
+})
+
+#ifndef this_cpu_cmpxchg
+# ifndef this_cpu_cmpxchg_1
+#  define this_cpu_cmpxchg_1(pcp, oval, nval)	_this_cpu_generic_cmpxchg(pcp, oval, nval)
+# endif
+# ifndef this_cpu_cmpxchg_2
+#  define this_cpu_cmpxchg_2(pcp, oval, nval)	_this_cpu_generic_cmpxchg(pcp, oval, nval)
+# endif
+# ifndef this_cpu_cmpxchg_4
+#  define this_cpu_cmpxchg_4(pcp, oval, nval)	_this_cpu_generic_cmpxchg(pcp, oval, nval)
+# endif
+# ifndef this_cpu_cmpxchg_8
+#  define this_cpu_cmpxchg_8(pcp, oval, nval)	_this_cpu_generic_cmpxchg(pcp, oval, nval)
+# endif
+# define this_cpu_cmpxchg(pcp, oval, nval)	__pcpu_size_call_return2(this_cpu_cmpxchg_, (pcp), oval, nval)
+#endif
+
+#define __this_cpu_generic_cmpxchg(pcp, oval, nval)			\
+({	typeof(pcp) ret__;						\
+	ret__ = __this_cpu_read(pcp);					\
+	if (ret__ == (oval))						\
+		__this_cpu_write(pcp, nval);				\
+	ret__;								\
+})
+
+#ifndef __this_cpu_cmpxchg
+# ifndef __this_cpu_cmpxchg_1
+#  define __this_cpu_cmpxchg_1(pcp, oval, nval)	__this_cpu_generic_cmpxchg(pcp, oval, nval)
+# endif
+# ifndef __this_cpu_cmpxchg_2
+#  define __this_cpu_cmpxchg_2(pcp, oval, nval)	__this_cpu_generic_cmpxchg(pcp, oval, nval)
+# endif
+# ifndef __this_cpu_cmpxchg_4
+#  define __this_cpu_cmpxchg_4(pcp, oval, nval)	__this_cpu_generic_cmpxchg(pcp, oval, nval)
+# endif
+# ifndef __this_cpu_cmpxchg_8
+#  define __this_cpu_cmpxchg_8(pcp, oval, nval)	__this_cpu_generic_cmpxchg(pcp, oval, nval)
+# endif
+# define __this_cpu_cmpxchg(pcp, oval, nval)	__pcpu_size_call_return2(\
+				__this_cpu_cmpxchg_, (pcp), oval, nval)
+#endif
+
+/*
+ * cmpxchg_double replaces two adjacent scalars at once. The first parameter
+ * passed is a percpu pointer, not a scalar like the other this_cpu
+ * operations. This is so because the function operates on two scalars
+ * (must be of same size). A truth value is returned to indicate success or
+ * failure (since a double register result is difficult to handle).
+ * There is very limited hardware support for these operations. So only certain
+ * sizes may work.
+ */
+#define __this_cpu_generic_cmpxchg_double(pcp, oval1, oval2, nval1, nval2)	\
+({	typeof(oval2) * __percpu pcp2 = (typeof(oval2) *)((pcp) + 1);	\
+	int __ret = 0;							\
+	if (__this_cpu_read(*pcp) == (oval1) &&				\
+			 __this_cpu_read(*pcp2)  == (oval2)) {		\
+		__this_cpu_write(*pcp, (nval1));			\
+		__this_cpu_write(*pcp2, (nval2));			\
+		__ret = 1;						\
+	}								\
+	(__ret);							\
+})
+
+#ifndef __this_cpu_cmpxchg_double
+# ifndef __this_cpu_cmpxchg_double_1
+#  define __this_cpu_cmpxchg_double_1(pcp, oval1, oval2, nval1, nval2)	\
+	__this_cpu_generic_cmpxchg_double(pcp, oval1, oval2, nval1, nval2)
+# endif
+# ifndef __this_cpu_cmpxchg_double_2
+#  define __this_cpu_cmpxchg_double_2(pcp, oval1, oval2, nval1, nval2)	\
+	__this_cpu_generic_cmpxchg_double(pcp, oval1, oval2, nval1, nval2)
+# endif
+# ifndef __this_cpu_cmpxchg_double_4
+#  define __this_cpu_cmpxchg_double_4(pcp, oval1, oval2, nval1, nval2)	\
+	__this_cpu_generic_cmpxchg_double(pcp, oval1, oval2, nval1, nval2)
+# endif
+# ifndef __this_cpu_cmpxchg_double_8
+#  define __this_cpu_cmpxchg_double_8(pcp, oval1, oval2, nval1, nval2)	\
+	__this_cpu_generic_cmpxchg_double(pcp, oval1, oval2, nval1, nval2)
+# endif
+# define __this_cpu_cmpxchg_double(pcp, oval1, oval2, nval1, nval2)	\
+	__pcpu_size_call_return_int(__this_cpu_cmpxchg_double_, (pcp),	\
+					 oval1, oval2, nval1, nval2)
+#endif
+
+#define _this_cpu_generic_cmpxchg_double(pcp, oval1, oval2, nval1, nval2)	\
+({	int ret__;							\
+	preempt_disable();						\
+	ret__ = __this_cpu_generic_cmpxchg_double(pcp,			\
+			oval1, oval2, nval1, nval2);			\
+	preempt_enable();						\
+	ret__;								\
+})
+
+#ifndef this_cpu_cmpxchg_double
+# ifndef this_cpu_cmpxchg_double_1
+#  define this_cpu_cmpxchg_double_1(pcp, oval1, oval2, nval1, nval2)	\
+	_this_cpu_generic_cmpxchg_double(pcp, oval1, oval2, nval1, nval2)
+# endif
+# ifndef this_cpu_cmpxchg_double_2
+#  define this_cpu_cmpxchg_double_2(pcp, oval1, oval2, nval1, nval2)	\
+	_this_cpu_generic_cmpxchg_double(pcp, oval1, oval2, nval1, nval2)
+# endif
+# ifndef this_cpu_cmpxchg_double_4
+#  define this_cpu_cmpxchg_double_4(pcp, oval1, oval2, nval1, nval2)	\
+	_this_cpu_generic_cmpxchg_double(pcp, oval1, oval2, nval1, nval2)
+# endif
+# ifndef this_cpu_cmpxchg_double_8
+#  define this_cpu_cmpxchg_double_8(pcp, oval1, oval2, nval1, nval2)	\
+	_this_cpu_generic_cmpxchg_double(pcp, oval1, oval2, nval1, nval2)
+# endif
+# define this_cpu_cmpxchg_double(pcp, oval1, oval2, nval1, nval2)	\
+	__pcpu_size_call_return_int(this_cpu_cmpxchg_double_, (pcp),	\
+		oval1, oval2, nval1, nval2)
+#endif
+
+
+
+
 #define _this_cpu_generic_to_op(pcp, val, op)				\
 do {									\
 	preempt_disable();						\
@@ -603,7 +746,7 @@ do {									\
  * IRQ safe versions of the per cpu RMW operations. Note that these operations
  * are *not* safe against modification of the same variable from another
  * processors (which one gets when using regular atomic operations)
- . They are guaranteed to be atomic vs. local interrupts and
+ * They are guaranteed to be atomic vs. local interrupts and
  * preemption only.
  */
 #define irqsafe_cpu_generic_to_op(pcp, val, op)				\
@@ -690,4 +833,38 @@ do {									\
 # define irqsafe_cpu_xor(pcp, val) __pcpu_size_call(irqsafe_cpu_xor_, (val))
 #endif
 
+#define irqsafe_generic_cmpxchg_double(pcp, oval1, oval2, nval1, nval2)	\
+({	int ret__;							\
+	unsigned long flags;						\
+	local_irq_save(flags);						\
+	ret__ = __this_cpu_generic_cmpxchg_double(pcp,			\
+			oval1, oval2, nval1, nval2);			\
+	local_irq_restore(flags);					\
+	ret__;								\
+})
+
+#ifndef irqsafe_cmpxchg_double
+# ifndef irqsafe_cmpxchg_double_1
+#  define irqsafe_cmpxchg_double_1(pcp, oval1, oval2, nval1, nval2)	\
+	irqsafe_generic_cmpxchg_double(pcp, oval1, oval2, nval1, nval2)
+# endif
+# ifndef irqsafe_cmpxchg_double_2
+#  define irqsafe_cmpxchg_double_2(pcp, oval1, oval2, nval1, nval2)	\
+	irqsafe_generic_cmpxchg_double(pcp, oval1, oval2, nval1, nval2)
+# endif
+# ifndef irqsafe_cmpxchg_double_4
+#  define irqsafe_cmpxchg_double_4(pcp, oval1, oval2, nval1, nval2)	\
+	irqsafe_generic_cmpxchg_double(pcp, oval1, oval2, nval1, nval2)
+# endif
+# ifndef irqsafe_cmpxchg_double_8
+#  define irqsafe_cmpxchg_double_8(pcp, oval1, oval2, nval1, nval2)	\
+	irqsafe_generic_cmpxchg_double(pcp, oval1, oval2, nval1, nval2)
+# endif
+# define irqsafe_cmpxchg_double(pcp, oval1, oval2, nval1, nval2)	\
+	__pcpu_size_call_return_int(irqsafe_cmpxchg_double_, (pcp),	\
+		oval1, oval2, nval1, nval2)
+#endif
+
+
+
 #endif /* __LINUX_PERCPU_H */


^ permalink raw reply	[flat|nested] 36+ messages in thread

* [thisops uV2 09/10] x86: this_cpu_cmpxchg and this_cpu_cmpxchg_double operations
  2010-11-26 21:09 [thisops uV2 00/10] Upgrade of this_cpu_ops V2 Christoph Lameter
                   ` (7 preceding siblings ...)
  2010-11-26 21:09 ` [thisops uV2 08/10] percpu: generic this_cpu_cmpxchg() and this_cpu_cmpxchg_double support Christoph Lameter
@ 2010-11-26 21:09 ` Christoph Lameter
  2010-11-27  6:30   ` Eric Dumazet
  2010-11-27 15:20   ` Mathieu Desnoyers
  2010-11-26 21:09 ` [thisops uV2 10/10] slub: Lockless fastpaths Christoph Lameter
  9 siblings, 2 replies; 36+ messages in thread
From: Christoph Lameter @ 2010-11-26 21:09 UTC (permalink / raw)
  To: akpm; +Cc: Pekka Enberg, linux-kernel, Eric Dumazet, Mathieu Desnoyers,
	Tejun Heo

[-- Attachment #1: this_cpu_cmpxchg_x86 --]
[-- Type: text/plain, Size: 7898 bytes --]

Provide support as far as the hardware capabilities of the x86 cpus
allow.

V1->V2:
	- Mark %rdx clobbering during cmpxchg16b
	- Provide emulation of cmpxchg16b for early AMD processors

Signed-off-by: Christoph Lameter <cl@linux.com>

---
 arch/x86/include/asm/percpu.h |  122 +++++++++++++++++++++++++++++++++++++++++-
 arch/x86/lib/Makefile         |    1 
 arch/x86/lib/cmpxchg16b_emu.S |   55 ++++++++++++++++++
 3 files changed, 177 insertions(+), 1 deletion(-)

Index: linux-2.6/arch/x86/include/asm/percpu.h
===================================================================
--- linux-2.6.orig/arch/x86/include/asm/percpu.h	2010-11-26 12:37:37.000000000 -0600
+++ linux-2.6/arch/x86/include/asm/percpu.h	2010-11-26 14:42:49.000000000 -0600
@@ -216,6 +216,41 @@ do {									\
 	pfo_ret__ + (val);						\
 })
 
+#define percpu_cmpxchg_op(var, oval, nval)				\
+({									\
+	typeof(var) __ret;						\
+	typeof(var) __old = (oval);					\
+	typeof(var) __new = (nval);					\
+	switch (sizeof(var)) {						\
+	case 1:								\
+		asm("cmpxchgb %2, "__percpu_arg(1)			\
+			    : "=a" (__ret), "+m" (&var)			\
+			    : "q" (__new), "0" (__old)			\
+			    : "memory");				\
+		break;							\
+	case 2:								\
+		asm("cmpxchgw %2, "__percpu_arg(1)			\
+			    : "=a" (__ret), "+m" (&var)			\
+			    : "r" (__new), "0" (__old)			\
+			    : "memory");				\
+		break;							\
+	case 4:								\
+		asm("cmpxchgl %2, "__percpu_arg(1)			\
+			    : "=a" (__ret), "+m" (&var)			\
+			    : "r" (__new), "0" (__old)			\
+			    : "memory");				\
+		break;							\
+	case 8:								\
+		asm("cmpxchgq %2, "__percpu_arg(1)			\
+			    : "=a" (__ret), "+m" (&var)			\
+			    : "r" (__new), "0" (__old)			\
+			    : "memory");				\
+		break;							\
+	default: __bad_percpu_size();					\
+	}								\
+	__ret;								\
+})
+
 #define percpu_from_op(op, var, constraint)		\
 ({							\
 	typeof(var) pfo_ret__;				\
@@ -346,7 +381,56 @@ do {									\
 #define this_cpu_add_return_1(pcp, val)		percpu_add_return_op((pcp), val)
 #define this_cpu_add_return_2(pcp, val)		percpu_add_return_op((pcp), val)
 #define this_cpu_add_return_4(pcp, val)		percpu_add_return_op((pcp), val)
-#endif
+
+#define __this_cpu_cmpxchg_1(pcp, oval, nval)	percpu_cmpxchg_op((pcp), oval, nval)
+#define __this_cpu_cmpxchg_2(pcp, oval, nval)	percpu_cmpxchg_op((pcp), oval, nval)
+#define __this_cpu_cmpxchg_4(pcp, oval, nval)	percpu_cmpxchg_op((pcp), oval, nval)
+#define this_cpu_cmpxchg_1(pcp, oval, nval)	percpu_cmpxchg_op((pcp), oval, nval)
+#define this_cpu_cmpxchg_2(pcp, oval, nval)	percpu_cmpxchg_op((pcp), oval, nval)
+#define this_cpu_cmpxchg_4(pcp, oval, nval)	percpu_cmpxchg_op((pcp), oval, nval)
+#endif /* !CONFIG_M386 */
+
+#ifdef CONFIG_X86_CMPXCHG64
+#define percpu_cmpxchg8b_double(pcp, o1, o2, n1, n2)			\
+({									\
+	char __ret;							\
+	typeof(o1) __o1 = o1;						\
+	typeof(o1) __n1 = n1;						\
+	typeof(o2) __o2 = o2;						\
+	typeof(o2) __n2 = n2;						\
+	typeof(o2) __dummy = n2;						\
+	asm("cmpxchg8b "__percpu_arg(1)"\n\tsetz %0\n\t"		\
+		    : "=a"(__ret), "=m" (*pcp), "=d"(__dummy)		\
+		    :  "b"(__n1), "c"(__n2), "a"(__o1), "d"(__o2));	\
+	__ret;								\
+})
+#endif /* CONFIG_X86_CMPXCHG64 */
+
+#define __this_cpu_cmpxchg_double_4(pcp, o1, o2, n1, n2) percpu_cmpxchg8b_double((pcp), o1, o2, n1, n2)
+#define this_cpu_cmpxchg_double_4(pcp, o1, o2, n1, n2)	percpu_cmpxchg8b_double((pcp), o1, o2, n1, n2)
+#define irqsafe_cmpxchg_double_4(pcp, o1, o2, n1, n2)	percpu_cmpxchg8b_double((pcp), o1, o2, n1, n2)
+
+#ifndef CONFIG_X86_64
+#ifdef CONFIG_X86_CMPXCHG64
+/* We can support a 8 byte cmpxchg with a special instruction on 32 bit */
+#define __this_cpu_cmpxchg_8(pcp, oval, nval)				\
+({									\
+	typeof(var) __ret;						\
+	typeof(var) __old = (oval);					\
+	typeof(var) __new = (nval);					\
+	asm("cmpxchg8b %2, "__percpu_arg(1)				\
+	    : "=A" (__ret), "+m" (&pcp)					\
+	    : "b" (((u32)new), "c" ((u32)(new >> 32)),  "0" (__old)	\
+	    : "memory");						\
+	__ret;								\
+})
+
+#define this_cpu_cmpxchg_8(pcp, oval, nval)	__this_cpu_cmpxchg_8(pcp, oval, nval)
+#define irqsafe_cmpxchg_8(pcp, oval, nval)	__this_cpu_cmpxchg_8(pcp, oval, nval)
+
+#endif /* CONFIG_X86_CMPXCHG64 */
+#endif /* !CONFIG_X86_64 */
+
 /*
  * Per cpu atomic 64 bit operations are only available under 64 bit.
  * 32 bit must fall back to generic operations.
@@ -374,6 +458,42 @@ do {									\
 #define __this_cpu_add_return_8(pcp, val)	percpu_add_return_op((pcp), val)
 #define this_cpu_add_return_8(pcp, val)	percpu_add_return_op((pcp), val)
 
+#define __this_cpu_cmpxchg_8(pcp, oval, nval)	percpu_cmpxchg_op((pcp), oval, nval)
+#define this_cpu_cmpxchg_8(pcp, oval, nval)	percpu_cmpxchg_op((pcp), oval, nval)
+
+/*
+ * Pretty complex macro to generate cmpxchg16 instruction. The instruction
+ * is not supported on early AMD64 processors so we must be able to emulate
+ * it in software. The address used in the cmpxchg16 instruction must be
+ * aligned to a 16 byte boundary.
+ */
+
+/*
+ * Something is screwed up with alternate instruction creation. This one
+ * fails with a mysterious asm error about a byte val > 255.
+ */
+#define percpu_cmpxchg16b(pcp, o1, o2, n1, n2)			\
+({									\
+	char __ret;							\
+	typeof(o1) __o1 = o1;						\
+	typeof(o1) __n1 = n1;						\
+	typeof(o2) __o2 = o2;						\
+	typeof(o2) __n2 = n2;						\
+	typeof(o2) __dummy;						\
+	VM_BUG_ON(((unsigned long)pcp) % 16);				\
+	alternative_io("call cmpxchg16b_local\n\t" P6_NOP4,		\
+			"cmpxchg16b %%gs:(%%rsi)\n\tsetz %0\n\t",	\
+			X86_FEATURE_CX16,				\
+		    	ASM_OUTPUT2("=a"(__ret), "=d"(__dummy)),	\
+		        "S" (pcp), "b"(__n1), "c"(__n2),		\
+			 "a"(__o1), "d"(__o2));				\
+	__ret;								\
+})
+
+#define __this_cpu_cmpxchg_double_8(pcp, o1, o2, n1, n2) percpu_cmpxchg16b((pcp), o1, o2, n1, n2)
+#define this_cpu_cmpxchg_double_8(pcp, o1, o2, n1, n2)	percpu_cmpxchg16b((pcp), o1, o2, n1, n2)
+#define irqsafe_cmpxchg_double_8(pcp, o1, o2, n1, n2)	percpu_cmpxchg16b((pcp), o1, o2, n1, n2)
+
 #endif
 
 /* This is not atomic against other CPUs -- CPU preemption needs to be off */
Index: linux-2.6/arch/x86/lib/Makefile
===================================================================
--- linux-2.6.orig/arch/x86/lib/Makefile	2010-11-26 12:08:33.000000000 -0600
+++ linux-2.6/arch/x86/lib/Makefile	2010-11-26 12:50:50.000000000 -0600
@@ -42,4 +42,5 @@ else
         lib-y += memmove_64.o memset_64.o
         lib-y += copy_user_64.o rwlock_64.o copy_user_nocache_64.o
 	lib-$(CONFIG_RWSEM_XCHGADD_ALGORITHM) += rwsem_64.o
+	lib-y += cmpxchg16b_emu.o
 endif
Index: linux-2.6/arch/x86/lib/cmpxchg16b_emu.S
===================================================================
--- /dev/null	1970-01-01 00:00:00.000000000 +0000
+++ linux-2.6/arch/x86/lib/cmpxchg16b_emu.S	2010-11-26 13:14:02.000000000 -0600
@@ -0,0 +1,55 @@
+/*
+ *	This program is free software; you can redistribute it and/or
+ *	modify it under the terms of the GNU General Public License
+ *	as published by the Free Software Foundation; version 2
+ *	of the License.
+ *
+ */
+
+#include <linux/linkage.h>
+#include <asm/alternative-asm.h>
+#include <asm/frame.h>
+#include <asm/dwarf2.h>
+
+
+.text
+
+/*
+ * Inputs:
+ * %rsi : memory location to compare
+ * %rax : low 64 bits of old value
+ * %rdx : high 64 bits of old value
+ * %rbx : low 64 bits of new value
+ * %rcx : high 64 bits of new value
+ * %al  : Operation successful
+ */
+ENTRY(cmpxchg16b_local_emu)
+CFI_STARTPROC
+
+#
+# Emulate 'cmpxchg16b %gs:(%rsi)' except we return the result in
+# al not via the ZF. Caller will access al to get result.
+#
+cmpxchg16b_local_emu:
+	pushf
+	cli
+
+	cmpq  %gs:(%rsi), %rax
+	jne not_same
+	cmpq %gs:8(%rsi), %rdx
+	jne not_same
+
+	movq %rbx,  %gs:(%esi)
+	movq %rcx, %gs:8(%esi)
+
+	popf
+	mov $1, %al
+	ret
+
+ not_same:
+	popf
+	xor  %al,%al
+	ret
+
+CFI_ENDPROC
+ENDPROC(cmpxchg16b_local_emu)


^ permalink raw reply	[flat|nested] 36+ messages in thread

* [thisops uV2 10/10] slub: Lockless fastpaths
  2010-11-26 21:09 [thisops uV2 00/10] Upgrade of this_cpu_ops V2 Christoph Lameter
                   ` (8 preceding siblings ...)
  2010-11-26 21:09 ` [thisops uV2 09/10] x86: this_cpu_cmpxchg and this_cpu_cmpxchg_double operations Christoph Lameter
@ 2010-11-26 21:09 ` Christoph Lameter
  9 siblings, 0 replies; 36+ messages in thread
From: Christoph Lameter @ 2010-11-26 21:09 UTC (permalink / raw)
  To: akpm; +Cc: Pekka Enberg, linux-kernel, Eric Dumazet, Mathieu Desnoyers,
	Tejun Heo

[-- Attachment #1: slub_generation --]
[-- Type: text/plain, Size: 21942 bytes --]

Use the this_cpu_cmpxchg_double functionality to implement a lockless
allocation algorithm.

Each of the per cpu pointers is paired with a transaction id that ensures
that updates of the per cpu information can only occur in sequence on
a certain cpu.

A transaction id is a "long" integer that is comprised of an event number
and the cpu number. The event number is incremented for every change to the
per cpu state. This means that the cmpxchg instruction can verify for an
update that nothing interfered and that we are updating the percpu structure
for the processor where we picked up the information and that we are also
currently on that processor when we update the information.

This results in a significant decrease of the overhead in the fastpaths. It
also makes it easy to adopt the fast path for realtime kernels since this
is lockless and does not require that the use of the current per cpu area
over the critical section. It is only important that the per cpu area is
current at the beginning of the critical section and at that end.

So there is no need even to disable preemption which will make the allocations
scale well in a RT environment.

V1->V2:
	- Round cpu number to the next power of two.
	- Deal with debug hooks that expect irqs to be disabled.
	- Performance test

Test results show that the fastpath cycle count is reduced by up to ~ 40%
(alloc/free test goes from ~140 cycles down to ~80). The slowpath for kfree
adds a few cycles.

Sadly this does nothing for the slowpath which is where the issues with
performance in slub are but the best case performance rises significantly.

Before:

Single thread testing
=====================
1. Kmalloc: Repeatedly allocate then free test
10000 times kmalloc(8) -> 102 cycles kfree -> 111 cycles
10000 times kmalloc(16) -> 101 cycles kfree -> 111 cycles
10000 times kmalloc(32) -> 120 cycles kfree -> 114 cycles
10000 times kmalloc(64) -> 161 cycles kfree -> 130 cycles
10000 times kmalloc(128) -> 284 cycles kfree -> 129 cycles
10000 times kmalloc(256) -> 410 cycles kfree -> 134 cycles
10000 times kmalloc(512) -> 312 cycles kfree -> 197 cycles
10000 times kmalloc(1024) -> 377 cycles kfree -> 494 cycles
10000 times kmalloc(2048) -> 571 cycles kfree -> 522 cycles
10000 times kmalloc(4096) -> 674 cycles kfree -> 565 cycles
10000 times kmalloc(8192) -> 836 cycles kfree -> 648 cycles
10000 times kmalloc(16384) -> 1201 cycles kfree -> 775 cycles
2. Kmalloc: alloc/free test
10000 times kmalloc(8)/kfree -> 142 cycles
10000 times kmalloc(16)/kfree -> 142 cycles
10000 times kmalloc(32)/kfree -> 142 cycles
10000 times kmalloc(64)/kfree -> 142 cycles
10000 times kmalloc(128)/kfree -> 142 cycles
10000 times kmalloc(256)/kfree -> 140 cycles
10000 times kmalloc(512)/kfree -> 140 cycles
10000 times kmalloc(1024)/kfree -> 144 cycles
10000 times kmalloc(2048)/kfree -> 144 cycles
10000 times kmalloc(4096)/kfree -> 144 cycles
10000 times kmalloc(8192)/kfree -> 144 cycles
10000 times kmalloc(16384)/kfree -> 913 cycles

After:

1. Kmalloc: Repeatedly allocate then free test
10000 times kmalloc(8) -> 69 cycles kfree -> 115 cycles
10000 times kmalloc(16) -> 73 cycles kfree -> 115 cycles
10000 times kmalloc(32) -> 86 cycles kfree -> 119 cycles
10000 times kmalloc(64) -> 122 cycles kfree -> 125 cycles
10000 times kmalloc(128) -> 247 cycles kfree -> 132 cycles
10000 times kmalloc(256) -> 375 cycles kfree -> 137 cycles
10000 times kmalloc(512) -> 283 cycles kfree -> 183 cycles
10000 times kmalloc(1024) -> 316 cycles kfree -> 504 cycles
10000 times kmalloc(2048) -> 516 cycles kfree -> 531 cycles
10000 times kmalloc(4096) -> 610 cycles kfree -> 570 cycles
10000 times kmalloc(8192) -> 759 cycles kfree -> 651 cycles
10000 times kmalloc(16384) -> 1169 cycles kfree -> 778 cycles
2. Kmalloc: alloc/free test
10000 times kmalloc(8)/kfree -> 81 cycles
10000 times kmalloc(16)/kfree -> 81 cycles
10000 times kmalloc(32)/kfree -> 81 cycles
10000 times kmalloc(64)/kfree -> 81 cycles
10000 times kmalloc(128)/kfree -> 81 cycles
10000 times kmalloc(256)/kfree -> 87 cycles
10000 times kmalloc(512)/kfree -> 87 cycles
10000 times kmalloc(1024)/kfree -> 87 cycles
10000 times kmalloc(2048)/kfree -> 84 cycles
10000 times kmalloc(4096)/kfree -> 81 cycles
10000 times kmalloc(8192)/kfree -> 81 cycles
10000 times kmalloc(16384)/kfree -> 927 cycles


Concurrent allocs
=================
Before:

Kmalloc N*alloc N*free(8): 0=118/128 1=125/131 2=118/147 3=123/133 4=115/128 5=122/149 6=118/130 7=120/133 Average=120/135
Kmalloc N*alloc N*free(16): 0=137/147 1=154/141 2=128/145 3=130/143 4=122/143 5=144/142 6=128/141 7=145/142 Average=136/143
Kmalloc N*alloc N*free(32): 0=170/269 1=175/269 2=159/273 3=193/266 4=156/268 5=160/273 6=163/267 7=185/268 Average=170/269
Kmalloc N*alloc N*free(64): 0=284/1090 1=257/1088 2=258/1086 3=259/1086 4=242/1077 5=264/1091 6=279/1086 7=255/1089 Average=262/1087
Kmalloc N*alloc N*free(128): 0=445/2320 1=444/2323 2=437/2318 3=443/2321 4=416/2309 5=435/2323 6=428/2310 7=434/2321 Average=435/2318
Kmalloc N*alloc N*free(256): 0=1138/1171 1=1153/1271 2=1144/1262 3=1155/1281 4=1147/1246 5=1154/1229 6=1147/1252 7=1149/1192 Average=1148/1238
Kmalloc N*alloc N*free(512): 0=1684/1741 1=1700/1781 2=1709/1751 3=1708/1723 4=1696/1650 5=1707/1808 6=1697/1775 7=1706/1814 Average=1701/1755
Kmalloc N*alloc N*free(1024): 0=3116/3334 1=3224/3338 2=3215/3337 3=3159/3335 4=3118/3289 5=3090/3354 6=3119/3272 7=3091/3352 Average=3142/3326
Kmalloc N*alloc N*free(2048): 0=3342/3408 1=3371/3388 2=3357/3436 3=3356/3465 4=3367/3445 5=3372/3460 6=3357/3438 7=3369/3433 Average=3361/3434
Kmalloc N*alloc N*free(4096): 0=3641/8467 1=3659/8496 2=3651/8454 3=3660/8488 4=3659/8446 5=3656/8488 6=3646/8406 7=3664/8478 Average=3654/8465
Kmalloc N*(alloc free)(8): 0=154 1=156 2=153 3=156 4=154 5=156 6=155 7=156 Average=155
Kmalloc N*(alloc free)(16): 0=154 1=156 2=154 3=156 4=154 5=156 6=154 7=156 Average=155
Kmalloc N*(alloc free)(32): 0=154 1=156 2=153 3=156 4=154 5=156 6=154 7=156 Average=155
Kmalloc N*(alloc free)(64): 0=154 1=156 2=153 3=156 4=155 5=157 6=155 7=157 Average=156
Kmalloc N*(alloc free)(128): 0=154 1=156 2=153 3=156 4=154 5=158 6=154 7=156 Average=155
Kmalloc N*(alloc free)(256): 0=156 1=153 2=150 3=157 4=151 5=153 6=155 7=157 Average=154
Kmalloc N*(alloc free)(512): 0=157 1=168 2=154 3=157 4=155 5=157 6=155 7=157 Average=157
Kmalloc N*(alloc free)(1024): 0=157 1=157 2=155 3=159 4=157 5=158 6=157 7=160 Average=158
Kmalloc N*(alloc free)(2048): 0=155 1=157 2=155 3=172 4=156 5=157 6=152 7=157 Average=158
Kmalloc N*(alloc free)(4096): 0=155 1=157 2=155 3=157 4=155 5=157 6=156 7=157 Average=156

After:

Kmalloc N*alloc N*free(8): 0=90/168 1=86/139 2=82/137 3=88/147 4=81/142 5=89/164 6=87/141 7=100/140 Average=88/147
Kmalloc N*alloc N*free(16): 0=93/158 1=116/161 2=94/149 3=103/161 4=87/149 5=113/168 6=90/149 7=119/153 Average=102/156
Kmalloc N*alloc N*free(32): 0=137/191 1=154/192 2=129/187 3=139/188 4=122/186 5=134/195 6=125/186 7=147/185 Average=136/189
Kmalloc N*alloc N*free(64): 0=227/1054 1=269/1062 2=226/1059 3=235/1059 4=198/1063 5=220/1061 6=248/1052 7=211/1062 Average=229/1059
Kmalloc N*alloc N*free(128): 0=406/2217 1=383/2247 2=395/2218 3=389/2240 4=386/2208 5=405/2224 6=381/2207 7=398/2224 Average=393/2223
Kmalloc N*alloc N*free(256): 0=1120/1038 1=1125/1168 2=1123/1153 3=1118/1156 4=1107/1134 5=1121/1190 6=1111/1087 7=1124/1026 Average=1119/1119
Kmalloc N*alloc N*free(512): 0=1691/1812 1=1728/1806 2=1710/1781 3=1718/1810 4=1719/1723 5=1727/1826 6=1710/1757 7=1712/1815 Average=1714/1791
Kmalloc N*alloc N*free(1024): 0=3169/3344 1=3162/3306 2=3134/3347 3=3162/3320 4=3166/3328 5=3179/3242 6=3147/3344 7=3160/3265 Average=3160/3312
Kmalloc N*alloc N*free(2048): 0=3366/3448 1=3382/3463 2=3371/3464 3=3386/3431 4=3375/3428 5=3371/3445 6=3379/3426 7=3385/3378 Average=3377/3435
Kmalloc N*alloc N*free(4096): 0=3761/8883 1=3780/8924 2=3768/8880 3=3779/8921 4=3775/8871 5=3781/8915 6=3770/8860 7=3780/8911 Average=3774/8896
Kmalloc N*(alloc free)(8): 0=94 1=97 2=94 3=96 4=95 5=97 6=94 7=97 Average=96
Kmalloc N*(alloc free)(16): 0=95 1=97 2=94 3=97 4=95 5=97 6=95 7=97 Average=96
Kmalloc N*(alloc free)(32): 0=101 1=98 2=94 3=98 4=95 5=98 6=95 7=98 Average=97
Kmalloc N*(alloc free)(64): 0=95 1=97 2=95 3=98 4=95 5=97 6=94 7=97 Average=96
Kmalloc N*(alloc free)(128): 0=100 1=97 2=94 3=97 4=95 5=97 6=95 7=96 Average=96
Kmalloc N*(alloc free)(256): 0=86 1=102 2=87 3=96 4=91 5=99 6=91 7=101 Average=94
Kmalloc N*(alloc free)(512): 0=103 1=98 2=95 3=98 4=101 5=99 6=101 7=104 Average=100
Kmalloc N*(alloc free)(1024): 0=102 1=102 2=99 3=96 4=100 5=100 6=101 7=96 Average=100
Kmalloc N*(alloc free)(2048): 0=99 1=102 2=98 3=103 4=100 5=102 6=100 7=95 Average=100
Kmalloc N*(alloc free)(4096): 0=98 1=103 2=98 3=104 4=101 5=102 6=97 7=102 Average=101

Remote free test
================

Before:

N*remote free(8): 0=6/987 1=120/0 2=117/0 3=117/0 4=115/0 5=119/0 6=115/0 7=119/0 Average=104/123
N*remote free(16): 0=6/1170 1=129/0 2=132/0 3=129/0 4=134/0 5=128/0 6=136/0 7=130/0 Average=115/146
N*remote free(32): 0=6/1447 1=177/0 2=164/0 3=174/0 4=177/0 5=175/0 6=178/0 7=175/0 Average=153/180
N*remote free(64): 0=6/2021 1=377/0 2=274/0 3=341/0 4=295/0 5=334/0 6=311/0 7=311/0 Average=281/252
N*remote free(128): 0=6/2415 1=588/0 2=454/0 3=588/0 4=469/0 5=600/0 6=526/0 7=582/0 Average=477/302
N*remote free(256): 0=6/2708 1=857/0 2=803/0 3=856/0 4=817/0 5=851/0 6=816/0 7=848/0 Average=732/338
N*remote free(512): 0=6/3110 1=1285/0 2=1265/0 3=1280/0 4=1268/0 5=1288/0 6=1262/0 7=1281/0 Average=1117/388
N*remote free(1024): 0=6/3756 1=2566/0 2=2554/0 3=2566/0 4=2552/0 5=2547/0 6=2542/0 7=2564/0 Average=2237/469
N*remote free(2048): 0=6/4253 1=2813/0 2=2816/0 3=2817/0 4=2786/0 5=2816/0 6=2789/0 7=2818/0 Average=2458/531
N*remote free(4096): 0=6/5138 1=3056/0 2=3048/0 3=3055/0 4=3030/0 5=3060/0 6=3032/0 7=3055/0 Average=2668/642

After:

N*remote free(8): 0=6/1008 1=82/0 2=81/0 3=84/0 4=84/0 5=84/0 6=84/0 7=83/0 Average=74/126
N*remote free(16): 0=6/1165 1=102/0 2=93/0 3=101/0 4=96/0 5=99/0 6=94/0 7=100/0 Average=86/145
N*remote free(32): 0=6/1462 1=150/0 2=135/0 3=150/0 4=143/0 5=153/0 6=145/0 7=148/0 Average=129/182
N*remote free(64): 0=6/2068 1=346/0 2=246/0 3=287/0 4=268/0 5=309/0 6=271/0 7=346/0 Average=260/258
N*remote free(128): 0=6/2425 1=577/0 2=468/0 3=555/0 4=531/0 5=552/0 6=464/0 7=560/0 Average=464/303
N*remote free(256): 0=6/2788 1=838/0 2=778/0 3=836/0 4=806/0 5=836/0 6=813/0 7=833/0 Average=718/348
N*remote free(512): 0=6/3147 1=1306/0 2=1269/0 3=1301/0 4=1273/0 5=1296/0 6=1289/0 7=1300/0 Average=1130/393
N*remote free(1024): 0=6/3785 1=2565/0 2=2535/0 3=2564/0 4=2524/0 5=2568/0 6=2545/0 7=2565/0 Average=2234/473
N*remote free(2048): 0=6/4284 1=2822/0 2=2819/0 3=2818/0 4=2797/0 5=2827/0 6=2810/0 7=2828/0 Average=2466/535
N*remote free(4096): 0=6/5177 1=3097/0 2=3086/0 3=3091/0 4=3064/0 5=3101/0 6=3072/0 7=3099/0 Average=2702/647

1 alloc N free test
===================

Before:

1 alloc N free(8): 0=1227 1=568 2=307 3=523 4=330 5=567 6=349 7=516 Average=548
1 alloc N free(16): 0=1444 1=625 2=588 3=627 4=587 5=627 6=600 7=616 Average=714
1 alloc N free(32): 0=1796 1=814 2=796 3=817 4=795 5=820 6=798 7=819 Average=932
1 alloc N free(64): 0=2906 1=1598 2=1595 3=1597 4=1596 5=1596 6=1595 7=1595 Average=1760
1 alloc N free(128): 0=4686 1=1815 2=1813 3=1813 4=1812 5=1813 6=1812 7=1813 Average=2172
1 alloc N free(256): 0=4884 1=1793 2=1788 3=1790 4=1788 5=1792 6=1788 7=1792 Average=2177
1 alloc N free(512): 0=4956 1=1780 2=1778 3=1780 4=1778 5=1778 6=1778 7=1778 Average=2176
1 alloc N free(1024): 0=5225 1=1696 2=1692 3=1699 4=1693 5=1698 6=1694 7=1698 Average=2137
1 alloc N free(2048): 0=5683 1=1822 2=1821 3=1822 4=1822 5=1821 6=1822 7=1821 Average=2304
1 alloc N free(4096): 0=6049 1=1477 2=1478 3=1478 4=1477 5=1478 6=1478 7=1478 Average=2049

After:
1 alloc N free(8): 0=932 1=381 2=259 3=407 4=301 5=414 6=304 7=414 Average=427
1 alloc N free(16): 0=1020 1=372 2=328 3=372 4=337 5=362 6=338 7=362 Average=436
1 alloc N free(32): 0=1628 1=847 2=826 3=846 4=825 5=840 6=825 7=844 Average=935
1 alloc N free(64): 0=2043 1=944 2=938 3=939 4=937 5=943 6=937 7=942 Average=1078
1 alloc N free(128): 0=4249 1=1597 2=1584 3=1592 4=1590 5=1595 6=1588 7=1593 Average=1923
1 alloc N free(256): 0=4529 1=1620 2=1614 3=1620 4=1614 5=1620 6=1615 7=1620 Average=1982
1 alloc N free(512): 0=4530 1=1548 2=1545 3=1547 4=1544 5=1545 6=1545 7=1545 Average=1919
1 alloc N free(1024): 0=5187 1=1822 2=1821 3=1823 4=1821 5=1823 6=1822 7=1822 Average=2243
1 alloc N free(2048): 0=5441 1=1803 2=1803 3=1799 4=1800 5=1802 6=1800 7=1801 Average=2256
1 alloc N free(4096): 0=5949 1=1580 2=1579 3=1579 4=1578 5=1579 6=1578 7=1578 Average=2125

Signed-off-by: Christoph Lameter <cl@linux.com>

---
 include/linux/slub_def.h |    3 
 mm/slub.c                |  171 ++++++++++++++++++++++++++++++++++++++++-------
 2 files changed, 149 insertions(+), 25 deletions(-)

Index: linux-2.6/include/linux/slub_def.h
===================================================================
--- linux-2.6.orig/include/linux/slub_def.h	2010-11-26 10:45:54.000000000 -0600
+++ linux-2.6/include/linux/slub_def.h	2010-11-26 11:18:07.000000000 -0600
@@ -36,7 +36,8 @@ enum stat_item {
 	NR_SLUB_STAT_ITEMS };
 
 struct kmem_cache_cpu {
-	void **freelist;	/* Pointer to first free per cpu object */
+	void **freelist;		/* Pointer to next available object */
+	unsigned long tid;	/* Globally unique transaction id */
 	struct page *page;	/* The slab from which we are allocating */
 	int node;		/* The node of the page (or -1 for debug) */
 #ifdef CONFIG_SLUB_STATS
Index: linux-2.6/mm/slub.c
===================================================================
--- linux-2.6.orig/mm/slub.c	2010-11-26 10:45:54.000000000 -0600
+++ linux-2.6/mm/slub.c	2010-11-26 11:49:56.000000000 -0600
@@ -805,14 +805,20 @@ static inline void slab_post_alloc_hook(
 static inline void slab_free_hook(struct kmem_cache *s, void *x)
 {
 	kmemleak_free_recursive(x, s->flags);
-}
 
-static inline void slab_free_hook_irq(struct kmem_cache *s, void *object)
-{
+	/*
+	 * Trouble is that we no longer disable interupts in the fast path
+	 * So in order to make the debug calls that expect irqs to be
+	 * disabled we need to disable interrupts temporarily.
+	 */
+#if defined(CONFIG_KMEMCHECK) || defined(CONFIG_LOCKDEP)
+	local_irq_save(flags);
 	kmemcheck_slab_free(s, object, s->objsize);
 	debug_check_no_locks_freed(object, s->objsize);
 	if (!(s->flags & SLAB_DEBUG_OBJECTS))
 		debug_check_no_obj_freed(object, s->objsize);
+	local_irq_restore(flags);
+#endif
 }
 
 /*
@@ -1099,9 +1105,6 @@ static inline void slab_post_alloc_hook(
 
 static inline void slab_free_hook(struct kmem_cache *s, void *x) {}
 
-static inline void slab_free_hook_irq(struct kmem_cache *s,
-		void *object) {}
-
 #endif /* CONFIG_SLUB_DEBUG */
 
 /*
@@ -1486,6 +1489,53 @@ static void unfreeze_slab(struct kmem_ca
 }
 
 /*
+ * Calculate the next globally unique transaction for disambiguiation
+ * during cmpxchg. The transactions start with the cpu number and are then
+ * incremented by CONFIG_NR_CPUS.
+ */
+#define TID_STEP  roundup_pow_of_two(CONFIG_NR_CPUS)
+
+static inline unsigned long next_tid(unsigned long tid)
+{
+	return tid + TID_STEP;
+}
+
+static inline unsigned int tid_to_cpu(unsigned long tid)
+{
+	return tid % TID_STEP;
+}
+
+static inline unsigned long tid_to_event(unsigned long tid)
+{
+	return tid / TID_STEP;
+}
+
+static inline unsigned int init_tid(int cpu)
+{
+	return cpu;
+}
+
+static void note_cmpxchg_failure(const char *n,
+		const struct kmem_cache *s, unsigned long tid)
+{
+#ifdef CONFIG_DEBUG_VM
+	unsigned long actual_tid = __this_cpu_read(s->cpu_slab->tid);
+
+	printk(KERN_INFO "%s %s: cmpxchg redo ", n, s->name);
+
+	if (tid_to_cpu(tid) != tid_to_cpu(actual_tid))
+		printk("due to cpu change %d -> %d\n",
+			tid_to_cpu(tid), tid_to_cpu(actual_tid));
+	else if (tid_to_event(tid) != tid_to_event(actual_tid))
+		printk("due to cpu running other code. Event %ld->%ld\n",
+			tid_to_event(tid), tid_to_event(actual_tid));
+	else
+		printk("for unknown reason: actual=%lx was=%lx target=%lx\n",
+			actual_tid, tid, next_tid(tid));
+#endif
+}
+
+/*
  * Remove the cpu slab
  */
 static void deactivate_slab(struct kmem_cache *s, struct kmem_cache_cpu *c)
@@ -1509,6 +1559,7 @@ static void deactivate_slab(struct kmem_
 		/* Retrieve object from cpu_freelist */
 		object = c->freelist;
 		c->freelist = get_freepointer(s, c->freelist);
+		c->tid = next_tid(c->tid);
 
 		/* And put onto the regular freelist */
 		set_freepointer(s, object, page->freelist);
@@ -1646,10 +1697,19 @@ slab_out_of_memory(struct kmem_cache *s,
  * a call to the page allocator and the setup of a new slab.
  */
 static void *__slab_alloc(struct kmem_cache *s, gfp_t gfpflags, int node,
-			  unsigned long addr, struct kmem_cache_cpu *c)
+			  unsigned long addr)
 {
 	void **object;
 	struct page *new;
+	unsigned long flags;
+	struct kmem_cache_cpu *c;
+
+	local_irq_save(flags);
+	/*
+	 * Interrupts are disabled. We will continue to run on this cpu
+	 * an no other code may interrupt.
+	 */
+	c = this_cpu_ptr(s->cpu_slab);
 
 	/* We handle __GFP_ZERO in the caller */
 	gfpflags &= ~__GFP_ZERO;
@@ -1675,7 +1735,9 @@ load_freelist:
 	c->page->freelist = NULL;
 	c->node = page_to_nid(c->page);
 unlock_out:
+	c->tid = next_tid(c->tid);
 	slab_unlock(c->page);
+	local_irq_restore(flags);
 	stat(s, ALLOC_SLOWPATH);
 	return object;
 
@@ -1737,25 +1799,57 @@ static __always_inline void *slab_alloc(
 {
 	void **object;
 	struct kmem_cache_cpu *c;
-	unsigned long flags;
+	unsigned long tid;
 
 	if (slab_pre_alloc_hook(s, gfpflags))
 		return NULL;
 
-	local_irq_save(flags);
+redo:
+	/*
+	 * Must read kmem_cache cpu data via this cpu ptr. Preemption is
+	 * enabled. We may switch back and forth between cpus while
+	 * reading from one cpu area. That does not matter as long
+	 * as we end up on the original cpu again when doing the cmpxchg.
+	 */
 	c = __this_cpu_ptr(s->cpu_slab);
+
+	/*
+	 * The transaction ids are globally unique per cpu and per operation on
+	 * a per cpu queue. Thus they can be guarantee that the cmpxchg_double
+	 * occurs on the right processor and that there was no operation on the
+	 * linked list in between.
+	 */
+	tid = c->tid;
+	barrier();
+
 	object = c->freelist;
-	if (unlikely(!object || !node_match(c, node)))
+	if (unlikely(!object || !node_match(c, c->node)))
 
-		object = __slab_alloc(s, gfpflags, node, addr, c);
+		object = __slab_alloc(s, gfpflags, c->node, addr);
 
 	else {
-		c->freelist = get_freepointer(s, object);
+		/*
+		 * The cmpxchg will only match if there was not additonal
+		 * operation and if we are on the right processor.
+		 *
+		 * The cmpxchg does the following atomically (without lock semantics!)
+		 * 1. Relocate first pointer to the current per cpu area.
+		 * 2. Verify that tid and freelist have not been changed
+		 * 3. If they were not changed replace tid and freelist
+		 *
+		 * Since this is without lock semantics the protection is only against
+		 * code executing on this cpu *not* from access by other cpus.
+		 */
+		if (unlikely(!irqsafe_cmpxchg_double(&s->cpu_slab->freelist, object, tid,
+				get_freepointer(s, object), next_tid(tid)))) {
+
+			note_cmpxchg_failure("slab_alloc", s, tid);
+			goto redo;
+		}
 		stat(s, ALLOC_FASTPATH);
 	}
-	local_irq_restore(flags);
 
-	if (unlikely(gfpflags & __GFP_ZERO) && object)
+	if ((gfpflags & __GFP_ZERO) && object)
 		memset(object, 0, s->objsize);
 
 	slab_post_alloc_hook(s, gfpflags, object);
@@ -1817,8 +1911,10 @@ static void __slab_free(struct kmem_cach
 {
 	void *prior;
 	void **object = (void *)x;
+	unsigned long flags;
 
 	stat(s, FREE_SLOWPATH);
+	local_irq_save(flags);
 	slab_lock(page);
 
 	if (kmem_cache_debug(s))
@@ -1849,6 +1945,7 @@ checks_ok:
 
 out_unlock:
 	slab_unlock(page);
+	local_irq_restore(flags);
 	return;
 
 slab_empty:
@@ -1860,6 +1957,7 @@ slab_empty:
 		stat(s, FREE_REMOVE_PARTIAL);
 	}
 	slab_unlock(page);
+	local_irq_restore(flags);
 	stat(s, FREE_SLAB);
 	discard_slab(s, page);
 	return;
@@ -1886,23 +1984,36 @@ static __always_inline void slab_free(st
 {
 	void **object = (void *)x;
 	struct kmem_cache_cpu *c;
-	unsigned long flags;
+	unsigned long tid;
 
 	slab_free_hook(s, x);
 
-	local_irq_save(flags);
-	c = __this_cpu_ptr(s->cpu_slab);
+redo:
+	/*
+	 * Determine the currently cpus per cpu slab.
+	 * The cpu may change afterward. However that does not matter since
+	 * data is retrieved via this pointer. If we are on the same cpu
+	 * during the cmpxchg then the free will succedd.
+	 */
+	c = this_cpu_ptr(s->cpu_slab);
+	tid = c->tid;
+	barrier();
 
-	slab_free_hook_irq(s, x);
+	if (likely(page == c->page) &&
+			c->node != NUMA_NO_NODE) {
 
-	if (likely(page == c->page && c->node != NUMA_NO_NODE)) {
 		set_freepointer(s, object, c->freelist);
-		c->freelist = object;
+
+		if (unlikely(!irqsafe_cmpxchg_double(&s->cpu_slab->freelist,
+				c->freelist, tid,
+				object, next_tid(tid)))) {
+
+			note_cmpxchg_failure("slab_free", s, tid);
+			goto redo;
+		}
 		stat(s, FREE_FASTPATH);
 	} else
 		__slab_free(s, page, x, addr);
-
-	local_irq_restore(flags);
 }
 
 void kmem_cache_free(struct kmem_cache *s, void *x)
@@ -2102,12 +2213,24 @@ init_kmem_cache_node(struct kmem_cache_n
 
 static inline int alloc_kmem_cache_cpus(struct kmem_cache *s)
 {
+	int cpu;
+
 	BUILD_BUG_ON(PERCPU_DYNAMIC_EARLY_SIZE <
 			SLUB_PAGE_SHIFT * sizeof(struct kmem_cache_cpu));
 
-	s->cpu_slab = alloc_percpu(struct kmem_cache_cpu);
+	/*
+	 * Must align to double word boundary for the long cmpxchg instructions
+	 * to work.
+	 */
+	s->cpu_slab = __alloc_percpu(sizeof(struct kmem_cache_cpu), 2 * sizeof(void *));
 
-	return s->cpu_slab != NULL;
+	if (!s->cpu_slab)
+		return 0;
+
+	for_each_possible_cpu(cpu)
+		per_cpu_ptr(s->cpu_slab, cpu)->tid = init_tid(cpu);
+
+	return 1;
 }
 
 static struct kmem_cache *kmem_cache_node;


^ permalink raw reply	[flat|nested] 36+ messages in thread

* Re: [thisops uV2 09/10] x86: this_cpu_cmpxchg and this_cpu_cmpxchg_double operations
  2010-11-26 21:09 ` [thisops uV2 09/10] x86: this_cpu_cmpxchg and this_cpu_cmpxchg_double operations Christoph Lameter
@ 2010-11-27  6:30   ` Eric Dumazet
  2010-11-27 15:20   ` Mathieu Desnoyers
  1 sibling, 0 replies; 36+ messages in thread
From: Eric Dumazet @ 2010-11-27  6:30 UTC (permalink / raw)
  To: Christoph Lameter
  Cc: akpm, Pekka Enberg, linux-kernel, Mathieu Desnoyers, Tejun Heo

Le vendredi 26 novembre 2010 à 15:09 -0600, Christoph Lameter a écrit :
...

> +# Emulate 'cmpxchg16b %gs:(%rsi)' except we return the result in
> +# al not via the ZF. Caller will access al to get result.
> +#
> +cmpxchg16b_local_emu:
> +	pushf
> +	cli
> +
> +	cmpq  %gs:(%rsi), %rax
> +	jne not_same
> +	cmpq %gs:8(%rsi), %rdx
> +	jne not_same
> +
> +	movq %rbx,  %gs:(%esi)

	(%rsi) instead of (%esi) ?

> +	movq %rcx, %gs:8(%esi)

	same here

> +
> +	popf
> +	mov $1, %al
> +	ret
> +
> + not_same:
> +	popf
> +	xor  %al,%al
> +	ret
> +
> +CFI_ENDPROC
> +ENDPROC(cmpxchg16b_local_emu)
> 




^ permalink raw reply	[flat|nested] 36+ messages in thread

* Re: [thisops uV2 02/10] vmstat: Optimize zone counter modifications through the use of this cpu operations
  2010-11-26 21:09 ` [thisops uV2 02/10] vmstat: Optimize zone counter modifications through the use of this cpu operations Christoph Lameter
@ 2010-11-27  8:00   ` Pekka Enberg
  2010-11-27 14:49   ` Mathieu Desnoyers
  1 sibling, 0 replies; 36+ messages in thread
From: Pekka Enberg @ 2010-11-27  8:00 UTC (permalink / raw)
  To: Christoph Lameter
  Cc: akpm, Pekka Enberg, linux-kernel, Eric Dumazet, Mathieu Desnoyers,
	Tejun Heo

On Fri, Nov 26, 2010 at 11:09 PM, Christoph Lameter <cl@linux.com> wrote:
> this cpu operations can be used to slightly optimize the function. The
> changes will avoid some address calculations and replace them with the
> use of the percpu segment register.
>
> If one would have this_cpu_inc_return and this_cpu_dec_return then it
> would be possible to optimize inc_zone_page_state and dec_zone_page_state even
> more.
>
> V1->V2:
>        - Fix __dec_zone_state overflow handling
>        - Use s8 variables for temporary storage.
>
> Signed-off-by: Christoph Lameter <cl@linux.com>

Reviewed-by: Pekka Enberg <penberg@kernel.org>

^ permalink raw reply	[flat|nested] 36+ messages in thread

* Re: [thisops uV2 04/10] x86: Support for this_cpu_add,sub,dec,inc_return
  2010-11-26 21:09 ` [thisops uV2 04/10] x86: Support " Christoph Lameter
@ 2010-11-27  8:06   ` Pekka Enberg
  2010-11-29 16:03     ` Christoph Lameter
  2010-11-27 15:00   ` Mathieu Desnoyers
  1 sibling, 1 reply; 36+ messages in thread
From: Pekka Enberg @ 2010-11-27  8:06 UTC (permalink / raw)
  To: Christoph Lameter
  Cc: akpm, Pekka Enberg, linux-kernel, Eric Dumazet, Mathieu Desnoyers,
	Tejun Heo

On Fri, Nov 26, 2010 at 11:09 PM, Christoph Lameter <cl@linux.com> wrote:
> @@ -300,6 +339,14 @@ do {                                                                       \
>  #define irqsafe_cpu_xor_2(pcp, val)    percpu_to_op("xor", (pcp), val)
>  #define irqsafe_cpu_xor_4(pcp, val)    percpu_to_op("xor", (pcp), val)
>
> +#ifndef CONFIG_M386

Use CONFIG_X86_32 here?

> +#define __this_cpu_add_return_1(pcp, val)      percpu_add_return_op((pcp), val)
> +#define __this_cpu_add_return_2(pcp, val)      percpu_add_return_op((pcp), val)
> +#define __this_cpu_add_return_4(pcp, val)      percpu_add_return_op((pcp), val)
> +#define this_cpu_add_return_1(pcp, val)                percpu_add_return_op((pcp), val)
> +#define this_cpu_add_return_2(pcp, val)                percpu_add_return_op((pcp), val)
> +#define this_cpu_add_return_4(pcp, val)                percpu_add_return_op((pcp), val)
> +#endif
>  /*
>  * Per cpu atomic 64 bit operations are only available under 64 bit.
>  * 32 bit must fall back to generic operations.

^ permalink raw reply	[flat|nested] 36+ messages in thread

* Re: [thisops uV2 05/10] x86: Use this_cpu_inc_return for nmi counter
  2010-11-26 21:09 ` [thisops uV2 05/10] x86: Use this_cpu_inc_return for nmi counter Christoph Lameter
@ 2010-11-27  8:07   ` Pekka Enberg
  2010-11-27 15:00   ` Mathieu Desnoyers
  1 sibling, 0 replies; 36+ messages in thread
From: Pekka Enberg @ 2010-11-27  8:07 UTC (permalink / raw)
  To: Christoph Lameter
  Cc: akpm, Pekka Enberg, linux-kernel, Eric Dumazet, Mathieu Desnoyers,
	Tejun Heo

On Fri, Nov 26, 2010 at 11:09 PM, Christoph Lameter <cl@linux.com> wrote:
> this_cpu_inc_return() saves us a memory access there.
>
> Reviewed-by: Tejun Heo <tj@kernel.org>
> Signed-off-by: Christoph Lameter <cl@linux.com>

Reviewed-by: Pekka Enberg <penberg@kernel.org>

^ permalink raw reply	[flat|nested] 36+ messages in thread

* Re: [thisops uV2 06/10] vmstat: Use this_cpu_inc_return for vm statistics
  2010-11-26 21:09 ` [thisops uV2 06/10] vmstat: Use this_cpu_inc_return for vm statistics Christoph Lameter
@ 2010-11-27  8:09   ` Pekka Enberg
  2010-11-29 16:04     ` Christoph Lameter
  0 siblings, 1 reply; 36+ messages in thread
From: Pekka Enberg @ 2010-11-27  8:09 UTC (permalink / raw)
  To: Christoph Lameter
  Cc: akpm, Pekka Enberg, linux-kernel, Eric Dumazet, Mathieu Desnoyers,
	Tejun Heo

On Fri, Nov 26, 2010 at 11:09 PM, Christoph Lameter <cl@linux.com> wrote:
> this_cpu_inc_return() saves us a memory access there. Code
> size does not change.
>
> V1->V2:
>        - Fixed the location of the __per_cpu pointer attributes
>        - Sparse checked
>
> Signed-off-by: Christoph Lameter <cl@linux.com>
>
> ---
>  mm/vmstat.c |   20 ++++++++------------
>  1 file changed, 8 insertions(+), 12 deletions(-)
>
> Index: linux-2.6/mm/vmstat.c
> ===================================================================
> --- linux-2.6.orig/mm/vmstat.c  2010-11-26 10:35:45.000000000 -0600
> +++ linux-2.6/mm/vmstat.c       2010-11-26 10:45:49.000000000 -0600
> @@ -167,8 +167,8 @@ static void refresh_zone_stat_thresholds
>  void __mod_zone_page_state(struct zone *zone, enum zone_stat_item item,
>                                int delta)
>  {
> -       struct per_cpu_pageset * __percpu pcp = zone->pageset;
> -       s8 * __percpu p = pcp->vm_stat_diff + item;
> +       struct per_cpu_pageset __percpu *pcp = zone->pageset;
> +       s8 __percpu *p = pcp->vm_stat_diff + item;
>        long x;
>        long t;
>

Shouldn't this hunk be in the other vmstat patch?

> @@ -223,13 +223,11 @@ EXPORT_SYMBOL(mod_zone_page_state);
>  */
>  void __inc_zone_state(struct zone *zone, enum zone_stat_item item)
>  {
> -       struct per_cpu_pageset * __percpu pcp = zone->pageset;
> -       s8 * __percpu p = pcp->vm_stat_diff + item;
> +       struct per_cpu_pageset __percpu *pcp = zone->pageset;
> +       s8 __percpu *p = pcp->vm_stat_diff + item;
>        s8 v, t;
>
> -       __this_cpu_inc(*p);
> -
> -       v = __this_cpu_read(*p);
> +       v = __this_cpu_inc_return(*p);
>        t = __this_cpu_read(pcp->stat_threshold);
>        if (unlikely(v > t)) {
>                s8 overstep = t >> 1;
> @@ -247,13 +245,11 @@ EXPORT_SYMBOL(__inc_zone_page_state);
>
>  void __dec_zone_state(struct zone *zone, enum zone_stat_item item)
>  {
> -       struct per_cpu_pageset * __percpu pcp = zone->pageset;
> -       s8 * __percpu p = pcp->vm_stat_diff + item;
> +       struct per_cpu_pageset __percpu *pcp = zone->pageset;
> +       s8 __percpu *p = pcp->vm_stat_diff + item;
>        s8 v, t;
>
> -       __this_cpu_dec(*p);
> -
> -       v = __this_cpu_read(*p);
> +       v = __this_cpu_dec_return(*p);
>        t = __this_cpu_read(pcp->stat_threshold);
>        if (unlikely(v < - t)) {
>                s8 overstep = t >> 1;

For the rest of this patch:

Reviewed-by: Pekka Enberg <penberg@kernel.org>

^ permalink raw reply	[flat|nested] 36+ messages in thread

* Re: [thisops uV2 01/10] percpucounter: Optimize __percpu_counter_add a bit through the use of this_cpu() options.
  2010-11-26 21:09 ` [thisops uV2 01/10] percpucounter: Optimize __percpu_counter_add a bit through the use of this_cpu() options Christoph Lameter
@ 2010-11-27 14:42   ` Mathieu Desnoyers
  0 siblings, 0 replies; 36+ messages in thread
From: Mathieu Desnoyers @ 2010-11-27 14:42 UTC (permalink / raw)
  To: Christoph Lameter
  Cc: akpm, Pekka Enberg, linux-kernel, Eric Dumazet, Tejun Heo

* Christoph Lameter (cl@linux.com) wrote:
> The this_cpu_* options can be used to optimize __percpu_counter_add a bit. Avoids
> some address arithmetic and saves 12 bytes.
> 

Reviewed-by: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>

-- 
Mathieu Desnoyers
Operating System Efficiency R&D Consultant
EfficiOS Inc.
http://www.efficios.com

^ permalink raw reply	[flat|nested] 36+ messages in thread

* Re: [thisops uV2 02/10] vmstat: Optimize zone counter modifications through the use of this cpu operations
  2010-11-26 21:09 ` [thisops uV2 02/10] vmstat: Optimize zone counter modifications through the use of this cpu operations Christoph Lameter
  2010-11-27  8:00   ` Pekka Enberg
@ 2010-11-27 14:49   ` Mathieu Desnoyers
  2010-11-29 16:16     ` Christoph Lameter
  1 sibling, 1 reply; 36+ messages in thread
From: Mathieu Desnoyers @ 2010-11-27 14:49 UTC (permalink / raw)
  To: Christoph Lameter
  Cc: akpm, Pekka Enberg, linux-kernel, Eric Dumazet, Tejun Heo

* Christoph Lameter (cl@linux.com) wrote:
> this cpu operations can be used to slightly optimize the function. The
> changes will avoid some address calculations and replace them with the
> use of the percpu segment register.
> 
> If one would have this_cpu_inc_return and this_cpu_dec_return then it
> would be possible to optimize inc_zone_page_state and dec_zone_page_state even
> more.

Then we might want to directly target the implementation with
this_cpu_add_return/this_cpu_sub_return (you implement these in patch 03), which
would not need to disable preemption on the fast path. I think we already
discussed this in the past. The reason eludes me at the moment, but I remember
discussing that changing the increment/decrement delta to the nearest powers of
two would let us deal with overflow cleanly. But it's probably too early in the
morning for me to wrap my head around the issue at the moment.

Thanks,

Mathieu

> 
> V1->V2:
> 	- Fix __dec_zone_state overflow handling
> 	- Use s8 variables for temporary storage.
> 
> Signed-off-by: Christoph Lameter <cl@linux.com>
> 
> ---
>  mm/vmstat.c |   56 ++++++++++++++++++++++++++++++++------------------------
>  1 file changed, 32 insertions(+), 24 deletions(-)
> 
> Index: linux-2.6/mm/vmstat.c
> ===================================================================
> --- linux-2.6.orig/mm/vmstat.c	2010-11-24 13:38:34.000000000 -0600
> +++ linux-2.6/mm/vmstat.c	2010-11-24 15:03:08.000000000 -0600
> @@ -167,18 +167,20 @@ static void refresh_zone_stat_thresholds
>  void __mod_zone_page_state(struct zone *zone, enum zone_stat_item item,
>  				int delta)
>  {
> -	struct per_cpu_pageset *pcp = this_cpu_ptr(zone->pageset);
> -
> -	s8 *p = pcp->vm_stat_diff + item;
> +	struct per_cpu_pageset * __percpu pcp = zone->pageset;
> +	s8 * __percpu p = pcp->vm_stat_diff + item;
>  	long x;
> +	long t;
> +
> +	x = delta + __this_cpu_read(*p);
>  
> -	x = delta + *p;
> +	t = __this_cpu_read(pcp->stat_threshold);
>  
> -	if (unlikely(x > pcp->stat_threshold || x < -pcp->stat_threshold)) {
> +	if (unlikely(x > t || x < -t)) {
>  		zone_page_state_add(x, zone, item);
>  		x = 0;
>  	}
> -	*p = x;
> +	__this_cpu_write(*p, x);
>  }
>  EXPORT_SYMBOL(__mod_zone_page_state);
>  
> @@ -221,16 +223,19 @@ EXPORT_SYMBOL(mod_zone_page_state);
>   */
>  void __inc_zone_state(struct zone *zone, enum zone_stat_item item)
>  {
> -	struct per_cpu_pageset *pcp = this_cpu_ptr(zone->pageset);
> -	s8 *p = pcp->vm_stat_diff + item;
> -
> -	(*p)++;
> +	struct per_cpu_pageset * __percpu pcp = zone->pageset;
> +	s8 * __percpu p = pcp->vm_stat_diff + item;
> +	s8 v, t;
> +
> +	__this_cpu_inc(*p);
> +
> +	v = __this_cpu_read(*p);
> +	t = __this_cpu_read(pcp->stat_threshold);
> +	if (unlikely(v > t)) {
> +		s8 overstep = t >> 1;
>  
> -	if (unlikely(*p > pcp->stat_threshold)) {
> -		int overstep = pcp->stat_threshold / 2;
> -
> -		zone_page_state_add(*p + overstep, zone, item);
> -		*p = -overstep;
> +		zone_page_state_add(v + overstep, zone, item);
> +		__this_cpu_write(*p, - overstep);
>  	}
>  }
>  
> @@ -242,16 +247,19 @@ EXPORT_SYMBOL(__inc_zone_page_state);
>  
>  void __dec_zone_state(struct zone *zone, enum zone_stat_item item)
>  {
> -	struct per_cpu_pageset *pcp = this_cpu_ptr(zone->pageset);
> -	s8 *p = pcp->vm_stat_diff + item;
> -
> -	(*p)--;
> -
> -	if (unlikely(*p < - pcp->stat_threshold)) {
> -		int overstep = pcp->stat_threshold / 2;
> +	struct per_cpu_pageset * __percpu pcp = zone->pageset;
> +	s8 * __percpu p = pcp->vm_stat_diff + item;
> +	s8 v, t;
> +
> +	__this_cpu_dec(*p);
> +
> +	v = __this_cpu_read(*p);
> +	t = __this_cpu_read(pcp->stat_threshold);
> +	if (unlikely(v < - t)) {
> +		s8 overstep = t >> 1;
>  
> -		zone_page_state_add(*p - overstep, zone, item);
> -		*p = overstep;
> +		zone_page_state_add(v - overstep, zone, item);
> +		__this_cpu_write(*p, overstep);
>  	}
>  }
>  
> 

-- 
Mathieu Desnoyers
Operating System Efficiency R&D Consultant
EfficiOS Inc.
http://www.efficios.com

^ permalink raw reply	[flat|nested] 36+ messages in thread

* Re: [thisops uV2 03/10] percpu: Generic support for this_cpu_add,sub,dec,inc_return
  2010-11-26 21:09 ` [thisops uV2 03/10] percpu: Generic support for this_cpu_add,sub,dec,inc_return Christoph Lameter
@ 2010-11-27 14:58   ` Mathieu Desnoyers
  2010-11-29 16:08     ` Christoph Lameter
  0 siblings, 1 reply; 36+ messages in thread
From: Mathieu Desnoyers @ 2010-11-27 14:58 UTC (permalink / raw)
  To: Christoph Lameter
  Cc: akpm, Pekka Enberg, linux-kernel, Eric Dumazet, Tejun Heo

* Christoph Lameter (cl@linux.com) wrote:
> Introduce generic support for this_cpu_add_return etc.
> 
> The fallback is to realize these operations with __this_cpu_ops.
> 
> Reviewed-by: Tejun Heo <tj@kernel.org>
> Signed-off-by: Christoph Lameter <cl@linux.com>
> 
> ---
>  include/linux/percpu.h |   70 +++++++++++++++++++++++++++++++++++++++++++++++++
>  1 file changed, 70 insertions(+)
> 
> Index: linux-2.6/include/linux/percpu.h
> ===================================================================
> --- linux-2.6.orig/include/linux/percpu.h	2010-11-23 17:29:46.000000000 -0600
> +++ linux-2.6/include/linux/percpu.h	2010-11-23 17:31:14.000000000 -0600
> @@ -240,6 +240,20 @@ extern void __bad_size_call_parameter(vo
>  	pscr_ret__;							\
>  })
>  
> +#define __pcpu_size_call_return2(stem, variable, ...)			\
> +({	typeof(variable) pscr_ret__;					\

isn't it usual to do ?

( {                               \
  typeof(variable) pscr_ret__;    \

instead ?


> +	__verify_pcpu_ptr(&(variable));					\
> +	switch(sizeof(variable)) {					\
> +	case 1: pscr_ret__ = stem##1(variable, __VA_ARGS__);break;	\

;break; at the EOL seems a bit odd. Maybe moving it to the next line ?

> +	case 2: pscr_ret__ = stem##2(variable, __VA_ARGS__);break;	\
> +	case 4: pscr_ret__ = stem##4(variable, __VA_ARGS__);break;	\
> +	case 8: pscr_ret__ = stem##8(variable, __VA_ARGS__);break;	\
> +	default:							\
> +		__bad_size_call_parameter();break;			\
> +	}								\
> +	pscr_ret__;							\
> +})
> +
>  #define __pcpu_size_call(stem, variable, ...)				\
>  do {									\
>  	__verify_pcpu_ptr(&(variable));					\
> @@ -529,6 +543,62 @@ do {									\
>  # define __this_cpu_xor(pcp, val)	__pcpu_size_call(__this_cpu_xor_, (pcp), (val))
>  #endif
>  
> +#define _this_cpu_generic_add_return(pcp, val)				\
> +({	typeof(pcp) ret__;						\

add newline after opening ({    \  ?

> +	preempt_disable();						\
> +	__this_cpu_add((pcp), val);					\

Hrm, you are inconsistent between your macros. Here you use "(pcp), " but above:
"(variable, ". I think the extra () are not needed in this case, so you might
want to consider removing these from (pcp).

> +	ret__ = __this_cpu_read((pcp));					\

Same here.

> +	preempt_enable();						\
> +	ret__;								\
> +})
> +
> +#ifndef this_cpu_add_return
> +# ifndef this_cpu_add_return_1
> +#  define this_cpu_add_return_1(pcp, val)	_this_cpu_generic_add_return(pcp, val)
> +# endif
> +# ifndef this_cpu_add_return_2
> +#  define this_cpu_add_return_2(pcp, val)	_this_cpu_generic_add_return(pcp, val)
> +# endif
> +# ifndef this_cpu_add_return_4
> +#  define this_cpu_add_return_4(pcp, val)	_this_cpu_generic_add_return(pcp, val)
> +# endif
> +# ifndef this_cpu_add_return_8
> +#  define this_cpu_add_return_8(pcp, val)	_this_cpu_generic_add_return(pcp, val)
> +# endif
> +# define this_cpu_add_return(pcp, val)	__pcpu_size_call_return2(this_cpu_add_return_, (pcp), val)

Same here.

> +#endif
> +
> +#define this_cpu_sub_return(pcp, val)	this_cpu_add_return(pcp, -(val))
> +#define this_cpu_inc_return(pcp)	this_cpu_add_return(pcp, 1)
> +#define this_cpu_dec_return(pcp)	this_cpu_add_return(pcp, -1)
> +
> +#define __this_cpu_generic_add_return(pcp, val)				\
> +({	typeof(pcp) ret__;						\
> +	__this_cpu_add((pcp), val);					\
> +	ret__ = __this_cpu_read((pcp));					\

Same for above 2 lines.

> +	ret__;								\
> +})
> +
> +#ifndef __this_cpu_add_return
> +# ifndef __this_cpu_add_return_1
> +#  define __this_cpu_add_return_1(pcp, val)	__this_cpu_generic_add_return(pcp, val)
> +# endif
> +# ifndef __this_cpu_add_return_2
> +#  define __this_cpu_add_return_2(pcp, val)	__this_cpu_generic_add_return(pcp, val)
> +# endif
> +# ifndef __this_cpu_add_return_4
> +#  define __this_cpu_add_return_4(pcp, val)	__this_cpu_generic_add_return(pcp, val)
> +# endif
> +# ifndef __this_cpu_add_return_8
> +#  define __this_cpu_add_return_8(pcp, val)	__this_cpu_generic_add_return(pcp, val)
> +# endif
> +# define __this_cpu_add_return(pcp, val)	__pcpu_size_call_return2(this_cpu_add_return_, (pcp), val)

Same here.

Other than that:

Reviewed-by: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>

Thanks,

Mathieu

> +#endif
> +
> +#define __this_cpu_sub_return(pcp, val)	this_cpu_add_return(pcp, -(val))
> +#define __this_cpu_inc_return(pcp)	this_cpu_add_return(pcp, 1)
> +#define __this_cpu_dec_return(pcp)	this_cpu_add_return(pcp, -1)
> +
>  /*
>   * IRQ safe versions of the per cpu RMW operations. Note that these operations
>   * are *not* safe against modification of the same variable from another
> 

-- 
Mathieu Desnoyers
Operating System Efficiency R&D Consultant
EfficiOS Inc.
http://www.efficios.com

^ permalink raw reply	[flat|nested] 36+ messages in thread

* Re: [thisops uV2 04/10] x86: Support for this_cpu_add,sub,dec,inc_return
  2010-11-26 21:09 ` [thisops uV2 04/10] x86: Support " Christoph Lameter
  2010-11-27  8:06   ` Pekka Enberg
@ 2010-11-27 15:00   ` Mathieu Desnoyers
  2010-11-29 16:31     ` Christoph Lameter
  1 sibling, 1 reply; 36+ messages in thread
From: Mathieu Desnoyers @ 2010-11-27 15:00 UTC (permalink / raw)
  To: Christoph Lameter
  Cc: akpm, Pekka Enberg, linux-kernel, Eric Dumazet, Tejun Heo

* Christoph Lameter (cl@linux.com) wrote:
> Supply an implementation for x86 in order to generate more efficient code.
> 
> Signed-off-by: Christoph Lameter <cl@linux.com>
> 
> ---
>  arch/x86/include/asm/percpu.h |   50 ++++++++++++++++++++++++++++++++++++++++++
>  1 file changed, 50 insertions(+)
> 
> Index: linux-2.6/arch/x86/include/asm/percpu.h
> ===================================================================
> --- linux-2.6.orig/arch/x86/include/asm/percpu.h	2010-11-23 16:35:19.000000000 -0600
> +++ linux-2.6/arch/x86/include/asm/percpu.h	2010-11-23 16:36:05.000000000 -0600
> @@ -177,6 +177,45 @@ do {									\
>  	}								\
>  } while (0)
>  
> +
> +/*
> + * Add return operation
> + */
> +#define percpu_add_return_op(var, val)					\
> +({									\
> +	typedef typeof(var) pao_T__;					\
> +	typeof(var) pfo_ret__ = val;					\
> +	if (0) {							\
> +		pao_T__ pao_tmp__;					\
> +		pao_tmp__ = (val);					\
> +		(void)pao_tmp__;					\
> +	}								\

OK, I'm dumb: why is the above needed ?

> +	switch (sizeof(var)) {						\
> +	case 1:								\
> +		asm("xaddb %0, "__percpu_arg(1)				\
> +			    : "+q" (pfo_ret__), "+m" (var)		\
> +			    : : "memory");				\
> +		break;							\
> +	case 2:								\
> +		asm("xaddw %0, "__percpu_arg(1)				\
> +			    : "+r" (pfo_ret__), "+m" (var)		\
> +			    : : "memory");				\
> +		break;							\
> +	case 4:								\
> +		asm("xaddl %0, "__percpu_arg(1)				\
> +			    : "+r"(pfo_ret__), "+m" (var)		\
> +			    : : "memory");				\
> +		break;							\
> +	case 8:								\
> +		asm("xaddq %0, "__percpu_arg(1)				\
> +			    : "+re" (pfo_ret__),  "+m" (var)		\
> +			    : : "memory");				\
> +		break;							\
> +	default: __bad_percpu_size();					\
> +	}								\
> +	pfo_ret__ + (val);						\
> +})
> +
>  #define percpu_from_op(op, var, constraint)		\
>  ({							\
>  	typeof(var) pfo_ret__;				\
> @@ -300,6 +339,14 @@ do {									\
>  #define irqsafe_cpu_xor_2(pcp, val)	percpu_to_op("xor", (pcp), val)
>  #define irqsafe_cpu_xor_4(pcp, val)	percpu_to_op("xor", (pcp), val)

(pcp) -> pcp. Same for other similar cases below.

Thanks,

Mathieu

>  
> +#ifndef CONFIG_M386
> +#define __this_cpu_add_return_1(pcp, val)	percpu_add_return_op((pcp), val)
> +#define __this_cpu_add_return_2(pcp, val)	percpu_add_return_op((pcp), val)
> +#define __this_cpu_add_return_4(pcp, val)	percpu_add_return_op((pcp), val)
> +#define this_cpu_add_return_1(pcp, val)		percpu_add_return_op((pcp), val)
> +#define this_cpu_add_return_2(pcp, val)		percpu_add_return_op((pcp), val)
> +#define this_cpu_add_return_4(pcp, val)		percpu_add_return_op((pcp), val)
> +#endif
>  /*
>   * Per cpu atomic 64 bit operations are only available under 64 bit.
>   * 32 bit must fall back to generic operations.
> @@ -324,6 +371,9 @@ do {									\
>  #define irqsafe_cpu_or_8(pcp, val)	percpu_to_op("or", (pcp), val)
>  #define irqsafe_cpu_xor_8(pcp, val)	percpu_to_op("xor", (pcp), val)
>  
> +#define __this_cpu_add_return_8(pcp, val)	percpu_add_return_op((pcp), val)
> +#define this_cpu_add_return_8(pcp, val)	percpu_add_return_op((pcp), val)
> +
>  #endif
>  
>  /* This is not atomic against other CPUs -- CPU preemption needs to be off */
> 

-- 
Mathieu Desnoyers
Operating System Efficiency R&D Consultant
EfficiOS Inc.
http://www.efficios.com

^ permalink raw reply	[flat|nested] 36+ messages in thread

* Re: [thisops uV2 05/10] x86: Use this_cpu_inc_return for nmi counter
  2010-11-26 21:09 ` [thisops uV2 05/10] x86: Use this_cpu_inc_return for nmi counter Christoph Lameter
  2010-11-27  8:07   ` Pekka Enberg
@ 2010-11-27 15:00   ` Mathieu Desnoyers
  1 sibling, 0 replies; 36+ messages in thread
From: Mathieu Desnoyers @ 2010-11-27 15:00 UTC (permalink / raw)
  To: Christoph Lameter
  Cc: akpm, Pekka Enberg, linux-kernel, Eric Dumazet, Tejun Heo

* Christoph Lameter (cl@linux.com) wrote:
> this_cpu_inc_return() saves us a memory access there.
> 

Reviewed-by: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>

> Reviewed-by: Tejun Heo <tj@kernel.org>
> Signed-off-by: Christoph Lameter <cl@linux.com>
> 
> ---
>  arch/x86/kernel/apic/nmi.c |    3 +--
>  1 file changed, 1 insertion(+), 2 deletions(-)
> 
> Index: linux-2.6/arch/x86/kernel/apic/nmi.c
> ===================================================================
> --- linux-2.6.orig/arch/x86/kernel/apic/nmi.c	2010-11-23 16:35:19.000000000 -0600
> +++ linux-2.6/arch/x86/kernel/apic/nmi.c	2010-11-23 16:38:29.000000000 -0600
> @@ -432,8 +432,7 @@ nmi_watchdog_tick(struct pt_regs *regs,
>  		 * Ayiee, looks like this CPU is stuck ...
>  		 * wait a few IRQs (5 seconds) before doing the oops ...
>  		 */
> -		__this_cpu_inc(alert_counter);
> -		if (__this_cpu_read(alert_counter) == 5 * nmi_hz)
> +		if (__this_cpu_inc_return(alert_counter) == 5 * nmi_hz)
>  			/*
>  			 * die_nmi will return ONLY if NOTIFY_STOP happens..
>  			 */
> 

-- 
Mathieu Desnoyers
Operating System Efficiency R&D Consultant
EfficiOS Inc.
http://www.efficios.com

^ permalink raw reply	[flat|nested] 36+ messages in thread

* Re: [thisops uV2 09/10] x86: this_cpu_cmpxchg and this_cpu_cmpxchg_double operations
  2010-11-26 21:09 ` [thisops uV2 09/10] x86: this_cpu_cmpxchg and this_cpu_cmpxchg_double operations Christoph Lameter
  2010-11-27  6:30   ` Eric Dumazet
@ 2010-11-27 15:20   ` Mathieu Desnoyers
  2010-11-29 16:11     ` Christoph Lameter
  1 sibling, 1 reply; 36+ messages in thread
From: Mathieu Desnoyers @ 2010-11-27 15:20 UTC (permalink / raw)
  To: Christoph Lameter
  Cc: akpm, Pekka Enberg, linux-kernel, Eric Dumazet, Tejun Heo

* Christoph Lameter (cl@linux.com) wrote:
> Provide support as far as the hardware capabilities of the x86 cpus
> allow.
> 
> V1->V2:
> 	- Mark %rdx clobbering during cmpxchg16b
> 	- Provide emulation of cmpxchg16b for early AMD processors
> 
> Signed-off-by: Christoph Lameter <cl@linux.com>

[...]
> +/*
> + * Something is screwed up with alternate instruction creation. This one
> + * fails with a mysterious asm error about a byte val > 255.

Hrm, "interesting" ;) Does each of these work if expressed directly in inline
assembly without the alternatives ? Can you give us more information about the
failure ?

> + */
> +#define percpu_cmpxchg16b(pcp, o1, o2, n1, n2)			\
> +({									\
> +	char __ret;							\
> +	typeof(o1) __o1 = o1;						\
> +	typeof(o1) __n1 = n1;						\
> +	typeof(o2) __o2 = o2;						\
> +	typeof(o2) __n2 = n2;						\
> +	typeof(o2) __dummy;						\
> +	VM_BUG_ON(((unsigned long)pcp) % 16);				\

Restricting typing on "pcp" at build time might be more appropriate. E.g.
creating a "struct doublecas" that would be forcefully aligned on 16 bytes. We
could check that with __builtin_types_compatible_p and generate a build failure
if necessary.

Thanks,

Mathieu

> +	alternative_io("call cmpxchg16b_local\n\t" P6_NOP4,		\
> +			"cmpxchg16b %%gs:(%%rsi)\n\tsetz %0\n\t",	\
> +			X86_FEATURE_CX16,				\
> +		    	ASM_OUTPUT2("=a"(__ret), "=d"(__dummy)),	\
> +		        "S" (pcp), "b"(__n1), "c"(__n2),		\
> +			 "a"(__o1), "d"(__o2));				\
> +	__ret;								\
> +})

-- 
Mathieu Desnoyers
Operating System Efficiency R&D Consultant
EfficiOS Inc.
http://www.efficios.com

^ permalink raw reply	[flat|nested] 36+ messages in thread

* Re: [thisops uV2 04/10] x86: Support for this_cpu_add,sub,dec,inc_return
  2010-11-27  8:06   ` Pekka Enberg
@ 2010-11-29 16:03     ` Christoph Lameter
  0 siblings, 0 replies; 36+ messages in thread
From: Christoph Lameter @ 2010-11-29 16:03 UTC (permalink / raw)
  To: Pekka Enberg
  Cc: akpm, Pekka Enberg, linux-kernel, Eric Dumazet, Mathieu Desnoyers,
	Tejun Heo

[-- Attachment #1: Type: TEXT/PLAIN, Size: 535 bytes --]

On Sat, 27 Nov 2010, Pekka Enberg wrote:

> On Fri, Nov 26, 2010 at 11:09 PM, Christoph Lameter <cl@linux.com> wrote:
> > @@ -300,6 +339,14 @@ do {                                                                       \
> >  #define irqsafe_cpu_xor_2(pcp, val)    percpu_to_op("xor", (pcp), val)
> >  #define irqsafe_cpu_xor_4(pcp, val)    percpu_to_op("xor", (pcp), val)
> >
> > +#ifndef CONFIG_M386
>
> Use CONFIG_X86_32 here?

This is a section specifically for non 386. 386 does not support certain
atomic operations.

^ permalink raw reply	[flat|nested] 36+ messages in thread

* Re: [thisops uV2 06/10] vmstat: Use this_cpu_inc_return for vm statistics
  2010-11-27  8:09   ` Pekka Enberg
@ 2010-11-29 16:04     ` Christoph Lameter
  0 siblings, 0 replies; 36+ messages in thread
From: Christoph Lameter @ 2010-11-29 16:04 UTC (permalink / raw)
  To: Pekka Enberg
  Cc: akpm, Pekka Enberg, linux-kernel, Eric Dumazet, Mathieu Desnoyers,
	Tejun Heo

[-- Attachment #1: Type: TEXT/PLAIN, Size: 651 bytes --]

On Sat, 27 Nov 2010, Pekka Enberg wrote:

> > @@ -167,8 +167,8 @@ static void refresh_zone_stat_thresholds
> >  void __mod_zone_page_state(struct zone *zone, enum zone_stat_item item,
> >                                int delta)
> >  {
> > -       struct per_cpu_pageset * __percpu pcp = zone->pageset;
> > -       s8 * __percpu p = pcp->vm_stat_diff + item;
> > +       struct per_cpu_pageset __percpu *pcp = zone->pageset;
> > +       s8 __percpu *p = pcp->vm_stat_diff + item;
> >        long x;
> >        long t;
> >
>
> Shouldn't this hunk be in the other vmstat patch?

Right. I did not run sparse with the first patch alone.

^ permalink raw reply	[flat|nested] 36+ messages in thread

* Re: [thisops uV2 03/10] percpu: Generic support for this_cpu_add,sub,dec,inc_return
  2010-11-27 14:58   ` Mathieu Desnoyers
@ 2010-11-29 16:08     ` Christoph Lameter
  0 siblings, 0 replies; 36+ messages in thread
From: Christoph Lameter @ 2010-11-29 16:08 UTC (permalink / raw)
  To: Mathieu Desnoyers
  Cc: akpm, Pekka Enberg, linux-kernel, Eric Dumazet, Tejun Heo

On Sat, 27 Nov 2010, Mathieu Desnoyers wrote:

> >
> > +#define __pcpu_size_call_return2(stem, variable, ...)			\
> > +({	typeof(variable) pscr_ret__;					\
>
> isn't it usual to do ?
>
> ( {                               \
>   typeof(variable) pscr_ret__;    \
>
> instead ?

Not in macros like this as far as I can tell but I can change it.

> > +	switch(sizeof(variable)) {					\
> > +	case 1: pscr_ret__ = stem##1(variable, __VA_ARGS__);break;	\
>
> ;break; at the EOL seems a bit odd. Maybe moving it to the next line ?

Again this seems to be common in macros like this.

>
> > +	preempt_disable();						\
> > +	__this_cpu_add((pcp), val);					\
>
> Hrm, you are inconsistent between your macros. Here you use "(pcp), " but above:
> "(variable, ". I think the extra () are not needed in this case, so you might
> want to consider removing these from (pcp).

Hmmm... I will try to straighten that out.


^ permalink raw reply	[flat|nested] 36+ messages in thread

* Re: [thisops uV2 09/10] x86: this_cpu_cmpxchg and this_cpu_cmpxchg_double operations
  2010-11-27 15:20   ` Mathieu Desnoyers
@ 2010-11-29 16:11     ` Christoph Lameter
  0 siblings, 0 replies; 36+ messages in thread
From: Christoph Lameter @ 2010-11-29 16:11 UTC (permalink / raw)
  To: Mathieu Desnoyers
  Cc: akpm, Pekka Enberg, linux-kernel, Eric Dumazet, Tejun Heo

On Sat, 27 Nov 2010, Mathieu Desnoyers wrote:

> > + */
> > +#define percpu_cmpxchg16b(pcp, o1, o2, n1, n2)			\
> > +({									\
> > +	char __ret;							\
> > +	typeof(o1) __o1 = o1;						\
> > +	typeof(o1) __n1 = n1;						\
> > +	typeof(o2) __o2 = o2;						\
> > +	typeof(o2) __n2 = n2;						\
> > +	typeof(o2) __dummy;						\
> > +	VM_BUG_ON(((unsigned long)pcp) % 16);				\
>
> Restricting typing on "pcp" at build time might be more appropriate. E.g.
> creating a "struct doublecas" that would be forcefully aligned on 16 bytes. We
> could check that with __builtin_types_compatible_p and generate a build failure
> if necessary.

That would in turn restrict the types which can be used with
cmpxchg_double. The existing slub implementation already uses a mixture of
pointer and unsigned long. And there are 32 bit and 64 bit variants
already.


^ permalink raw reply	[flat|nested] 36+ messages in thread

* Re: [thisops uV2 02/10] vmstat: Optimize zone counter modifications through the use of this cpu operations
  2010-11-27 14:49   ` Mathieu Desnoyers
@ 2010-11-29 16:16     ` Christoph Lameter
  2010-11-29 17:13       ` Christoph Lameter
  0 siblings, 1 reply; 36+ messages in thread
From: Christoph Lameter @ 2010-11-29 16:16 UTC (permalink / raw)
  To: Mathieu Desnoyers
  Cc: akpm, Pekka Enberg, linux-kernel, Eric Dumazet, Tejun Heo

On Sat, 27 Nov 2010, Mathieu Desnoyers wrote:

> * Christoph Lameter (cl@linux.com) wrote:
> > this cpu operations can be used to slightly optimize the function. The
> > changes will avoid some address calculations and replace them with the
> > use of the percpu segment register.
> >
> > If one would have this_cpu_inc_return and this_cpu_dec_return then it
> > would be possible to optimize inc_zone_page_state and dec_zone_page_state even
> > more.
>
> Then we might want to directly target the implementation with
> this_cpu_add_return/this_cpu_sub_return (you implement these in patch 03), which
> would not need to disable preemption on the fast path. I think we already
> discussed this in the past. The reason eludes me at the moment, but I remember
> discussing that changing the increment/decrement delta to the nearest powers of
> two would let us deal with overflow cleanly. But it's probably too early in the
> morning for me to wrap my head around the issue at the moment.

We still would need to disable preemption because multiple this_cpu_ops
are needed. The only way to avoid preemption would be if the modifications
to the differential and the adding to the per zone counters would be
per cpu atomic.



^ permalink raw reply	[flat|nested] 36+ messages in thread

* Re: [thisops uV2 04/10] x86: Support for this_cpu_add,sub,dec,inc_return
  2010-11-27 15:00   ` Mathieu Desnoyers
@ 2010-11-29 16:31     ` Christoph Lameter
  2010-11-29 18:33       ` Mathieu Desnoyers
  0 siblings, 1 reply; 36+ messages in thread
From: Christoph Lameter @ 2010-11-29 16:31 UTC (permalink / raw)
  To: Mathieu Desnoyers
  Cc: akpm, Pekka Enberg, linux-kernel, Eric Dumazet, Tejun Heo

On Sat, 27 Nov 2010, Mathieu Desnoyers wrote:

> > +/*
> > + * Add return operation
> > + */
> > +#define percpu_add_return_op(var, val)					\
> > +({									\
> > +	typedef typeof(var) pao_T__;					\
> > +	typeof(var) pfo_ret__ = val;					\
> > +	if (0) {							\
> > +		pao_T__ pao_tmp__;					\
> > +		pao_tmp__ = (val);					\
> > +		(void)pao_tmp__;					\
> > +	}								\
>
> OK, I'm dumb: why is the above needed ?

Ensure that the compiler agrees that  *var and val are compatible. Taken
over from percpu_add_op().


^ permalink raw reply	[flat|nested] 36+ messages in thread

* Re: [thisops uV2 02/10] vmstat: Optimize zone counter modifications through the use of this cpu operations
  2010-11-29 16:16     ` Christoph Lameter
@ 2010-11-29 17:13       ` Christoph Lameter
  2010-11-29 19:28         ` Mathieu Desnoyers
  0 siblings, 1 reply; 36+ messages in thread
From: Christoph Lameter @ 2010-11-29 17:13 UTC (permalink / raw)
  To: Mathieu Desnoyers
  Cc: akpm, Pekka Enberg, linux-kernel, Eric Dumazet, Tejun Heo

We could do this with local cmpxchgs like in the following patch. This
would avoid preemption disable and interrupt disable (at least on x86).
Trouble is how do we make this fit for architectures that do not have
cmpxchg?


Index: linux-2.6/mm/vmstat.c
===================================================================
--- linux-2.6.orig/mm/vmstat.c	2010-11-29 10:58:52.000000000 -0600
+++ linux-2.6/mm/vmstat.c	2010-11-29 11:11:34.000000000 -0600
@@ -169,18 +169,23 @@ void __mod_zone_page_state(struct zone *
 {
 	struct per_cpu_pageset __percpu *pcp = zone->pageset;
 	s8 __percpu *p = pcp->vm_stat_diff + item;
-	long x;
-	long t;
+	long o, n, t, z;

-	x = delta + __this_cpu_read(*p);
+	do {
+		z = 0;
+		t = this_cpu_read(pcp->stat_threshold);
+		o = this_cpu_read(*p);
+		n = delta + o;
+
+		if (n > t || n < -t) {
+			/* Overflow must be added to zone counters */
+			z = n;
+			n = 0;
+		}
+	} while (o != n && this_cpu_cmpxchg(*p, o, n) != o);

-	t = __this_cpu_read(pcp->stat_threshold);
-
-	if (unlikely(x > t || x < -t)) {
-		zone_page_state_add(x, zone, item);
-		x = 0;
-	}
-	__this_cpu_write(*p, x);
+	if (z)
+		zone_page_state_add(z, zone, item);
 }
 EXPORT_SYMBOL(__mod_zone_page_state);

@@ -190,11 +195,7 @@ EXPORT_SYMBOL(__mod_zone_page_state);
 void mod_zone_page_state(struct zone *zone, enum zone_stat_item item,
 					int delta)
 {
-	unsigned long flags;
-
-	local_irq_save(flags);
 	__mod_zone_page_state(zone, item, delta);
-	local_irq_restore(flags);
 }
 EXPORT_SYMBOL(mod_zone_page_state);



^ permalink raw reply	[flat|nested] 36+ messages in thread

* Re: [thisops uV2 04/10] x86: Support for this_cpu_add,sub,dec,inc_return
  2010-11-29 16:31     ` Christoph Lameter
@ 2010-11-29 18:33       ` Mathieu Desnoyers
  2010-11-29 18:54         ` Christoph Lameter
  0 siblings, 1 reply; 36+ messages in thread
From: Mathieu Desnoyers @ 2010-11-29 18:33 UTC (permalink / raw)
  To: Christoph Lameter
  Cc: akpm, Pekka Enberg, linux-kernel, Eric Dumazet, Tejun Heo

* Christoph Lameter (cl@linux.com) wrote:
> On Sat, 27 Nov 2010, Mathieu Desnoyers wrote:
> 
> > > +/*
> > > + * Add return operation
> > > + */
> > > +#define percpu_add_return_op(var, val)					\
> > > +({									\
> > > +	typedef typeof(var) pao_T__;					\
> > > +	typeof(var) pfo_ret__ = val;					\
> > > +	if (0) {							\
> > > +		pao_T__ pao_tmp__;					\
> > > +		pao_tmp__ = (val);					\
> > > +		(void)pao_tmp__;					\
> > > +	}								\
> >
> > OK, I'm dumb: why is the above needed ?
> 
> Ensure that the compiler agrees that  *var and val are compatible. Taken
> over from percpu_add_op().

Isn't that the purpose of __builtin_types_compatible_p(t1, t2) ?

Mathieu

-- 
Mathieu Desnoyers
Operating System Efficiency R&D Consultant
EfficiOS Inc.
http://www.efficios.com

^ permalink raw reply	[flat|nested] 36+ messages in thread

* Re: [thisops uV2 04/10] x86: Support for this_cpu_add,sub,dec,inc_return
  2010-11-29 18:33       ` Mathieu Desnoyers
@ 2010-11-29 18:54         ` Christoph Lameter
  2010-11-29 19:22           ` Mathieu Desnoyers
  0 siblings, 1 reply; 36+ messages in thread
From: Christoph Lameter @ 2010-11-29 18:54 UTC (permalink / raw)
  To: Mathieu Desnoyers
  Cc: akpm, Pekka Enberg, linux-kernel, Eric Dumazet, Tejun Heo

On Mon, 29 Nov 2010, Mathieu Desnoyers wrote:

> > > > +#define percpu_add_return_op(var, val)					\
> > > > +({									\
> > > > +	typedef typeof(var) pao_T__;					\
> > > > +	typeof(var) pfo_ret__ = val;					\
> > > > +	if (0) {							\
> > > > +		pao_T__ pao_tmp__;					\
> > > > +		pao_tmp__ = (val);					\
> > > > +		(void)pao_tmp__;					\
> > > > +	}								\
> > >
> > > OK, I'm dumb: why is the above needed ?
> >
> > Ensure that the compiler agrees that  *var and val are compatible. Taken
> > over from percpu_add_op().
>
> Isn't that the purpose of __builtin_types_compatible_p(t1, t2) ?

We also have a __same_type() macro in linux/compiler.h... But if I use
that the kernel build fails. I guess the check is too strict.

---
 arch/x86/include/asm/percpu.h |    6 +-----
 1 file changed, 1 insertion(+), 5 deletions(-)

Index: linux-2.6/arch/x86/include/asm/percpu.h
===================================================================
--- linux-2.6.orig/arch/x86/include/asm/percpu.h	2010-11-29 12:46:50.000000000 -0600
+++ linux-2.6/arch/x86/include/asm/percpu.h	2010-11-29 12:52:25.000000000 -0600
@@ -127,11 +127,7 @@ do {									\
 	typedef typeof(var) pao_T__;					\
 	const int pao_ID__ = (__builtin_constant_p(val) &&		\
 			      ((val) == 1 || (val) == -1)) ? (val) : 0;	\
-	if (0) {							\
-		pao_T__ pao_tmp__;					\
-		pao_tmp__ = (val);					\
-		(void)pao_tmp__;					\
-	}								\
+	BUILD_BUG_ON(!__same_type(var, val));				\
 	switch (sizeof(var)) {						\
 	case 1:								\
 		if (pao_ID__ == 1)					\



^ permalink raw reply	[flat|nested] 36+ messages in thread

* Re: [thisops uV2 04/10] x86: Support for this_cpu_add,sub,dec,inc_return
  2010-11-29 18:54         ` Christoph Lameter
@ 2010-11-29 19:22           ` Mathieu Desnoyers
  2010-11-29 20:09             ` Christoph Lameter
  0 siblings, 1 reply; 36+ messages in thread
From: Mathieu Desnoyers @ 2010-11-29 19:22 UTC (permalink / raw)
  To: Christoph Lameter
  Cc: akpm, Pekka Enberg, linux-kernel, Eric Dumazet, Tejun Heo

* Christoph Lameter (cl@linux.com) wrote:
> On Mon, 29 Nov 2010, Mathieu Desnoyers wrote:
> 
> > > > > +#define percpu_add_return_op(var, val)					\
> > > > > +({									\
> > > > > +	typedef typeof(var) pao_T__;					\
> > > > > +	typeof(var) pfo_ret__ = val;					\
> > > > > +	if (0) {							\
> > > > > +		pao_T__ pao_tmp__;					\
> > > > > +		pao_tmp__ = (val);					\
> > > > > +		(void)pao_tmp__;					\
> > > > > +	}								\
> > > >
> > > > OK, I'm dumb: why is the above needed ?
> > >
> > > Ensure that the compiler agrees that  *var and val are compatible. Taken
> > > over from percpu_add_op().
> >
> > Isn't that the purpose of __builtin_types_compatible_p(t1, t2) ?
> 
> We also have a __same_type() macro in linux/compiler.h... But if I use
> that the kernel build fails. I guess the check is too strict.

Ah, ok, so you don't mind if var and val have different types. You just want
the compiler to be able to cast one into the other without complaining.

Isn't it already checked by "typeof(var) pfo_ret__ = val;" ?

Which semantic do you expect when using different types for var and val ?

The assembly all acts on the following pfo_ret__ :

  typeof(var) pfo_ret__ = val;

But at the end, you return :

  pfo_ret__ + (val);

So if pfo_ret has a type smaller than val (e.g. unsigned short vs unsigned int),
unless I'm mistakened, the type returned by percpu_add_return_op will be an
unsigned int (the larger of the two), but the atomic operation is performed on
an unsigned short. This might cause confusion for the caller. Casting

  pfo_ret__ + (typeof(var)) (val);

might be appropriate.

Thanks,

Mathieu

> 
> ---
>  arch/x86/include/asm/percpu.h |    6 +-----
>  1 file changed, 1 insertion(+), 5 deletions(-)
> 
> Index: linux-2.6/arch/x86/include/asm/percpu.h
> ===================================================================
> --- linux-2.6.orig/arch/x86/include/asm/percpu.h	2010-11-29 12:46:50.000000000 -0600
> +++ linux-2.6/arch/x86/include/asm/percpu.h	2010-11-29 12:52:25.000000000 -0600
> @@ -127,11 +127,7 @@ do {									\
>  	typedef typeof(var) pao_T__;					\
>  	const int pao_ID__ = (__builtin_constant_p(val) &&		\
>  			      ((val) == 1 || (val) == -1)) ? (val) : 0;	\
> -	if (0) {							\
> -		pao_T__ pao_tmp__;					\
> -		pao_tmp__ = (val);					\
> -		(void)pao_tmp__;					\
> -	}								\
> +	BUILD_BUG_ON(!__same_type(var, val));				\
>  	switch (sizeof(var)) {						\
>  	case 1:								\
>  		if (pao_ID__ == 1)					\
> 
> 

-- 
Mathieu Desnoyers
Operating System Efficiency R&D Consultant
EfficiOS Inc.
http://www.efficios.com

^ permalink raw reply	[flat|nested] 36+ messages in thread

* Re: [thisops uV2 02/10] vmstat: Optimize zone counter modifications through the use of this cpu operations
  2010-11-29 17:13       ` Christoph Lameter
@ 2010-11-29 19:28         ` Mathieu Desnoyers
  2010-11-29 20:07           ` Christoph Lameter
  0 siblings, 1 reply; 36+ messages in thread
From: Mathieu Desnoyers @ 2010-11-29 19:28 UTC (permalink / raw)
  To: Christoph Lameter
  Cc: akpm, Pekka Enberg, linux-kernel, Eric Dumazet, Tejun Heo

* Christoph Lameter (cl@linux.com) wrote:
> We could do this with local cmpxchgs like in the following patch. This
> would avoid preemption disable and interrupt disable (at least on x86).
> Trouble is how do we make this fit for architectures that do not have
> cmpxchg?

All architectures should have a fallback nowadays, no ? This might involve
disabling interrupts around a cmpxchg emulation, which would make the slow path
disable/enable interrupts twice. Is it what you are concerned about ?

Thanks,

Matheu

> 
> 
> Index: linux-2.6/mm/vmstat.c
> ===================================================================
> --- linux-2.6.orig/mm/vmstat.c	2010-11-29 10:58:52.000000000 -0600
> +++ linux-2.6/mm/vmstat.c	2010-11-29 11:11:34.000000000 -0600
> @@ -169,18 +169,23 @@ void __mod_zone_page_state(struct zone *
>  {
>  	struct per_cpu_pageset __percpu *pcp = zone->pageset;
>  	s8 __percpu *p = pcp->vm_stat_diff + item;
> -	long x;
> -	long t;
> +	long o, n, t, z;
> 
> -	x = delta + __this_cpu_read(*p);
> +	do {
> +		z = 0;
> +		t = this_cpu_read(pcp->stat_threshold);
> +		o = this_cpu_read(*p);
> +		n = delta + o;
> +
> +		if (n > t || n < -t) {
> +			/* Overflow must be added to zone counters */
> +			z = n;
> +			n = 0;
> +		}
> +	} while (o != n && this_cpu_cmpxchg(*p, o, n) != o);
> 
> -	t = __this_cpu_read(pcp->stat_threshold);
> -
> -	if (unlikely(x > t || x < -t)) {
> -		zone_page_state_add(x, zone, item);
> -		x = 0;
> -	}
> -	__this_cpu_write(*p, x);
> +	if (z)
> +		zone_page_state_add(z, zone, item);
>  }
>  EXPORT_SYMBOL(__mod_zone_page_state);
> 
> @@ -190,11 +195,7 @@ EXPORT_SYMBOL(__mod_zone_page_state);
>  void mod_zone_page_state(struct zone *zone, enum zone_stat_item item,
>  					int delta)
>  {
> -	unsigned long flags;
> -
> -	local_irq_save(flags);
>  	__mod_zone_page_state(zone, item, delta);
> -	local_irq_restore(flags);
>  }
>  EXPORT_SYMBOL(mod_zone_page_state);
> 
> 

-- 
Mathieu Desnoyers
Operating System Efficiency R&D Consultant
EfficiOS Inc.
http://www.efficios.com

^ permalink raw reply	[flat|nested] 36+ messages in thread

* Re: [thisops uV2 02/10] vmstat: Optimize zone counter modifications through the use of this cpu operations
  2010-11-29 19:28         ` Mathieu Desnoyers
@ 2010-11-29 20:07           ` Christoph Lameter
  2010-11-29 20:59             ` Christoph Lameter
  0 siblings, 1 reply; 36+ messages in thread
From: Christoph Lameter @ 2010-11-29 20:07 UTC (permalink / raw)
  To: Mathieu Desnoyers
  Cc: akpm, Pekka Enberg, linux-kernel, Eric Dumazet, Tejun Heo

On Mon, 29 Nov 2010, Mathieu Desnoyers wrote:

> * Christoph Lameter (cl@linux.com) wrote:
> > We could do this with local cmpxchgs like in the following patch. This
> > would avoid preemption disable and interrupt disable (at least on x86).
> > Trouble is how do we make this fit for architectures that do not have
> > cmpxchg?
>
> All architectures should have a fallback nowadays, no ? This might involve
> disabling interrupts around a cmpxchg emulation, which would make the slow path
> disable/enable interrupts twice. Is it what you are concerned about ?

We are adding new per cpu atomic functionality here and we are
establishing the fallbacks as we go.

Fallbacks are to an cmpxchg emulation using interrupt disable etc of
course but the performance may be lower than the current version. I need
to get some numbers to assess the impact. In the meantime I have a full
cmpxchg based vmstat implementation here that seems to work without a
problem:

Subject: vmstat: User per cpu atomics to avoid interrupt and preempt disable

Currently the operations to increment vm counters must disable preempt and/or
interrupts in order to not mess up their housekeeping of counters. Both measures
have disadvantages. Interrupt disable causes a lot of overhead. Disabling
preemption is something that the RT folks do not like.

So use this_cpu_cmpxchg() to avoid any of those things. The fetching of the
counter thresholds is racy. A threshold from another cpu may be applied
if we happen to be rescheduled on another cpu.  However, the following
vmstat operation will then bring the counter again under the
threshold limit.

Signed-off-by: Christoph Lameter <cl@linux.com>

---
 mm/vmstat.c |  145 +++++++++++++++++++++++++++++++++++-------------------------
 1 file changed, 86 insertions(+), 59 deletions(-)

Index: linux-2.6/mm/vmstat.c
===================================================================
--- linux-2.6.orig/mm/vmstat.c	2010-11-29 12:31:20.000000000 -0600
+++ linux-2.6/mm/vmstat.c	2010-11-29 12:36:42.000000000 -0600
@@ -162,27 +162,87 @@ static void refresh_zone_stat_thresholds
 }

 /*
- * For use when we know that interrupts are disabled.
+ * mod_state() modifies the zone counter state through atomic per cpu
+ * operations.
+ *
+ * Overstep mode specifies how overstep should handled:
+ *	0	No overstepping
+ *	1	Overstepping half of threshold
+ *	-1	Overstepping minus half of threshold
  */
-void __mod_zone_page_state(struct zone *zone, enum zone_stat_item item,
-				int delta)
+static inline void mod_state(struct zone *zone,
+	enum zone_stat_item item, int delta, int overstep_mode)
 {
 	struct per_cpu_pageset __percpu *pcp = zone->pageset;
 	s8 __percpu *p = pcp->vm_stat_diff + item;
-	long x;
-	long t;
+	long o, n, t, z;

-	x = delta + __this_cpu_read(*p);
+	do {
+		z = 0;	/* overflow to zone counters */

-	t = __this_cpu_read(pcp->stat_threshold);
+		/*
+		 * The fetching of the stat_threshold is racy. We may apply
+		 * a counter threshold to the wrong the cpu if we get
+		 * rescheduled while executing here. However, the following
+		 * will apply the threshold again and therefore bring the
+		 * counter under the threshold.
+		 */
+		t = this_cpu_read(pcp->stat_threshold);
+
+		o = this_cpu_read(*p);
+		n = delta + o;
+
+		if (n > t || n < -t) {
+			int os = overstep_mode * (t >> 1) ;
+
+			/* Overflow must be added to zone counters */
+			z = n + os;
+			n = -os;
+		}
+	} while (o != n && irqsafe_cpu_cmpxchg(*p, o, n) != o);

-	if (unlikely(x > t || x < -t)) {
-		zone_page_state_add(x, zone, item);
-		x = 0;
-	}
-	__this_cpu_write(*p, x);
+	if (z)
+		zone_page_state_add(z, zone, item);
+}
+
+/*
+ * Variant for the case where we know that the preemption /interrupt
+ * has been taken care of.
+ */
+static inline void __mod_state(struct zone *zone,
+	enum zone_stat_item item, int delta, int overstep_mode)
+{
+	struct per_cpu_pageset __percpu *pcp = zone->pageset;
+	s8 __percpu *p = pcp->vm_stat_diff + item;
+	long o, n, t, z;
+
+	do {
+		z = 0;	/* overflow to zone counters */
+
+		/*
+		 * The fetching of the stat_threshold is racy. We may apply
+		 * a counter threshold to the wrong the cpu if we get
+		 * rescheduled while executing here. However, the following
+		 * will apply the threshold again and therefore bring the
+		 * counter under the threshold.
+		 */
+		t = __this_cpu_read(pcp->stat_threshold);
+
+		o = __this_cpu_read(*p);
+		n = delta + o;
+
+		if (n > t || n < -t) {
+			int os = overstep_mode * (t >> 1) ;
+
+			/* Overflow must be added to zone counters */
+			z = n + os;
+			n = -os;
+		}
+	} while (o != n && __this_cpu_cmpxchg(*p, o, n) != o);
+
+	if (z)
+		zone_page_state_add(z, zone, item);
 }
-EXPORT_SYMBOL(__mod_zone_page_state);

 /*
  * For an unknown interrupt state
@@ -190,14 +250,17 @@ EXPORT_SYMBOL(__mod_zone_page_state);
 void mod_zone_page_state(struct zone *zone, enum zone_stat_item item,
 					int delta)
 {
-	unsigned long flags;
-
-	local_irq_save(flags);
-	__mod_zone_page_state(zone, item, delta);
-	local_irq_restore(flags);
+	mod_state(zone, item, delta, 0);
 }
 EXPORT_SYMBOL(mod_zone_page_state);

+void __mod_zone_page_state(struct zone *zone, enum zone_stat_item item,
+					int delta)
+{
+	__mod_state(zone, item, delta, 0);
+}
+EXPORT_SYMBOL(__mod_zone_page_state);
+
 /*
  * Optimized increment and decrement functions.
  *
@@ -223,18 +286,7 @@ EXPORT_SYMBOL(mod_zone_page_state);
  */
 void __inc_zone_state(struct zone *zone, enum zone_stat_item item)
 {
-	struct per_cpu_pageset __percpu *pcp = zone->pageset;
-	s8 __percpu *p = pcp->vm_stat_diff + item;
-	s8 v, t;
-
-	v = __this_cpu_inc_return(*p);
-	t = __this_cpu_read(pcp->stat_threshold);
-	if (unlikely(v > t)) {
-		s8 overstep = t >> 1;
-
-		zone_page_state_add(v + overstep, zone, item);
-		__this_cpu_write(*p, - overstep);
-	}
+	__mod_state(zone, item, 1, 1);
 }

 void __inc_zone_page_state(struct page *page, enum zone_stat_item item)
@@ -245,18 +297,7 @@ EXPORT_SYMBOL(__inc_zone_page_state);

 void __dec_zone_state(struct zone *zone, enum zone_stat_item item)
 {
-	struct per_cpu_pageset __percpu *pcp = zone->pageset;
-	s8 __percpu *p = pcp->vm_stat_diff + item;
-	s8 v, t;
-
-	v = __this_cpu_dec_return(*p);
-	t = __this_cpu_read(pcp->stat_threshold);
-	if (unlikely(v < - t)) {
-		s8 overstep = t >> 1;
-
-		zone_page_state_add(v - overstep, zone, item);
-		__this_cpu_write(*p, overstep);
-	}
+	__mod_state(zone, item, -1, -1);
 }

 void __dec_zone_page_state(struct page *page, enum zone_stat_item item)
@@ -267,32 +308,18 @@ EXPORT_SYMBOL(__dec_zone_page_state);

 void inc_zone_state(struct zone *zone, enum zone_stat_item item)
 {
-	unsigned long flags;
-
-	local_irq_save(flags);
-	__inc_zone_state(zone, item);
-	local_irq_restore(flags);
+	mod_state(zone, item, 1, 1);
 }

 void inc_zone_page_state(struct page *page, enum zone_stat_item item)
 {
-	unsigned long flags;
-	struct zone *zone;
-
-	zone = page_zone(page);
-	local_irq_save(flags);
-	__inc_zone_state(zone, item);
-	local_irq_restore(flags);
+	mod_state(page_zone(page), item, 1, 1);
 }
 EXPORT_SYMBOL(inc_zone_page_state);

 void dec_zone_page_state(struct page *page, enum zone_stat_item item)
 {
-	unsigned long flags;
-
-	local_irq_save(flags);
-	__dec_zone_page_state(page, item);
-	local_irq_restore(flags);
+	mod_state(page_zone(page), item, -1, -1);
 }
 EXPORT_SYMBOL(dec_zone_page_state);




^ permalink raw reply	[flat|nested] 36+ messages in thread

* Re: [thisops uV2 04/10] x86: Support for this_cpu_add,sub,dec,inc_return
  2010-11-29 19:22           ` Mathieu Desnoyers
@ 2010-11-29 20:09             ` Christoph Lameter
  0 siblings, 0 replies; 36+ messages in thread
From: Christoph Lameter @ 2010-11-29 20:09 UTC (permalink / raw)
  To: Mathieu Desnoyers
  Cc: akpm, Pekka Enberg, linux-kernel, Eric Dumazet, Tejun Heo

On Mon, 29 Nov 2010, Mathieu Desnoyers wrote:

> Ah, ok, so you don't mind if var and val have different types. You just want
> the compiler to be able to cast one into the other without complaining.
>
> Isn't it already checked by "typeof(var) pfo_ret__ = val;" ?

Right. I will drop that strange fragment.

> But at the end, you return :
>
>   pfo_ret__ + (val);
>
> So if pfo_ret has a type smaller than val (e.g. unsigned short vs unsigned int),
> unless I'm mistakened, the type returned by percpu_add_return_op will be an
> unsigned int (the larger of the two), but the atomic operation is performed on
> an unsigned short. This might cause confusion for the caller. Casting
>
>   pfo_ret__ + (typeof(var)) (val);
>
> might be appropriate.

The return needs to be constrained to pfo_ret__. If the value wraps around
then we do not want a value larger than the variable returned.

	pfo_ret__ += val;
	pfo_ret;

will do I think.


^ permalink raw reply	[flat|nested] 36+ messages in thread

* Re: [thisops uV2 02/10] vmstat: Optimize zone counter modifications through the use of this cpu operations
  2010-11-29 20:07           ` Christoph Lameter
@ 2010-11-29 20:59             ` Christoph Lameter
  0 siblings, 0 replies; 36+ messages in thread
From: Christoph Lameter @ 2010-11-29 20:59 UTC (permalink / raw)
  To: Mathieu Desnoyers
  Cc: akpm, Pekka Enberg, linux-kernel, Eric Dumazet, Tejun Heo

Some numbers for the cmpxchg implementation (cycles)

Function		Orig	Cmpxchg Fallback
--------------------------------------------------
inc_zone_page_state	170	32	196
__inc_zone_page_s	23	31	26
inc/dec pairs		347	69	379

So the fallback is always worse. cmpxchg is only better for the versions
of zone counters where we need to disable and enable interrupts.

This would suggest to only use the cmpxchg for arches that have cmpxchg
local and only in the case of the full versions.

There are 12 cases of inc_zone_page_state and 13 of dec_zone_page_state
as well as 7 cases of mod_zone_page_state.

^ permalink raw reply	[flat|nested] 36+ messages in thread

end of thread, other threads:[~2010-11-29 20:59 UTC | newest]

Thread overview: 36+ messages (download: mbox.gz follow: Atom feed
-- links below jump to the message on this page --
2010-11-26 21:09 [thisops uV2 00/10] Upgrade of this_cpu_ops V2 Christoph Lameter
2010-11-26 21:09 ` [thisops uV2 01/10] percpucounter: Optimize __percpu_counter_add a bit through the use of this_cpu() options Christoph Lameter
2010-11-27 14:42   ` Mathieu Desnoyers
2010-11-26 21:09 ` [thisops uV2 02/10] vmstat: Optimize zone counter modifications through the use of this cpu operations Christoph Lameter
2010-11-27  8:00   ` Pekka Enberg
2010-11-27 14:49   ` Mathieu Desnoyers
2010-11-29 16:16     ` Christoph Lameter
2010-11-29 17:13       ` Christoph Lameter
2010-11-29 19:28         ` Mathieu Desnoyers
2010-11-29 20:07           ` Christoph Lameter
2010-11-29 20:59             ` Christoph Lameter
2010-11-26 21:09 ` [thisops uV2 03/10] percpu: Generic support for this_cpu_add,sub,dec,inc_return Christoph Lameter
2010-11-27 14:58   ` Mathieu Desnoyers
2010-11-29 16:08     ` Christoph Lameter
2010-11-26 21:09 ` [thisops uV2 04/10] x86: Support " Christoph Lameter
2010-11-27  8:06   ` Pekka Enberg
2010-11-29 16:03     ` Christoph Lameter
2010-11-27 15:00   ` Mathieu Desnoyers
2010-11-29 16:31     ` Christoph Lameter
2010-11-29 18:33       ` Mathieu Desnoyers
2010-11-29 18:54         ` Christoph Lameter
2010-11-29 19:22           ` Mathieu Desnoyers
2010-11-29 20:09             ` Christoph Lameter
2010-11-26 21:09 ` [thisops uV2 05/10] x86: Use this_cpu_inc_return for nmi counter Christoph Lameter
2010-11-27  8:07   ` Pekka Enberg
2010-11-27 15:00   ` Mathieu Desnoyers
2010-11-26 21:09 ` [thisops uV2 06/10] vmstat: Use this_cpu_inc_return for vm statistics Christoph Lameter
2010-11-27  8:09   ` Pekka Enberg
2010-11-29 16:04     ` Christoph Lameter
2010-11-26 21:09 ` [thisops uV2 07/10] highmem: Use this_cpu_xx_return() operations Christoph Lameter
2010-11-26 21:09 ` [thisops uV2 08/10] percpu: generic this_cpu_cmpxchg() and this_cpu_cmpxchg_double support Christoph Lameter
2010-11-26 21:09 ` [thisops uV2 09/10] x86: this_cpu_cmpxchg and this_cpu_cmpxchg_double operations Christoph Lameter
2010-11-27  6:30   ` Eric Dumazet
2010-11-27 15:20   ` Mathieu Desnoyers
2010-11-29 16:11     ` Christoph Lameter
2010-11-26 21:09 ` [thisops uV2 10/10] slub: Lockless fastpaths Christoph Lameter

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox