public inbox for linux-kernel@vger.kernel.org
* [cpuops cmpxchg double V3 0/5] this_cpu_cmpxchg_double support
@ 2011-02-25 17:38 Christoph Lameter
  2011-02-25 17:38 ` [cpuops cmpxchg double V3 1/5] slub: min_partial needs to be in first cacheline Christoph Lameter
                   ` (5 more replies)
  0 siblings, 6 replies; 19+ messages in thread
From: Christoph Lameter @ 2011-02-25 17:38 UTC (permalink / raw)
  To: Tejun Heo
  Cc: akpm, Pekka Enberg, linux-kernel, Eric Dumazet, H. Peter Anvin,
	Mathieu Desnoyers

This patch series introduces this_cpu_cmpxchg_double().

x86 cpus support the cmpxchg16b and cmpxchg8b instructions, which are
capable of exchanging two words instead of one during a cmpxchg.
Two words allow more state to be swapped in a single atomic instruction.

this_cpu_cmpxchg_double() is used in the slub allocator to avoid
disabling and reenabling interrupts in both the alloc and free fastpaths.
Using the new operation significantly speeds up the fastpaths.

V1->V2
	- Change parameter convention for this_cpu_cmpxchg_double. Specify both
	  percpu variables in the same way as the two old and new values.
	- Require a per cpu variable rather than a per cpu pointer, to conform
	  to the convention used by the other this_cpu ops.

V2->V3:
        - Do not use CONFIG_DEBUG_VM to enable cmpxchg diagnostics. Use
          custom SLUB define instead.
        - Add patch to move min_partial into a different cacheline.
        - Rediff


^ permalink raw reply	[flat|nested] 19+ messages in thread

* [cpuops cmpxchg double V3 1/5] slub: min_partial needs to be in first cacheline
  2011-02-25 17:38 [cpuops cmpxchg double V3 0/5] this_cpu_cmpxchg_double support Christoph Lameter
@ 2011-02-25 17:38 ` Christoph Lameter
  2011-02-25 17:38 ` [cpuops cmpxchg double V3 2/5] slub: Get rid of slab_free_hook_irq() Christoph Lameter
                   ` (4 subsequent siblings)
  5 siblings, 0 replies; 19+ messages in thread
From: Christoph Lameter @ 2011-02-25 17:38 UTC (permalink / raw)
  To: Tejun Heo
  Cc: akpm, Pekka Enberg, linux-kernel, Eric Dumazet, H. Peter Anvin,
	Mathieu Desnoyers

[-- Attachment #1: slub_min_partial_first_cacheline --]
[-- Type: text/plain, Size: 1103 bytes --]

min_partial is used in unfreeze_slab(), which is a performance critical
function, so it needs to be in the first cacheline of struct kmem_cache.

Signed-off-by: Christoph Lameter <cl@linux.com>

---
 include/linux/slub_def.h |    2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

Index: linux-2.6/include/linux/slub_def.h
===================================================================
--- linux-2.6.orig/include/linux/slub_def.h	2011-01-28 11:57:29.000000000 -0600
+++ linux-2.6/include/linux/slub_def.h	2011-01-28 11:57:52.000000000 -0600
@@ -70,6 +70,7 @@ struct kmem_cache {
 	struct kmem_cache_cpu __percpu *cpu_slab;
 	/* Used for retriving partial slabs etc */
 	unsigned long flags;
+	unsigned long min_partial;
 	int size;		/* The size of an object including meta data */
 	int objsize;		/* The size of an object without meta data */
 	int offset;		/* Free pointer offset. */
@@ -83,7 +84,6 @@ struct kmem_cache {
 	void (*ctor)(void *);
 	int inuse;		/* Offset to metadata */
 	int align;		/* Alignment */
-	unsigned long min_partial;
 	const char *name;	/* Name (only for display!) */
 	struct list_head list;	/* List of slab caches */
 #ifdef CONFIG_SYSFS



* [cpuops cmpxchg double V3 2/5] slub: Get rid of slab_free_hook_irq()
  2011-02-25 17:38 [cpuops cmpxchg double V3 0/5] this_cpu_cmpxchg_double support Christoph Lameter
  2011-02-25 17:38 ` [cpuops cmpxchg double V3 1/5] slub: min_partial needs to be in first cacheline Christoph Lameter
@ 2011-02-25 17:38 ` Christoph Lameter
  2011-02-25 18:23   ` Mathieu Desnoyers
  2011-02-25 17:38 ` [cpuops cmpxchg double V3 3/5] Generic support for this_cpu_cmpxchg_double Christoph Lameter
                   ` (3 subsequent siblings)
  5 siblings, 1 reply; 19+ messages in thread
From: Christoph Lameter @ 2011-02-25 17:38 UTC (permalink / raw)
  To: Tejun Heo
  Cc: akpm, Pekka Enberg, linux-kernel, Eric Dumazet, H. Peter Anvin,
	Mathieu Desnoyers

[-- Attachment #1: slub_remove_irq_freehook --]
[-- Type: text/plain, Size: 2269 bytes --]

The following patch will make the fastpaths lockless and will no longer
require interrupts to be disabled. Calling the free hook with irqs
already disabled will then no longer be possible.

Move the slab_free_hook_irq() logic into slab_free_hook(). Disable
interrupts only if features are enabled that require the callbacks to run
with interrupts off, and reenable them after the calls have been made.

Signed-off-by: Christoph Lameter <cl@linux.com>

---
 mm/slub.c |   29 +++++++++++++++++------------
 1 file changed, 17 insertions(+), 12 deletions(-)

Index: linux-2.6/mm/slub.c
===================================================================
--- linux-2.6.orig/mm/slub.c	2011-01-10 12:06:58.000000000 -0600
+++ linux-2.6/mm/slub.c	2011-01-10 12:07:11.000000000 -0600
@@ -807,14 +807,24 @@ static inline void slab_post_alloc_hook(
 static inline void slab_free_hook(struct kmem_cache *s, void *x)
 {
 	kmemleak_free_recursive(x, s->flags);
-}
 
-static inline void slab_free_hook_irq(struct kmem_cache *s, void *object)
-{
-	kmemcheck_slab_free(s, object, s->objsize);
-	debug_check_no_locks_freed(object, s->objsize);
-	if (!(s->flags & SLAB_DEBUG_OBJECTS))
-		debug_check_no_obj_freed(object, s->objsize);
+	/*
+	 * The trouble is that we may no longer disable interrupts in the
+	 * fast path. So in order to make the debug calls that expect irqs
+	 * to be disabled we need to disable interrupts temporarily.
+	 */
+#if defined(CONFIG_KMEMCHECK) || defined(CONFIG_LOCKDEP)
+	{
+		unsigned long flags;
+
+		local_irq_save(flags);
+		kmemcheck_slab_free(s, x, s->objsize);
+		debug_check_no_locks_freed(x, s->objsize);
+		if (!(s->flags & SLAB_DEBUG_OBJECTS))
+			debug_check_no_obj_freed(x, s->objsize);
+		local_irq_restore(flags);
+	}
+#endif
 }
 
 /*
@@ -1101,9 +1111,6 @@ static inline void slab_post_alloc_hook(
 
 static inline void slab_free_hook(struct kmem_cache *s, void *x) {}
 
-static inline void slab_free_hook_irq(struct kmem_cache *s,
-		void *object) {}
-
 #endif /* CONFIG_SLUB_DEBUG */
 
 /*
@@ -1909,8 +1916,6 @@ static __always_inline void slab_free(st
 	local_irq_save(flags);
 	c = __this_cpu_ptr(s->cpu_slab);
 
-	slab_free_hook_irq(s, x);
-
 	if (likely(page == c->page && c->node != NUMA_NO_NODE)) {
 		set_freepointer(s, object, c->freelist);
 		c->freelist = object;



* [cpuops cmpxchg double V3 3/5] Generic support for this_cpu_cmpxchg_double
  2011-02-25 17:38 [cpuops cmpxchg double V3 0/5] this_cpu_cmpxchg_double support Christoph Lameter
  2011-02-25 17:38 ` [cpuops cmpxchg double V3 1/5] slub: min_partial needs to be in first cacheline Christoph Lameter
  2011-02-25 17:38 ` [cpuops cmpxchg double V3 2/5] slub: Get rid of slab_free_hook_irq() Christoph Lameter
@ 2011-02-25 17:38 ` Christoph Lameter
  2011-02-25 18:25   ` Mathieu Desnoyers
                     ` (2 more replies)
  2011-02-25 17:38 ` [cpuops cmpxchg double V3 4/5] Lockless (and preemptless) fastpaths for slub Christoph Lameter
                   ` (2 subsequent siblings)
  5 siblings, 3 replies; 19+ messages in thread
From: Christoph Lameter @ 2011-02-25 17:38 UTC (permalink / raw)
  To: Tejun Heo
  Cc: akpm, Pekka Enberg, linux-kernel, Eric Dumazet, H. Peter Anvin,
	Mathieu Desnoyers

[-- Attachment #1: cpuops_double_generic --]
[-- Type: text/plain, Size: 6844 bytes --]

Introduce this_cpu_cmpxchg_double(). It compares two consecutive words
against the given old values and replaces both of them if there is a
match.

	bool this_cpu_cmpxchg_double(pcp1, pcp2,
		old_word1, old_word2, new_word1, new_word2)

this_cpu_cmpxchg_double() does not return the old values (difficult since
there are two words) but a boolean indicating whether the operation was
successful.

The first percpu variable must be double word aligned!

Signed-off-by: Christoph Lameter <cl@linux.com>

---
 include/linux/percpu.h |  130 +++++++++++++++++++++++++++++++++++++++++++++++++
 1 file changed, 130 insertions(+)

Index: linux-2.6/include/linux/percpu.h
===================================================================
--- linux-2.6.orig/include/linux/percpu.h	2011-01-10 10:22:35.000000000 -0600
+++ linux-2.6/include/linux/percpu.h	2011-01-10 10:26:43.000000000 -0600
@@ -255,6 +255,29 @@ extern void __bad_size_call_parameter(vo
 	pscr2_ret__;							\
 })
 
+/*
+ * Special handling for cmpxchg_double. cmpxchg_double is passed two
+ * percpu variables. The first has to be aligned to a double word
+ * boundary and the second has to follow directly thereafter.
+ */
+#define __pcpu_double_call_return_int(stem, pcp1, pcp2, ...)		\
+({									\
+	int ret__;							\
+	__verify_pcpu_ptr(&pcp1);					\
+	VM_BUG_ON((unsigned long)(&pcp1) % (2 * sizeof(pcp1)));		\
+	VM_BUG_ON((unsigned long)(&pcp2) != (unsigned long)(&pcp1) + sizeof(pcp1));\
+	VM_BUG_ON(sizeof(pcp1) != sizeof(pcp2));			\
+	switch(sizeof(pcp1)) {						\
+	case 1: ret__ = stem##1(pcp1, pcp2, __VA_ARGS__);break;		\
+	case 2: ret__ = stem##2(pcp1, pcp2, __VA_ARGS__);break;		\
+	case 4: ret__ = stem##4(pcp1, pcp2, __VA_ARGS__);break;		\
+	case 8: ret__ = stem##8(pcp1, pcp2, __VA_ARGS__);break;		\
+	default:							\
+		__bad_size_call_parameter();break;			\
+	}								\
+	ret__;								\
+})
+
 #define __pcpu_size_call(stem, variable, ...)				\
 do {									\
 	__verify_pcpu_ptr(&(variable));					\
@@ -318,6 +341,80 @@ do {									\
 # define this_cpu_read(pcp)	__pcpu_size_call_return(this_cpu_read_, (pcp))
 #endif
 
+/*
+ * cmpxchg_double replaces two adjacent scalars at once. The first two
+ * parameters are per cpu variables which have to be of the same size.
+ * A truth value is returned to indicate success or
+ * failure (since a double register result is difficult to handle).
+ * There is very limited hardware support for these operations. So only certain
+ * sizes may work.
+ */
+#define __this_cpu_generic_cmpxchg_double(pcp1, pcp2, oval1, oval2, nval1, nval2)	\
+({									\
+	int __ret = 0;							\
+	if (__this_cpu_read(pcp1) == (oval1) &&				\
+			 __this_cpu_read(pcp2)  == (oval2)) {		\
+		__this_cpu_write(pcp1, (nval1));			\
+		__this_cpu_write(pcp2, (nval2));			\
+		__ret = 1;						\
+	}								\
+	(__ret);							\
+})
+
+#ifndef __this_cpu_cmpxchg_double
+# ifndef __this_cpu_cmpxchg_double_1
+#  define __this_cpu_cmpxchg_double_1(pcp1, pcp2, oval1, oval2, nval1, nval2)	\
+	__this_cpu_generic_cmpxchg_double(pcp1, pcp2, oval1, oval2, nval1, nval2)
+# endif
+# ifndef __this_cpu_cmpxchg_double_2
+#  define __this_cpu_cmpxchg_double_2(pcp1, pcp2, oval1, oval2, nval1, nval2)	\
+	__this_cpu_generic_cmpxchg_double(pcp1, pcp2, oval1, oval2, nval1, nval2)
+# endif
+# ifndef __this_cpu_cmpxchg_double_4
+#  define __this_cpu_cmpxchg_double_4(pcp1, pcp2, oval1, oval2, nval1, nval2)	\
+	__this_cpu_generic_cmpxchg_double(pcp1, pcp2, oval1, oval2, nval1, nval2)
+# endif
+# ifndef __this_cpu_cmpxchg_double_8
+#  define __this_cpu_cmpxchg_double_8(pcp1, pcp2, oval1, oval2, nval1, nval2)	\
+	__this_cpu_generic_cmpxchg_double(pcp1, pcp2, oval1, oval2, nval1, nval2)
+# endif
+# define __this_cpu_cmpxchg_double(pcp1, pcp2, oval1, oval2, nval1, nval2)	\
+	__pcpu_double_call_return_int(__this_cpu_cmpxchg_double_, (pcp1), (pcp2),	\
+					 oval1, oval2, nval1, nval2)
+#endif
+
+#define _this_cpu_generic_cmpxchg_double(pcp1, pcp2, oval1, oval2, nval1, nval2)	\
+({									\
+	int ret__;							\
+	preempt_disable();						\
+	ret__ = __this_cpu_generic_cmpxchg_double(pcp1, pcp2,		\
+			oval1, oval2, nval1, nval2);			\
+	preempt_enable();						\
+	ret__;								\
+})
+
+#ifndef this_cpu_cmpxchg_double
+# ifndef this_cpu_cmpxchg_double_1
+#  define this_cpu_cmpxchg_double_1(pcp1, pcp2, oval1, oval2, nval1, nval2)	\
+	_this_cpu_generic_cmpxchg_double(pcp1, pcp2, oval1, oval2, nval1, nval2)
+# endif
+# ifndef this_cpu_cmpxchg_double_2
+#  define this_cpu_cmpxchg_double_2(pcp1, pcp2, oval1, oval2, nval1, nval2)	\
+	_this_cpu_generic_cmpxchg_double(pcp1, pcp2, oval1, oval2, nval1, nval2)
+# endif
+# ifndef this_cpu_cmpxchg_double_4
+#  define this_cpu_cmpxchg_double_4(pcp1, pcp2, oval1, oval2, nval1, nval2)	\
+	_this_cpu_generic_cmpxchg_double(pcp1, pcp2, oval1, oval2, nval1, nval2)
+# endif
+# ifndef this_cpu_cmpxchg_double_8
+#  define this_cpu_cmpxchg_double_8(pcp1, pcp2, oval1, oval2, nval1, nval2)	\
+	_this_cpu_generic_cmpxchg_double(pcp1, pcp2, oval1, oval2, nval1, nval2)
+# endif
+# define this_cpu_cmpxchg_double(pcp1, pcp2, oval1, oval2, nval1, nval2)	\
+	__pcpu_double_call_return_int(this_cpu_cmpxchg_double_, (pcp1), (pcp2),	\
+		oval1, oval2, nval1, nval2)
+#endif
+
 #define _this_cpu_generic_to_op(pcp, val, op)				\
 do {									\
 	preempt_disable();						\
@@ -823,4 +920,37 @@ do {									\
 	__pcpu_size_call_return2(irqsafe_cpu_cmpxchg_, (pcp), oval, nval)
 #endif
 
+#define irqsafe_generic_cpu_cmpxchg_double(pcp1, pcp2, oval1, oval2, nval1, nval2)	\
+({									\
+	int ret__;							\
+	unsigned long flags;						\
+	local_irq_save(flags);						\
+	ret__ = __this_cpu_generic_cmpxchg_double(pcp1, pcp2,		\
+			oval1, oval2, nval1, nval2);			\
+	local_irq_restore(flags);					\
+	ret__;								\
+})
+
+#ifndef irqsafe_cpu_cmpxchg_double
+# ifndef irqsafe_cpu_cmpxchg_double_1
+#  define irqsafe_cpu_cmpxchg_double_1(pcp1, pcp2, oval1, oval2, nval1, nval2)	\
+	irqsafe_generic_cpu_cmpxchg_double(pcp1, pcp2, oval1, oval2, nval1, nval2)
+# endif
+# ifndef irqsafe_cpu_cmpxchg_double_2
+#  define irqsafe_cpu_cmpxchg_double_2(pcp1, pcp2, oval1, oval2, nval1, nval2)	\
+	irqsafe_generic_cpu_cmpxchg_double(pcp1, pcp2, oval1, oval2, nval1, nval2)
+# endif
+# ifndef irqsafe_cpu_cmpxchg_double_4
+#  define irqsafe_cpu_cmpxchg_double_4(pcp1, pcp2, oval1, oval2, nval1, nval2)	\
+	irqsafe_generic_cpu_cmpxchg_double(pcp1, pcp2, oval1, oval2, nval1, nval2)
+# endif
+# ifndef irqsafe_cpu_cmpxchg_double_8
+#  define irqsafe_cpu_cmpxchg_double_8(pcp1, pcp2, oval1, oval2, nval1, nval2)	\
+	irqsafe_generic_cpu_cmpxchg_double(pcp1, pcp2, oval1, oval2, nval1, nval2)
+# endif
+# define irqsafe_cpu_cmpxchg_double(pcp1, pcp2, oval1, oval2, nval1, nval2)	\
+	__pcpu_double_call_return_int(irqsafe_cpu_cmpxchg_double_, (pcp1), (pcp2),	\
+		oval1, oval2, nval1, nval2)
+#endif
+
 #endif /* __LINUX_PERCPU_H */



* [cpuops cmpxchg double V3 4/5] Lockless (and preemptless) fastpaths for slub
  2011-02-25 17:38 [cpuops cmpxchg double V3 0/5] this_cpu_cmpxchg_double support Christoph Lameter
                   ` (2 preceding siblings ...)
  2011-02-25 17:38 ` [cpuops cmpxchg double V3 3/5] Generic support for this_cpu_cmpxchg_double Christoph Lameter
@ 2011-02-25 17:38 ` Christoph Lameter
  2011-02-25 18:21   ` Mathieu Desnoyers
  2011-02-25 17:38 ` [cpuops cmpxchg double V3 5/5] x86: this_cpu_cmpxchg_double() support Christoph Lameter
  2011-02-28 10:36 ` [cpuops cmpxchg double V3 0/5] this_cpu_cmpxchg_double support Tejun Heo
  5 siblings, 1 reply; 19+ messages in thread
From: Christoph Lameter @ 2011-02-25 17:38 UTC (permalink / raw)
  To: Tejun Heo
  Cc: akpm, Pekka Enberg, linux-kernel, Eric Dumazet, H. Peter Anvin,
	Mathieu Desnoyers

[-- Attachment #1: cpuops_double_slub_fastpath --]
[-- Type: text/plain, Size: 12792 bytes --]

Use the this_cpu_cmpxchg_double functionality to implement a lockless
allocation algorithm on arches that support fast this_cpu_ops.

Each of the per cpu pointers is paired with a transaction id that ensures
that updates of the per cpu information can only occur in sequence on
a certain cpu.

A transaction id is a "long" integer composed of an event number and the
cpu number. The event number is incremented for every change to the per
cpu state. The cmpxchg instruction can therefore verify that nothing
interfered with the update, that we are updating the percpu structure of
the processor where we picked up the information, and that we are still
on that processor when the update takes place.

This results in a significant decrease of the overhead in the fastpaths.
It also makes it easy to adopt the fastpaths for realtime kernels since
they are lockless and do not require that the same per cpu area be used
throughout the critical section. It only matters that the per cpu area
is current at the beginning and at the end of the critical section.

So there is not even a need to disable preemption.

Test results show that the fastpath cycle count is reduced by up to ~ 40%
(alloc/free test goes from ~140 cycles down to ~80). The slowpath for kfree
adds a few cycles.

Sadly this does nothing for the slowpath, which is where the main
performance issues in slub are, but best case performance rises
significantly. (For that, see the more complex slub patches that require
cmpxchg_double.)

Kmalloc: alloc/free test

Before:

10000 times kmalloc(8)/kfree -> 134 cycles
10000 times kmalloc(16)/kfree -> 152 cycles
10000 times kmalloc(32)/kfree -> 144 cycles
10000 times kmalloc(64)/kfree -> 142 cycles
10000 times kmalloc(128)/kfree -> 142 cycles
10000 times kmalloc(256)/kfree -> 132 cycles
10000 times kmalloc(512)/kfree -> 132 cycles
10000 times kmalloc(1024)/kfree -> 135 cycles
10000 times kmalloc(2048)/kfree -> 135 cycles
10000 times kmalloc(4096)/kfree -> 135 cycles
10000 times kmalloc(8192)/kfree -> 144 cycles
10000 times kmalloc(16384)/kfree -> 754 cycles

After:

10000 times kmalloc(8)/kfree -> 78 cycles
10000 times kmalloc(16)/kfree -> 78 cycles
10000 times kmalloc(32)/kfree -> 82 cycles
10000 times kmalloc(64)/kfree -> 88 cycles
10000 times kmalloc(128)/kfree -> 79 cycles
10000 times kmalloc(256)/kfree -> 79 cycles
10000 times kmalloc(512)/kfree -> 85 cycles
10000 times kmalloc(1024)/kfree -> 82 cycles
10000 times kmalloc(2048)/kfree -> 82 cycles
10000 times kmalloc(4096)/kfree -> 85 cycles
10000 times kmalloc(8192)/kfree -> 82 cycles
10000 times kmalloc(16384)/kfree -> 706 cycles


Kmalloc: Repeatedly allocate then free test

Before:

10000 times kmalloc(8) -> 211 cycles kfree -> 113 cycles
10000 times kmalloc(16) -> 174 cycles kfree -> 115 cycles
10000 times kmalloc(32) -> 235 cycles kfree -> 129 cycles
10000 times kmalloc(64) -> 222 cycles kfree -> 120 cycles
10000 times kmalloc(128) -> 343 cycles kfree -> 139 cycles
10000 times kmalloc(256) -> 827 cycles kfree -> 147 cycles
10000 times kmalloc(512) -> 1048 cycles kfree -> 272 cycles
10000 times kmalloc(1024) -> 2043 cycles kfree -> 528 cycles
10000 times kmalloc(2048) -> 4002 cycles kfree -> 571 cycles
10000 times kmalloc(4096) -> 7740 cycles kfree -> 628 cycles
10000 times kmalloc(8192) -> 8062 cycles kfree -> 850 cycles
10000 times kmalloc(16384) -> 8895 cycles kfree -> 1249 cycles

After:

10000 times kmalloc(8) -> 190 cycles kfree -> 129 cycles
10000 times kmalloc(16) -> 76 cycles kfree -> 123 cycles
10000 times kmalloc(32) -> 126 cycles kfree -> 124 cycles
10000 times kmalloc(64) -> 181 cycles kfree -> 128 cycles
10000 times kmalloc(128) -> 310 cycles kfree -> 140 cycles
10000 times kmalloc(256) -> 809 cycles kfree -> 165 cycles
10000 times kmalloc(512) -> 1005 cycles kfree -> 269 cycles
10000 times kmalloc(1024) -> 1999 cycles kfree -> 527 cycles
10000 times kmalloc(2048) -> 3967 cycles kfree -> 570 cycles
10000 times kmalloc(4096) -> 7658 cycles kfree -> 637 cycles
10000 times kmalloc(8192) -> 8111 cycles kfree -> 859 cycles
10000 times kmalloc(16384) -> 8791 cycles kfree -> 1173 cycles

Signed-off-by: Christoph Lameter <cl@linux.com>

---
 include/linux/slub_def.h |    5 -
 mm/slub.c                |  205 ++++++++++++++++++++++++++++++++++++++++++++++-
 2 files changed, 207 insertions(+), 3 deletions(-)

Index: linux-2.6/include/linux/slub_def.h
===================================================================
--- linux-2.6.orig/include/linux/slub_def.h	2011-02-25 10:45:49.000000000 -0600
+++ linux-2.6/include/linux/slub_def.h	2011-02-25 10:46:19.000000000 -0600
@@ -35,7 +35,10 @@ enum stat_item {
 	NR_SLUB_STAT_ITEMS };
 
 struct kmem_cache_cpu {
-	void **freelist;	/* Pointer to first free per cpu object */
+	void **freelist;	/* Pointer to next available object */
+#ifdef CONFIG_CMPXCHG_LOCAL
+	unsigned long tid;	/* Globally unique transaction id */
+#endif
 	struct page *page;	/* The slab from which we are allocating */
 	int node;		/* The node of the page (or -1 for debug) */
 #ifdef CONFIG_SLUB_STATS
Index: linux-2.6/mm/slub.c
===================================================================
--- linux-2.6.orig/mm/slub.c	2011-02-25 10:46:00.000000000 -0600
+++ linux-2.6/mm/slub.c	2011-02-25 10:46:57.000000000 -0600
@@ -1494,6 +1494,77 @@ static void unfreeze_slab(struct kmem_ca
 	}
 }
 
+#ifdef CONFIG_CMPXCHG_LOCAL
+#ifdef CONFIG_PREEMPT
+/*
+ * Calculate the next globally unique transaction id for disambiguation
+ * during cmpxchg. The transaction ids start with the cpu number and are
+ * then incremented in steps of roundup_pow_of_two(CONFIG_NR_CPUS).
+ */
+#define TID_STEP  roundup_pow_of_two(CONFIG_NR_CPUS)
+#else
+/*
+ * No preemption supported, therefore there is also no need to check
+ * for different cpus.
+ */
+#define TID_STEP 1
+#endif
+
+static inline unsigned long next_tid(unsigned long tid)
+{
+	return tid + TID_STEP;
+}
+
+static inline unsigned int tid_to_cpu(unsigned long tid)
+{
+	return tid % TID_STEP;
+}
+
+static inline unsigned long tid_to_event(unsigned long tid)
+{
+	return tid / TID_STEP;
+}
+
+static inline unsigned int init_tid(int cpu)
+{
+	return cpu;
+}
+
+static inline void note_cmpxchg_failure(const char *n,
+		const struct kmem_cache *s, unsigned long tid)
+{
+#ifdef SLUB_DEBUG_CMPXCHG
+	unsigned long actual_tid = __this_cpu_read(s->cpu_slab->tid);
+
+	printk(KERN_INFO "%s %s: cmpxchg redo ", n, s->name);
+
+#ifdef CONFIG_PREEMPT
+	if (tid_to_cpu(tid) != tid_to_cpu(actual_tid))
+		printk("due to cpu change %d -> %d\n",
+			tid_to_cpu(tid), tid_to_cpu(actual_tid));
+	else
+#endif
+	if (tid_to_event(tid) != tid_to_event(actual_tid))
+		printk("due to cpu running other code. Event %ld->%ld\n",
+			tid_to_event(tid), tid_to_event(actual_tid));
+	else
+		printk("for unknown reason: actual=%lx was=%lx target=%lx\n",
+			actual_tid, tid, next_tid(tid));
+#endif
+}
+
+#endif
+
+void init_kmem_cache_cpus(struct kmem_cache *s)
+{
+#if defined(CONFIG_CMPXCHG_LOCAL) && defined(CONFIG_PREEMPT)
+	int cpu;
+
+	for_each_possible_cpu(cpu)
+		per_cpu_ptr(s->cpu_slab, cpu)->tid = init_tid(cpu);
+#endif
+
+}
 /*
  * Remove the cpu slab
  */
@@ -1525,6 +1596,9 @@ static void deactivate_slab(struct kmem_
 		page->inuse--;
 	}
 	c->page = NULL;
+#ifdef CONFIG_CMPXCHG_LOCAL
+	c->tid = next_tid(c->tid);
+#endif
 	unfreeze_slab(s, page, tail);
 }
 
@@ -1659,6 +1733,19 @@ static void *__slab_alloc(struct kmem_ca
 {
 	void **object;
 	struct page *new;
+#ifdef CONFIG_CMPXCHG_LOCAL
+	unsigned long flags;
+
+	local_irq_save(flags);
+#ifdef CONFIG_PREEMPT
+	/*
+	 * We may have been preempted and rescheduled on a different
+	 * cpu before disabling interrupts. Need to reload cpu area
+	 * pointer.
+	 */
+	c = this_cpu_ptr(s->cpu_slab);
+#endif
+#endif
 
 	/* We handle __GFP_ZERO in the caller */
 	gfpflags &= ~__GFP_ZERO;
@@ -1685,6 +1772,10 @@ load_freelist:
 	c->node = page_to_nid(c->page);
 unlock_out:
 	slab_unlock(c->page);
+#ifdef CONFIG_CMPXCHG_LOCAL
+	c->tid = next_tid(c->tid);
+	local_irq_restore(flags);
+#endif
 	stat(s, ALLOC_SLOWPATH);
 	return object;
 
@@ -1746,23 +1837,76 @@ static __always_inline void *slab_alloc(
 {
 	void **object;
 	struct kmem_cache_cpu *c;
+#ifdef CONFIG_CMPXCHG_LOCAL
+	unsigned long tid;
+#else
 	unsigned long flags;
+#endif
 
 	if (slab_pre_alloc_hook(s, gfpflags))
 		return NULL;
 
+#ifndef CONFIG_CMPXCHG_LOCAL
 	local_irq_save(flags);
+#else
+redo:
+#endif
+
+	/*
+	 * Must read kmem_cache cpu data via this cpu ptr. Preemption is
+	 * enabled. We may switch back and forth between cpus while
+	 * reading from one cpu area. That does not matter as long
+	 * as we end up on the original cpu again when doing the cmpxchg.
+	 */
 	c = __this_cpu_ptr(s->cpu_slab);
+
+#ifdef CONFIG_CMPXCHG_LOCAL
+	/*
+	 * The transaction ids are globally unique per cpu and per operation on
+	 * a per cpu queue. Thus it can be guaranteed that the cmpxchg_double
+	 * occurs on the right processor and that there was no operation on the
+	 * linked list in between.
+	 */
+	tid = c->tid;
+	barrier();
+#endif
+
 	object = c->freelist;
 	if (unlikely(!object || !node_match(c, node)))
 
 		object = __slab_alloc(s, gfpflags, node, addr, c);
 
 	else {
+#ifdef CONFIG_CMPXCHG_LOCAL
+		/*
+		 * The cmpxchg will only match if there was no additional
+		 * operation and if we are on the right processor.
+		 *
+		 * The cmpxchg does the following atomically (without lock semantics!)
+		 * 1. Relocate first pointer to the current per cpu area.
+		 * 2. Verify that tid and freelist have not been changed
+		 * 3. If they were not changed replace tid and freelist
+		 *
+		 * Since this is without lock semantics the protection is only against
+		 * code executing on this cpu *not* from access by other cpus.
+		 */
+		if (unlikely(!this_cpu_cmpxchg_double(
+				s->cpu_slab->freelist, s->cpu_slab->tid,
+				object, tid,
+				get_freepointer(s, object), next_tid(tid)))) {
+
+			note_cmpxchg_failure("slab_alloc", s, tid);
+			goto redo;
+		}
+#else
 		c->freelist = get_freepointer(s, object);
+#endif
 		stat(s, ALLOC_FASTPATH);
 	}
+
+#ifndef CONFIG_CMPXCHG_LOCAL
 	local_irq_restore(flags);
+#endif
 
 	if (unlikely(gfpflags & __GFP_ZERO) && object)
 		memset(object, 0, s->objsize);
@@ -1840,9 +1984,13 @@ static void __slab_free(struct kmem_cach
 {
 	void *prior;
 	void **object = (void *)x;
+#ifdef CONFIG_CMPXCHG_LOCAL
+	unsigned long flags;
 
-	stat(s, FREE_SLOWPATH);
+	local_irq_save(flags);
+#endif
 	slab_lock(page);
+	stat(s, FREE_SLOWPATH);
 
 	if (kmem_cache_debug(s))
 		goto debug;
@@ -1872,6 +2020,9 @@ checks_ok:
 
 out_unlock:
 	slab_unlock(page);
+#ifdef CONFIG_CMPXCHG_LOCAL
+	local_irq_restore(flags);
+#endif
 	return;
 
 slab_empty:
@@ -1883,6 +2034,9 @@ slab_empty:
 		stat(s, FREE_REMOVE_PARTIAL);
 	}
 	slab_unlock(page);
+#ifdef CONFIG_CMPXCHG_LOCAL
+	local_irq_restore(flags);
+#endif
 	stat(s, FREE_SLAB);
 	discard_slab(s, page);
 	return;
@@ -1909,21 +2063,54 @@ static __always_inline void slab_free(st
 {
 	void **object = (void *)x;
 	struct kmem_cache_cpu *c;
+#ifdef CONFIG_CMPXCHG_LOCAL
+	unsigned long tid;
+#else
 	unsigned long flags;
+#endif
 
 	slab_free_hook(s, x);
 
+#ifndef CONFIG_CMPXCHG_LOCAL
 	local_irq_save(flags);
+#endif
+
+redo:
+	/*
+	 * Determine the current cpu's per cpu slab.
+	 * The cpu may change afterwards. However that does not matter since
+	 * data is retrieved via this pointer. If we are on the same cpu
+	 * during the cmpxchg then the free will succeed.
+	 */
 	c = __this_cpu_ptr(s->cpu_slab);
 
+#ifdef CONFIG_CMPXCHG_LOCAL
+	tid = c->tid;
+	barrier();
+#endif
+
 	if (likely(page == c->page && c->node != NUMA_NO_NODE)) {
 		set_freepointer(s, object, c->freelist);
+
+#ifdef CONFIG_CMPXCHG_LOCAL
+		if (unlikely(!this_cpu_cmpxchg_double(
+				s->cpu_slab->freelist, s->cpu_slab->tid,
+				c->freelist, tid,
+				object, next_tid(tid)))) {
+
+			note_cmpxchg_failure("slab_free", s, tid);
+			goto redo;
+		}
+#else
 		c->freelist = object;
+#endif
 		stat(s, FREE_FASTPATH);
 	} else
 		__slab_free(s, page, x, addr);
 
+#ifndef CONFIG_CMPXCHG_LOCAL
 	local_irq_restore(flags);
+#endif
 }
 
 void kmem_cache_free(struct kmem_cache *s, void *x)
@@ -2115,9 +2302,23 @@ static inline int alloc_kmem_cache_cpus(
 	BUILD_BUG_ON(PERCPU_DYNAMIC_EARLY_SIZE <
 			SLUB_PAGE_SHIFT * sizeof(struct kmem_cache_cpu));
 
+#ifdef CONFIG_CMPXCHG_LOCAL
+	/*
+	 * Must align to double word boundary for the double cmpxchg instructions
+	 * to work.
+	 */
+	s->cpu_slab = __alloc_percpu(sizeof(struct kmem_cache_cpu), 2 * sizeof(void *));
+#else
+	/* Regular alignment is sufficient */
 	s->cpu_slab = alloc_percpu(struct kmem_cache_cpu);
+#endif
+
+	if (!s->cpu_slab)
+		return 0;
 
-	return s->cpu_slab != NULL;
+	init_kmem_cache_cpus(s);
+
+	return 1;
 }
 
 static struct kmem_cache *kmem_cache_node;



* [cpuops cmpxchg double V3 5/5] x86: this_cpu_cmpxchg_double() support
  2011-02-25 17:38 [cpuops cmpxchg double V3 0/5] this_cpu_cmpxchg_double support Christoph Lameter
                   ` (3 preceding siblings ...)
  2011-02-25 17:38 ` [cpuops cmpxchg double V3 4/5] Lockless (and preemptless) fastpaths for slub Christoph Lameter
@ 2011-02-25 17:38 ` Christoph Lameter
  2011-02-28 10:23   ` [PATCH] percpu, x86: Add arch-specific " Tejun Heo
  2011-02-28 10:36 ` [cpuops cmpxchg double V3 0/5] this_cpu_cmpxchg_double support Tejun Heo
  5 siblings, 1 reply; 19+ messages in thread
From: Christoph Lameter @ 2011-02-25 17:38 UTC (permalink / raw)
  To: Tejun Heo
  Cc: akpm, Pekka Enberg, linux-kernel, Eric Dumazet, H. Peter Anvin,
	Mathieu Desnoyers

[-- Attachment #1: cpuops_double_x86 --]
[-- Type: text/plain, Size: 5262 bytes --]

Support this_cpu_cmpxchg_double using the cmpxchg16b and cmpxchg8b instructions.

Signed-off-by: Christoph Lameter <cl@linux.com>

---
 arch/x86/include/asm/percpu.h |   48 ++++++++++++++++++++++++++++++++
 arch/x86/lib/Makefile         |    1 
 arch/x86/lib/cmpxchg16b_emu.S |   62 ++++++++++++++++++++++++++++++++++++++++++
 3 files changed, 111 insertions(+)

Index: linux-2.6/arch/x86/include/asm/percpu.h
===================================================================
--- linux-2.6.orig/arch/x86/include/asm/percpu.h	2011-02-25 10:45:46.000000000 -0600
+++ linux-2.6/arch/x86/include/asm/percpu.h	2011-02-25 10:46:25.000000000 -0600
@@ -451,6 +451,26 @@ do {									\
 #define irqsafe_cpu_cmpxchg_4(pcp, oval, nval)	percpu_cmpxchg_op(pcp, oval, nval)
 #endif /* !CONFIG_M386 */
 
+#ifdef CONFIG_X86_CMPXCHG64
+#define percpu_cmpxchg8b_double(pcp1, o1, o2, n1, n2)			\
+({									\
+	char __ret;							\
+	typeof(o1) __o1 = o1;						\
+	typeof(o1) __n1 = n1;						\
+	typeof(o2) __o2 = o2;						\
+	typeof(o2) __n2 = n2;						\
+	typeof(o2) __dummy = n2;					\
+	asm volatile("cmpxchg8b "__percpu_arg(1)"\n\tsetz %0\n\t"		\
+		    : "=a"(__ret), "=m" (pcp1), "=d"(__dummy)		\
+		    :  "b"(__n1), "c"(__n2), "a"(__o1), "d"(__o2));	\
+	__ret;								\
+})
+
+#define __this_cpu_cmpxchg_double_4(pcp1, pcp2, o1, o2, n1, n2) percpu_cmpxchg8b_double(pcp1, o1, o2, n1, n2)
+#define this_cpu_cmpxchg_double_4(pcp1, pcp2, o1, o2, n1, n2)	percpu_cmpxchg8b_double(pcp1, o1, o2, n1, n2)
+#define irqsafe_cpu_cmpxchg_double_4(pcp1, pcp2, o1, o2, n1, n2)	percpu_cmpxchg8b_double(pcp1, o1, o2, n1, n2)
+#endif /* CONFIG_X86_CMPXCHG64 */
+
 /*
  * Per cpu atomic 64 bit operations are only available under 64 bit.
  * 32 bit must fall back to generic operations.
@@ -480,6 +500,34 @@ do {									\
 #define irqsafe_cpu_xor_8(pcp, val)	percpu_to_op("xor", (pcp), val)
 #define irqsafe_cpu_xchg_8(pcp, nval)	percpu_xchg_op(pcp, nval)
 #define irqsafe_cpu_cmpxchg_8(pcp, oval, nval)	percpu_cmpxchg_op(pcp, oval, nval)
+
+/*
+ * Pretty complex macro to generate the cmpxchg16b instruction. The
+ * instruction is not supported on early AMD64 processors so we must be
+ * able to emulate it in software. The address used in the cmpxchg16b
+ * instruction must be aligned to a 16 byte boundary.
+ */
+#define percpu_cmpxchg16b(pcp1, o1, o2, n1, n2)				\
+({									\
+	char __ret;							\
+	typeof(o1) __o1 = o1;						\
+	typeof(o1) __n1 = n1;						\
+	typeof(o2) __o2 = o2;						\
+	typeof(o2) __n2 = n2;						\
+	typeof(o2) __dummy;						\
+	alternative_io("call this_cpu_cmpxchg16b_emu\n\t" P6_NOP4,	\
+			"cmpxchg16b %%gs:(%%rsi)\n\tsetz %0\n\t",	\
+			X86_FEATURE_CX16,				\
+		    	ASM_OUTPUT2("=a"(__ret), "=d"(__dummy)),	\
+		        "S" (&pcp1), "b"(__n1), "c"(__n2),		\
+			 "a"(__o1), "d"(__o2));				\
+	__ret;								\
+})
+
+#define __this_cpu_cmpxchg_double_8(pcp1, pcp2, o1, o2, n1, n2) percpu_cmpxchg16b(pcp1, o1, o2, n1, n2)
+#define this_cpu_cmpxchg_double_8(pcp1, pcp2, o1, o2, n1, n2)	percpu_cmpxchg16b(pcp1, o1, o2, n1, n2)
+#define irqsafe_cpu_cmpxchg_double_8(pcp1, pcp2, o1, o2, n1, n2)	percpu_cmpxchg16b(pcp1, o1, o2, n1, n2)
+
 #endif
 
 /* This is not atomic against other CPUs -- CPU preemption needs to be off */
Index: linux-2.6/arch/x86/lib/Makefile
===================================================================
--- linux-2.6.orig/arch/x86/lib/Makefile	2011-02-22 16:13:42.000000000 -0600
+++ linux-2.6/arch/x86/lib/Makefile	2011-02-25 10:46:25.000000000 -0600
@@ -42,4 +42,5 @@ else
         lib-y += memmove_64.o memset_64.o
         lib-y += copy_user_64.o rwlock_64.o copy_user_nocache_64.o
 	lib-$(CONFIG_RWSEM_XCHGADD_ALGORITHM) += rwsem_64.o
+	lib-y += cmpxchg16b_emu.o
 endif
Index: linux-2.6/arch/x86/lib/cmpxchg16b_emu.S
===================================================================
--- /dev/null	1970-01-01 00:00:00.000000000 +0000
+++ linux-2.6/arch/x86/lib/cmpxchg16b_emu.S	2011-02-25 10:46:25.000000000 -0600
@@ -0,0 +1,62 @@
+/*
+ *	This program is free software; you can redistribute it and/or
+ *	modify it under the terms of the GNU General Public License
+ *	as published by the Free Software Foundation; version 2
+ *	of the License.
+ *
+ */
+
+#include <linux/linkage.h>
+#include <asm/alternative-asm.h>
+#include <asm/frame.h>
+#include <asm/dwarf2.h>
+
+
+.text
+
+/*
+ * Inputs:
+ * %rsi : memory location to compare
+ * %rax : low 64 bits of old value
+ * %rdx : high 64 bits of old value
+ * %rbx : low 64 bits of new value
+ * %rcx : high 64 bits of new value
+ * %al  : Operation successful
+ */
+ENTRY(this_cpu_cmpxchg16b_emu)
+CFI_STARTPROC
+
+#
+# Emulate 'cmpxchg16b %gs:(%rsi)' except we return the result in
+# %al, not via the ZF. The caller will access %al to get the result.
+#
+# Note that this is only useful for a cpuops operation: we do *not*
+# have a fully atomic operation, just an operation that is *atomic*
+# on a single cpu (as provided by the this_cpu_xx class of macros).
+#
+	pushf
+	cli
+
+	cmpq  %gs:(%rsi), %rax
+	jne not_same
+	cmpq %gs:8(%rsi), %rdx
+	jne not_same
+
+	movq %rbx,  %gs:(%rsi)
+	movq %rcx, %gs:8(%rsi)
+
+	popf
+	mov $1, %al
+	ret
+
+ not_same:
+	popf
+	xor  %al,%al
+	ret
+
+CFI_ENDPROC
+
+ENDPROC(this_cpu_cmpxchg16b_emu)
+
+


^ permalink raw reply	[flat|nested] 19+ messages in thread

* Re: [cpuops cmpxchg double V3 4/5] Lockless (and preemptless) fastpaths for slub
  2011-02-25 17:38 ` [cpuops cmpxchg double V3 4/5] Lockless (and preemptless) fastpaths for slub Christoph Lameter
@ 2011-02-25 18:21   ` Mathieu Desnoyers
  2011-02-25 20:46     ` Christoph Lameter
  0 siblings, 1 reply; 19+ messages in thread
From: Mathieu Desnoyers @ 2011-02-25 18:21 UTC (permalink / raw)
  To: Christoph Lameter
  Cc: Tejun Heo, akpm, Pekka Enberg, linux-kernel, Eric Dumazet,
	H. Peter Anvin

* Christoph Lameter (cl@linux.com) wrote:
> Use the this_cpu_cmpxchg_double functionality to implement a lockless
> allocation algorithm on arches that support fast this_cpu_ops.
> 
> Each of the per cpu pointers is paired with a transaction id that ensures
> that updates of the per cpu information can only occur in sequence on
> a certain cpu.
> 
> A transaction id is a "long" integer that is comprised of an event number
> and the cpu number. The event number is incremented for every change to the
> per cpu state. This means that the cmpxchg instruction can verify that
> nothing interfered with an update, that we are updating the percpu structure
> of the processor where we picked up the information, and that we are still
> on that processor when the update happens.
> 
> This results in a significant decrease of the overhead in the fastpaths. It
> also makes it easy to adopt the fast path for realtime kernels since this
> is lockless and does not require the use of the current per cpu area
> over the critical section. It is only important that the per cpu area is
> current at the beginning of the critical section and at the end.
> 
> So there is no need even to disable preemption.
> 
> Test results show that the fastpath cycle count is reduced by up to ~ 40%
> (alloc/free test goes from ~140 cycles down to ~80). The slowpath for kfree
> adds a few cycles.
> 
> Sadly this does nothing for the slowpath, which is where the main performance
> issues in slub are, but the best case performance rises significantly.
> (For that see the more complex slub patches that require cmpxchg_double)
> 
> Kmalloc: alloc/free test
> 
> Before:
> 
> 10000 times kmalloc(8)/kfree -> 134 cycles
> 10000 times kmalloc(16)/kfree -> 152 cycles
> 10000 times kmalloc(32)/kfree -> 144 cycles
> 10000 times kmalloc(64)/kfree -> 142 cycles
> 10000 times kmalloc(128)/kfree -> 142 cycles
> 10000 times kmalloc(256)/kfree -> 132 cycles
> 10000 times kmalloc(512)/kfree -> 132 cycles
> 10000 times kmalloc(1024)/kfree -> 135 cycles
> 10000 times kmalloc(2048)/kfree -> 135 cycles
> 10000 times kmalloc(4096)/kfree -> 135 cycles
> 10000 times kmalloc(8192)/kfree -> 144 cycles
> 10000 times kmalloc(16384)/kfree -> 754 cycles
> 
> After:
> 
> 10000 times kmalloc(8)/kfree -> 78 cycles
> 10000 times kmalloc(16)/kfree -> 78 cycles
> 10000 times kmalloc(32)/kfree -> 82 cycles
> 10000 times kmalloc(64)/kfree -> 88 cycles
> 10000 times kmalloc(128)/kfree -> 79 cycles
> 10000 times kmalloc(256)/kfree -> 79 cycles
> 10000 times kmalloc(512)/kfree -> 85 cycles
> 10000 times kmalloc(1024)/kfree -> 82 cycles
> 10000 times kmalloc(2048)/kfree -> 82 cycles
> 10000 times kmalloc(4096)/kfree -> 85 cycles
> 10000 times kmalloc(8192)/kfree -> 82 cycles
> 10000 times kmalloc(16384)/kfree -> 706 cycles
> 
> 
> Kmalloc: Repeatedly allocate then free test
> 
> Before:
> 
> 10000 times kmalloc(8) -> 211 cycles kfree -> 113 cycles
> 10000 times kmalloc(16) -> 174 cycles kfree -> 115 cycles
> 10000 times kmalloc(32) -> 235 cycles kfree -> 129 cycles
> 10000 times kmalloc(64) -> 222 cycles kfree -> 120 cycles
> 10000 times kmalloc(128) -> 343 cycles kfree -> 139 cycles
> 10000 times kmalloc(256) -> 827 cycles kfree -> 147 cycles
> 10000 times kmalloc(512) -> 1048 cycles kfree -> 272 cycles
> 10000 times kmalloc(1024) -> 2043 cycles kfree -> 528 cycles
> 10000 times kmalloc(2048) -> 4002 cycles kfree -> 571 cycles
> 10000 times kmalloc(4096) -> 7740 cycles kfree -> 628 cycles
> 10000 times kmalloc(8192) -> 8062 cycles kfree -> 850 cycles
> 10000 times kmalloc(16384) -> 8895 cycles kfree -> 1249 cycles
> 
> After:
> 
> 10000 times kmalloc(8) -> 190 cycles kfree -> 129 cycles
> 10000 times kmalloc(16) -> 76 cycles kfree -> 123 cycles
> 10000 times kmalloc(32) -> 126 cycles kfree -> 124 cycles
> 10000 times kmalloc(64) -> 181 cycles kfree -> 128 cycles
> 10000 times kmalloc(128) -> 310 cycles kfree -> 140 cycles
> 10000 times kmalloc(256) -> 809 cycles kfree -> 165 cycles
> 10000 times kmalloc(512) -> 1005 cycles kfree -> 269 cycles
> 10000 times kmalloc(1024) -> 1999 cycles kfree -> 527 cycles
> 10000 times kmalloc(2048) -> 3967 cycles kfree -> 570 cycles
> 10000 times kmalloc(4096) -> 7658 cycles kfree -> 637 cycles
> 10000 times kmalloc(8192) -> 8111 cycles kfree -> 859 cycles
> 10000 times kmalloc(16384) -> 8791 cycles kfree -> 1173 cycles
> 
> Signed-off-by: Christoph Lameter <cl@linux.com>
> 
> ---
>  include/linux/slub_def.h |    5 -
>  mm/slub.c                |  205 ++++++++++++++++++++++++++++++++++++++++++++++-
>  2 files changed, 207 insertions(+), 3 deletions(-)
> 
> Index: linux-2.6/include/linux/slub_def.h
> ===================================================================
> --- linux-2.6.orig/include/linux/slub_def.h	2011-02-25 10:45:49.000000000 -0600
> +++ linux-2.6/include/linux/slub_def.h	2011-02-25 10:46:19.000000000 -0600
> @@ -35,7 +35,10 @@ enum stat_item {
>  	NR_SLUB_STAT_ITEMS };
>  
>  struct kmem_cache_cpu {
> -	void **freelist;	/* Pointer to first free per cpu object */
> +	void **freelist;	/* Pointer to next available object */
> +#ifdef CONFIG_CMPXCHG_LOCAL
> +	unsigned long tid;	/* Globally unique transaction id */
> +#endif

There seems to be no strong guarantee that freelist is double-word aligned here.
How about:

struct kmem_cache_cpu {
#ifdef CONFIG_CMPXCHG_LOCAL
        struct {
                void **ptr;
                unsigned long tid;
        } __attribute__((aligned(2 * sizeof(long)))) freelist;
#else
        struct {
                void **ptr;
        } freelist;
#endif
        ...

Or if you really don't want to change all the code that touches freelist, we
could maybe go for:

struct kmem_cache_cpu {
#ifdef CONFIG_CMPXCHG_LOCAL
        void ** __attribute__((aligned(2 * sizeof(long)))) freelist;
        unsigned long tid;
#else
        void **freelist;
#endif
        ...

(code above untested)

Thoughts ?

Mathieu

>  	struct page *page;	/* The slab from which we are allocating */
>  	int node;		/* The node of the page (or -1 for debug) */
>  #ifdef CONFIG_SLUB_STATS
> Index: linux-2.6/mm/slub.c
> ===================================================================
> --- linux-2.6.orig/mm/slub.c	2011-02-25 10:46:00.000000000 -0600
> +++ linux-2.6/mm/slub.c	2011-02-25 10:46:57.000000000 -0600
> @@ -1494,6 +1494,77 @@ static void unfreeze_slab(struct kmem_ca
>  	}
>  }
>  
> +#ifdef CONFIG_CMPXCHG_LOCAL
> +#ifdef CONFIG_PREEMPT
> +/*
> + * Calculate the next globally unique transaction id for disambiguation
> + * during cmpxchg. The transactions start with the cpu number and are then
> + * incremented by CONFIG_NR_CPUS.
> + */
> +#define TID_STEP  roundup_pow_of_two(CONFIG_NR_CPUS)
> +#else
> +/*
> + * No preemption is supported, therefore there is also no need to
> + * check for different cpus.
> + */
> +#define TID_STEP 1
> +#endif
> +
> +static inline unsigned long next_tid(unsigned long tid)
> +{
> +	return tid + TID_STEP;
> +}
> +
> +static inline unsigned int tid_to_cpu(unsigned long tid)
> +{
> +	return tid % TID_STEP;
> +}
> +
> +static inline unsigned long tid_to_event(unsigned long tid)
> +{
> +	return tid / TID_STEP;
> +}
> +
> +static inline unsigned int init_tid(int cpu)
> +{
> +	return cpu;
> +}
> +
> +static inline void note_cmpxchg_failure(const char *n,
> +		const struct kmem_cache *s, unsigned long tid)
> +{
> +#ifdef SLUB_DEBUG_CMPXCHG
> +	unsigned long actual_tid = __this_cpu_read(s->cpu_slab->tid);
> +
> +	printk(KERN_INFO "%s %s: cmpxchg redo ", n, s->name);
> +
> +#ifdef CONFIG_PREEMPT
> +	if (tid_to_cpu(tid) != tid_to_cpu(actual_tid))
> +		printk("due to cpu change %d -> %d\n",
> +			tid_to_cpu(tid), tid_to_cpu(actual_tid));
> +	else
> +#endif
> +	if (tid_to_event(tid) != tid_to_event(actual_tid))
> +		printk("due to cpu running other code. Event %ld->%ld\n",
> +			tid_to_event(tid), tid_to_event(actual_tid));
> +	else
> +		printk("for unknown reason: actual=%lx was=%lx target=%lx\n",
> +			actual_tid, tid, next_tid(tid));
> +#endif
> +}
> +
> +#endif
> +
> +void init_kmem_cache_cpus(struct kmem_cache *s)
> +{
> +#if defined(CONFIG_CMPXCHG_LOCAL) && defined(CONFIG_PREEMPT)
> +	int cpu;
> +
> +	for_each_possible_cpu(cpu)
> +		per_cpu_ptr(s->cpu_slab, cpu)->tid = init_tid(cpu);
> +#endif
> +
> +}
>  /*
>   * Remove the cpu slab
>   */
> @@ -1525,6 +1596,9 @@ static void deactivate_slab(struct kmem_
>  		page->inuse--;
>  	}
>  	c->page = NULL;
> +#ifdef CONFIG_CMPXCHG_LOCAL
> +	c->tid = next_tid(c->tid);
> +#endif
>  	unfreeze_slab(s, page, tail);
>  }
>  
> @@ -1659,6 +1733,19 @@ static void *__slab_alloc(struct kmem_ca
>  {
>  	void **object;
>  	struct page *new;
> +#ifdef CONFIG_CMPXCHG_LOCAL
> +	unsigned long flags;
> +
> +	local_irq_save(flags);
> +#ifdef CONFIG_PREEMPT
> +	/*
> +	 * We may have been preempted and rescheduled on a different
> +	 * cpu before disabling interrupts. Need to reload cpu area
> +	 * pointer.
> +	 */
> +	c = this_cpu_ptr(s->cpu_slab);
> +#endif
> +#endif
>  
>  	/* We handle __GFP_ZERO in the caller */
>  	gfpflags &= ~__GFP_ZERO;
> @@ -1685,6 +1772,10 @@ load_freelist:
>  	c->node = page_to_nid(c->page);
>  unlock_out:
>  	slab_unlock(c->page);
> +#ifdef CONFIG_CMPXCHG_LOCAL
> +	c->tid = next_tid(c->tid);
> +	local_irq_restore(flags);
> +#endif
>  	stat(s, ALLOC_SLOWPATH);
>  	return object;
>  
> @@ -1746,23 +1837,76 @@ static __always_inline void *slab_alloc(
>  {
>  	void **object;
>  	struct kmem_cache_cpu *c;
> +#ifdef CONFIG_CMPXCHG_LOCAL
> +	unsigned long tid;
> +#else
>  	unsigned long flags;
> +#endif
>  
>  	if (slab_pre_alloc_hook(s, gfpflags))
>  		return NULL;
>  
> +#ifndef CONFIG_CMPXCHG_LOCAL
>  	local_irq_save(flags);
> +#else
> +redo:
> +#endif
> +
> +	/*
> +	 * Must read kmem_cache cpu data via this cpu ptr. Preemption is
> +	 * enabled. We may switch back and forth between cpus while
> +	 * reading from one cpu area. That does not matter as long
> +	 * as we end up on the original cpu again when doing the cmpxchg.
> +	 */
>  	c = __this_cpu_ptr(s->cpu_slab);
> +
> +#ifdef CONFIG_CMPXCHG_LOCAL
> +	/*
> +	 * The transaction ids are globally unique per cpu and per operation on
> +	 * a per cpu queue. Thus they guarantee that the cmpxchg_double
> +	 * occurs on the right processor and that there was no operation on the
> +	 * linked list in between.
> +	 */
> +	tid = c->tid;
> +	barrier();
> +#endif
> +
>  	object = c->freelist;
>  	if (unlikely(!object || !node_match(c, node)))
>  
>  		object = __slab_alloc(s, gfpflags, node, addr, c);
>  
>  	else {
> +#ifdef CONFIG_CMPXCHG_LOCAL
> +		/*
> +		 * The cmpxchg will only match if there was no additional
> +		 * operation and if we are on the right processor.
> +		 *
> +		 * The cmpxchg does the following atomically (without lock semantics!)
> +		 * 1. Relocate first pointer to the current per cpu area.
> +		 * 2. Verify that tid and freelist have not been changed
> +		 * 3. If they were not changed replace tid and freelist
> +		 *
> +		 * Since this is without lock semantics the protection is only against
> +		 * code executing on this cpu *not* from access by other cpus.
> +		 */
> +		if (unlikely(!this_cpu_cmpxchg_double(
> +				s->cpu_slab->freelist, s->cpu_slab->tid,
> +				object, tid,
> +				get_freepointer(s, object), next_tid(tid)))) {
> +
> +			note_cmpxchg_failure("slab_alloc", s, tid);
> +			goto redo;
> +		}
> +#else
>  		c->freelist = get_freepointer(s, object);
> +#endif
>  		stat(s, ALLOC_FASTPATH);
>  	}
> +
> +#ifndef CONFIG_CMPXCHG_LOCAL
>  	local_irq_restore(flags);
> +#endif
>  
>  	if (unlikely(gfpflags & __GFP_ZERO) && object)
>  		memset(object, 0, s->objsize);
> @@ -1840,9 +1984,13 @@ static void __slab_free(struct kmem_cach
>  {
>  	void *prior;
>  	void **object = (void *)x;
> +#ifdef CONFIG_CMPXCHG_LOCAL
> +	unsigned long flags;
>  
> -	stat(s, FREE_SLOWPATH);
> +	local_irq_save(flags);
> +#endif
>  	slab_lock(page);
> +	stat(s, FREE_SLOWPATH);
>  
>  	if (kmem_cache_debug(s))
>  		goto debug;
> @@ -1872,6 +2020,9 @@ checks_ok:
>  
>  out_unlock:
>  	slab_unlock(page);
> +#ifdef CONFIG_CMPXCHG_LOCAL
> +	local_irq_restore(flags);
> +#endif
>  	return;
>  
>  slab_empty:
> @@ -1883,6 +2034,9 @@ slab_empty:
>  		stat(s, FREE_REMOVE_PARTIAL);
>  	}
>  	slab_unlock(page);
> +#ifdef CONFIG_CMPXCHG_LOCAL
> +	local_irq_restore(flags);
> +#endif
>  	stat(s, FREE_SLAB);
>  	discard_slab(s, page);
>  	return;
> @@ -1909,21 +2063,54 @@ static __always_inline void slab_free(st
>  {
>  	void **object = (void *)x;
>  	struct kmem_cache_cpu *c;
> +#ifdef CONFIG_CMPXCHG_LOCAL
> +	unsigned long tid;
> +#else
>  	unsigned long flags;
> +#endif
>  
>  	slab_free_hook(s, x);
>  
> +#ifndef CONFIG_CMPXCHG_LOCAL
>  	local_irq_save(flags);
> +#endif
> +
> +redo:
> +	/*
> +	 * Determine the current cpu's per cpu slab.
> +	 * The cpu may change afterward. However that does not matter since
> +	 * data is retrieved via this pointer. If we are on the same cpu
> +	 * during the cmpxchg then the free will succeed.
> +	 */
>  	c = __this_cpu_ptr(s->cpu_slab);
>  
> +#ifdef CONFIG_CMPXCHG_LOCAL
> +	tid = c->tid;
> +	barrier();
> +#endif
> +
>  	if (likely(page == c->page && c->node != NUMA_NO_NODE)) {
>  		set_freepointer(s, object, c->freelist);
> +
> +#ifdef CONFIG_CMPXCHG_LOCAL
> +		if (unlikely(!this_cpu_cmpxchg_double(
> +				s->cpu_slab->freelist, s->cpu_slab->tid,
> +				c->freelist, tid,
> +				object, next_tid(tid)))) {
> +
> +			note_cmpxchg_failure("slab_free", s, tid);
> +			goto redo;
> +		}
> +#else
>  		c->freelist = object;
> +#endif
>  		stat(s, FREE_FASTPATH);
>  	} else
>  		__slab_free(s, page, x, addr);
>  
> +#ifndef CONFIG_CMPXCHG_LOCAL
>  	local_irq_restore(flags);
> +#endif
>  }
>  
>  void kmem_cache_free(struct kmem_cache *s, void *x)
> @@ -2115,9 +2302,23 @@ static inline int alloc_kmem_cache_cpus(
>  	BUILD_BUG_ON(PERCPU_DYNAMIC_EARLY_SIZE <
>  			SLUB_PAGE_SHIFT * sizeof(struct kmem_cache_cpu));
>  
> +#ifdef CONFIG_CMPXCHG_LOCAL
> +	/*
> +	 * Must align to double word boundary for the double cmpxchg instructions
> +	 * to work.
> +	 */
> +	s->cpu_slab = __alloc_percpu(sizeof(struct kmem_cache_cpu), 2 * sizeof(void *));
> +#else
> +	/* Regular alignment is sufficient */
>  	s->cpu_slab = alloc_percpu(struct kmem_cache_cpu);
> +#endif
> +
> +	if (!s->cpu_slab)
> +		return 0;
>  
> -	return s->cpu_slab != NULL;
> +	init_kmem_cache_cpus(s);
> +
> +	return 1;
>  }
>  
>  static struct kmem_cache *kmem_cache_node;
> 

-- 
Mathieu Desnoyers
Operating System Efficiency R&D Consultant
EfficiOS Inc.
http://www.efficios.com

^ permalink raw reply	[flat|nested] 19+ messages in thread

* Re: [cpuops cmpxchg double V3 2/5] slub: Get rid of slab_free_hook_irq()
  2011-02-25 17:38 ` [cpuops cmpxchg double V3 2/5] slub: Get rid of slab_free_hook_irq() Christoph Lameter
@ 2011-02-25 18:23   ` Mathieu Desnoyers
  0 siblings, 0 replies; 19+ messages in thread
From: Mathieu Desnoyers @ 2011-02-25 18:23 UTC (permalink / raw)
  To: Christoph Lameter
  Cc: Tejun Heo, akpm, Pekka Enberg, linux-kernel, Eric Dumazet,
	H. Peter Anvin

* Christoph Lameter (cl@linux.com) wrote:
> The following patch will make the fastpaths lockless and will no longer
> require interrupts to be disabled. Calling the free hook with irq disabled
> will no longer be possible.
> 
> Move the slab_free_hook_irq() logic into slab_free_hook. Only disable
> interrupts if the features are selected that require callbacks with
> interrupts off and reenable after calls have been made.
> 
> Signed-off-by: Christoph Lameter <cl@linux.com>
> 
> ---
>  mm/slub.c |   29 +++++++++++++++++------------
>  1 file changed, 17 insertions(+), 12 deletions(-)
> 
> Index: linux-2.6/mm/slub.c
> ===================================================================
> --- linux-2.6.orig/mm/slub.c	2011-01-10 12:06:58.000000000 -0600
> +++ linux-2.6/mm/slub.c	2011-01-10 12:07:11.000000000 -0600
> @@ -807,14 +807,24 @@ static inline void slab_post_alloc_hook(
>  static inline void slab_free_hook(struct kmem_cache *s, void *x)
>  {
>  	kmemleak_free_recursive(x, s->flags);
> -}
>  
> -static inline void slab_free_hook_irq(struct kmem_cache *s, void *object)
> -{
> -	kmemcheck_slab_free(s, object, s->objsize);
> -	debug_check_no_locks_freed(object, s->objsize);
> -	if (!(s->flags & SLAB_DEBUG_OBJECTS))
> -		debug_check_no_obj_freed(object, s->objsize);
> +	/*
> +	 * Trouble is that we may no longer disable interupts in the fast path

interrupts

/nitpick ;)

Mathieu

> +	 * So in order to make the debug calls that expect irqs to be
> +	 * disabled we need to disable interrupts temporarily.
> +	 */
> +#if defined(CONFIG_KMEMCHECK) || defined(CONFIG_LOCKDEP)
> +	{
> +		unsigned long flags;
> +
> +		local_irq_save(flags);
> +		kmemcheck_slab_free(s, x, s->objsize);
> +		debug_check_no_locks_freed(x, s->objsize);
> +		if (!(s->flags & SLAB_DEBUG_OBJECTS))
> +			debug_check_no_obj_freed(x, s->objsize);
> +		local_irq_restore(flags);
> +	}
> +#endif
>  }
>  
>  /*
> @@ -1101,9 +1111,6 @@ static inline void slab_post_alloc_hook(
>  
>  static inline void slab_free_hook(struct kmem_cache *s, void *x) {}
>  
> -static inline void slab_free_hook_irq(struct kmem_cache *s,
> -		void *object) {}
> -
>  #endif /* CONFIG_SLUB_DEBUG */
>  
>  /*
> @@ -1909,8 +1916,6 @@ static __always_inline void slab_free(st
>  	local_irq_save(flags);
>  	c = __this_cpu_ptr(s->cpu_slab);
>  
> -	slab_free_hook_irq(s, x);
> -
>  	if (likely(page == c->page && c->node != NUMA_NO_NODE)) {
>  		set_freepointer(s, object, c->freelist);
>  		c->freelist = object;
> 

-- 
Mathieu Desnoyers
Operating System Efficiency R&D Consultant
EfficiOS Inc.
http://www.efficios.com

^ permalink raw reply	[flat|nested] 19+ messages in thread

* Re: [cpuops cmpxchg double V3 3/5] Generic support for this_cpu_cmpxchg_double
  2011-02-25 17:38 ` [cpuops cmpxchg double V3 3/5] Generic support for this_cpu_cmpxchg_double Christoph Lameter
@ 2011-02-25 18:25   ` Mathieu Desnoyers
  2011-02-25 20:28   ` Steven Rostedt
  2011-02-28 10:22   ` [PATCH] percpu: Generic support for this_cpu_cmpxchg_double() this_cpu_cmpxchg_double Tejun Heo
  2 siblings, 0 replies; 19+ messages in thread
From: Mathieu Desnoyers @ 2011-02-25 18:25 UTC (permalink / raw)
  To: Christoph Lameter
  Cc: Tejun Heo, akpm, Pekka Enberg, linux-kernel, Eric Dumazet,
	H. Peter Anvin

* Christoph Lameter (cl@linux.com) wrote:
> Introduce this_cpu_cmpxchg_double. this_cpu_cmpxchg_double() allows the
> comparision between two consecutive words and replaces them if there is

comparison

/nitpick again ;)

Mathieu

> a match.
> 
> 	bool this_cpu_cmpxchg_double(pcp1, pcp2,
> 		old_word1, old_word2, new_word1, new_word2)
> 
> this_cpu_cmpxchg_double does not return the old value (difficult since
> there are two words) but a boolean indicating if the operation was
> successful.
> 
> The first percpu variable must be double word aligned!
> 
> Signed-off-by: Christoph Lameter <cl@linux.com>
> 
> ---
>  include/linux/percpu.h |  130 +++++++++++++++++++++++++++++++++++++++++++++++++
>  1 file changed, 130 insertions(+)
> 
> Index: linux-2.6/include/linux/percpu.h
> ===================================================================
> --- linux-2.6.orig/include/linux/percpu.h	2011-01-10 10:22:35.000000000 -0600
> +++ linux-2.6/include/linux/percpu.h	2011-01-10 10:26:43.000000000 -0600
> @@ -255,6 +255,29 @@ extern void __bad_size_call_parameter(vo
>  	pscr2_ret__;							\
>  })
>  
> +/*
> + * Special handling for cmpxchg_double. cmpxchg_double is passed two
> + * percpu variables. The first has to be aligned to a double word
> + * boundary and the second has to follow directly thereafter.
> + */
> +#define __pcpu_double_call_return_int(stem, pcp1, pcp2, ...)		\
> +({									\
> +	int ret__;							\
> +	__verify_pcpu_ptr(&pcp1);					\
> +	VM_BUG_ON((unsigned long)(&pcp1) % (2 * sizeof(pcp1)));		\
> +	VM_BUG_ON((unsigned long)(&pcp2) != (unsigned long)(&pcp1) + sizeof(pcp1));\
> +	VM_BUG_ON(sizeof(pcp1) != sizeof(pcp2));			\
> +	switch(sizeof(pcp1)) {						\
> +	case 1: ret__ = stem##1(pcp1, pcp2, __VA_ARGS__);break;		\
> +	case 2: ret__ = stem##2(pcp1, pcp2, __VA_ARGS__);break;		\
> +	case 4: ret__ = stem##4(pcp1, pcp2, __VA_ARGS__);break;		\
> +	case 8: ret__ = stem##8(pcp1, pcp2, __VA_ARGS__);break;		\
> +	default:							\
> +		__bad_size_call_parameter();break;			\
> +	}								\
> +	ret__;								\
> +})
> +
>  #define __pcpu_size_call(stem, variable, ...)				\
>  do {									\
>  	__verify_pcpu_ptr(&(variable));					\
> @@ -318,6 +341,80 @@ do {									\
>  # define this_cpu_read(pcp)	__pcpu_size_call_return(this_cpu_read_, (pcp))
>  #endif
>  
> +/*
> + * cmpxchg_double replaces two adjacent scalars at once. The first two
> + * parameters are per cpu variables which have to be of the same size.
> + * A truth value is returned to indicate success or
> + * failure (since a double register result is difficult to handle).
> + * There is very limited hardware support for these operations. So only certain
> + * sizes may work.
> + */
> +#define __this_cpu_generic_cmpxchg_double(pcp1, pcp2, oval1, oval2, nval1, nval2)	\
> +({									\
> +	int __ret = 0;							\
> +	if (__this_cpu_read(pcp1) == (oval1) &&				\
> +			 __this_cpu_read(pcp2)  == (oval2)) {		\
> +		__this_cpu_write(pcp1, (nval1));			\
> +		__this_cpu_write(pcp2, (nval2));			\
> +		__ret = 1;						\
> +	}								\
> +	(__ret);							\
> +})
> +
> +#ifndef __this_cpu_cmpxchg_double
> +# ifndef __this_cpu_cmpxchg_double_1
> +#  define __this_cpu_cmpxchg_double_1(pcp1, pcp2, oval1, oval2, nval1, nval2)	\
> +	__this_cpu_generic_cmpxchg_double(pcp1, pcp2, oval1, oval2, nval1, nval2)
> +# endif
> +# ifndef __this_cpu_cmpxchg_double_2
> +#  define __this_cpu_cmpxchg_double_2(pcp1, pcp2, oval1, oval2, nval1, nval2)	\
> +	__this_cpu_generic_cmpxchg_double(pcp1, pcp2, oval1, oval2, nval1, nval2)
> +# endif
> +# ifndef __this_cpu_cmpxchg_double_4
> +#  define __this_cpu_cmpxchg_double_4(pcp1, pcp2, oval1, oval2, nval1, nval2)	\
> +	__this_cpu_generic_cmpxchg_double(pcp1, pcp2, oval1, oval2, nval1, nval2)
> +# endif
> +# ifndef __this_cpu_cmpxchg_double_8
> +#  define __this_cpu_cmpxchg_double_8(pcp1, pcp2, oval1, oval2, nval1, nval2)	\
> +	__this_cpu_generic_cmpxchg_double(pcp1, pcp2, oval1, oval2, nval1, nval2)
> +# endif
> +# define __this_cpu_cmpxchg_double(pcp1, pcp2, oval1, oval2, nval1, nval2)	\
> +	__pcpu_double_call_return_int(__this_cpu_cmpxchg_double_, (pcp1), (pcp2),	\
> +					 oval1, oval2, nval1, nval2)
> +#endif
> +
> +#define _this_cpu_generic_cmpxchg_double(pcp1, pcp2, oval1, oval2, nval1, nval2)	\
> +({									\
> +	int ret__;							\
> +	preempt_disable();						\
> +	ret__ = __this_cpu_generic_cmpxchg_double(pcp1, pcp2,		\
> +			oval1, oval2, nval1, nval2);			\
> +	preempt_enable();						\
> +	ret__;								\
> +})
> +
> +#ifndef this_cpu_cmpxchg_double
> +# ifndef this_cpu_cmpxchg_double_1
> +#  define this_cpu_cmpxchg_double_1(pcp1, pcp2, oval1, oval2, nval1, nval2)	\
> +	_this_cpu_generic_cmpxchg_double(pcp1, pcp2, oval1, oval2, nval1, nval2)
> +# endif
> +# ifndef this_cpu_cmpxchg_double_2
> +#  define this_cpu_cmpxchg_double_2(pcp1, pcp2, oval1, oval2, nval1, nval2)	\
> +	_this_cpu_generic_cmpxchg_double(pcp1, pcp2, oval1, oval2, nval1, nval2)
> +# endif
> +# ifndef this_cpu_cmpxchg_double_4
> +#  define this_cpu_cmpxchg_double_4(pcp1, pcp2, oval1, oval2, nval1, nval2)	\
> +	_this_cpu_generic_cmpxchg_double(pcp1, pcp2, oval1, oval2, nval1, nval2)
> +# endif
> +# ifndef this_cpu_cmpxchg_double_8
> +#  define this_cpu_cmpxchg_double_8(pcp1, pcp2, oval1, oval2, nval1, nval2)	\
> +	_this_cpu_generic_cmpxchg_double(pcp1, pcp2, oval1, oval2, nval1, nval2)
> +# endif
> +# define this_cpu_cmpxchg_double(pcp1, pcp2, oval1, oval2, nval1, nval2)	\
> +	__pcpu_double_call_return_int(this_cpu_cmpxchg_double_, (pcp1), (pcp2),	\
> +		oval1, oval2, nval1, nval2)
> +#endif
> +
>  #define _this_cpu_generic_to_op(pcp, val, op)				\
>  do {									\
>  	preempt_disable();						\
> @@ -823,4 +920,37 @@ do {									\
>  	__pcpu_size_call_return2(irqsafe_cpu_cmpxchg_, (pcp), oval, nval)
>  #endif
>  
> +#define irqsafe_generic_cpu_cmpxchg_double(pcp1, pcp2, oval1, oval2, nval1, nval2)	\
> +({									\
> +	int ret__;							\
> +	unsigned long flags;						\
> +	local_irq_save(flags);						\
> +	ret__ = __this_cpu_generic_cmpxchg_double(pcp1, pcp2,		\
> +			oval1, oval2, nval1, nval2);			\
> +	local_irq_restore(flags);					\
> +	ret__;								\
> +})
> +
> +#ifndef irqsafe_cpu_cmpxchg_double
> +# ifndef irqsafe_cpu_cmpxchg_double_1
> +#  define irqsafe_cpu_cmpxchg_double_1(pcp1, pcp2, oval1, oval2, nval1, nval2)	\
> +	irqsafe_generic_cpu_cmpxchg_double(pcp1, pcp2, oval1, oval2, nval1, nval2)
> +# endif
> +# ifndef irqsafe_cpu_cmpxchg_double_2
> +#  define irqsafe_cpu_cmpxchg_double_2(pcp1, pcp2, oval1, oval2, nval1, nval2)	\
> +	irqsafe_generic_cpu_cmpxchg_double(pcp1, pcp2, oval1, oval2, nval1, nval2)
> +# endif
> +# ifndef irqsafe_cpu_cmpxchg_double_4
> +#  define irqsafe_cpu_cmpxchg_double_4(pcp1, pcp2, oval1, oval2, nval1, nval2)	\
> +	irqsafe_generic_cpu_cmpxchg_double(pcp1, pcp2, oval1, oval2, nval1, nval2)
> +# endif
> +# ifndef irqsafe_cpu_cmpxchg_double_8
> +#  define irqsafe_cpu_cmpxchg_double_8(pcp1, pcp2, oval1, oval2, nval1, nval2)	\
> +	irqsafe_generic_cpu_cmpxchg_double(pcp1, pcp2, oval1, oval2, nval1, nval2)
> +# endif
> +# define irqsafe_cpu_cmpxchg_double(pcp1, pcp2, oval1, oval2, nval1, nval2)	\
> +	__pcpu_double_call_return_int(irqsafe_cpu_cmpxchg_double_, (pcp1), (pcp2),	\
> +		oval1, oval2, nval1, nval2)
> +#endif
> +
>  #endif /* __LINUX_PERCPU_H */
> 

-- 
Mathieu Desnoyers
Operating System Efficiency R&D Consultant
EfficiOS Inc.
http://www.efficios.com

^ permalink raw reply	[flat|nested] 19+ messages in thread

* Re: [cpuops cmpxchg double V3 3/5] Generic support for this_cpu_cmpxchg_double
  2011-02-25 17:38 ` [cpuops cmpxchg double V3 3/5] Generic support for this_cpu_cmpxchg_double Christoph Lameter
  2011-02-25 18:25   ` Mathieu Desnoyers
@ 2011-02-25 20:28   ` Steven Rostedt
  2011-02-25 20:44     ` Christoph Lameter
  2011-02-28 10:22   ` [PATCH] percpu: Generic support for this_cpu_cmpxchg_double() this_cpu_cmpxchg_double Tejun Heo
  2 siblings, 1 reply; 19+ messages in thread
From: Steven Rostedt @ 2011-02-25 20:28 UTC (permalink / raw)
  To: Christoph Lameter
  Cc: Tejun Heo, akpm, Pekka Enberg, linux-kernel, Eric Dumazet,
	H. Peter Anvin, Mathieu Desnoyers

On Fri, Feb 25, 2011 at 11:38:53AM -0600, Christoph Lameter wrote:
>  
> +/*
> + * Special handling for cmpxchg_double. cmpxchg_double is passed two
> + * percpu variables. The first has to be aligned to a double word
> + * boundary and the second has to follow directly thereafter.
> + */
> +#define __pcpu_double_call_return_int(stem, pcp1, pcp2, ...)		\
> +({									\
> +	int ret__;							\
> +	__verify_pcpu_ptr(&pcp1);					\
> +	VM_BUG_ON((unsigned long)(&pcp1) % (2 * sizeof(pcp1)));		\
> +	VM_BUG_ON((unsigned long)(&pcp2) != (unsigned long)(&pcp1) + sizeof(pcp1));\
> +	VM_BUG_ON(sizeof(pcp1) != sizeof(pcp2));			\

Since this is a macro, and it looks like all these are constants (sizeof
and addresses), couldn't you just do a BUILD_BUG_ON() instead?

-- Steve


> +	switch(sizeof(pcp1)) {						\
> +	case 1: ret__ = stem##1(pcp1, pcp2, __VA_ARGS__);break;		\
> +	case 2: ret__ = stem##2(pcp1, pcp2, __VA_ARGS__);break;		\
> +	case 4: ret__ = stem##4(pcp1, pcp2, __VA_ARGS__);break;		\
> +	case 8: ret__ = stem##8(pcp1, pcp2, __VA_ARGS__);break;		\
> +	default:							\
> +		__bad_size_call_parameter();break;			\
> +	}								\
> +	ret__;								\
> +})
> +

^ permalink raw reply	[flat|nested] 19+ messages in thread

* Re: [cpuops cmpxchg double V3 3/5] Generic support for this_cpu_cmpxchg_double
  2011-02-25 20:28   ` Steven Rostedt
@ 2011-02-25 20:44     ` Christoph Lameter
  2011-02-25 20:53       ` Steven Rostedt
  0 siblings, 1 reply; 19+ messages in thread
From: Christoph Lameter @ 2011-02-25 20:44 UTC (permalink / raw)
  To: Steven Rostedt
  Cc: Tejun Heo, akpm, Pekka Enberg, linux-kernel, Eric Dumazet,
	H. Peter Anvin, Mathieu Desnoyers

On Fri, 25 Feb 2011, Steven Rostedt wrote:

> > +	VM_BUG_ON((unsigned long)(&pcp1) % (2 * sizeof(pcp1)));		\
> > +	VM_BUG_ON((unsigned long)(&pcp2) != (unsigned long)(&pcp1) + sizeof(pcp1));\
> > +	VM_BUG_ON(sizeof(pcp1) != sizeof(pcp2));			\
>
> Since this is a macro, and it looks like all these are constants (sizeof
> and addresses), couldn't you just do a BUILD_BUG_ON() instead?

The addresses are not constant.

^ permalink raw reply	[flat|nested] 19+ messages in thread

* Re: [cpuops cmpxchg double V3 4/5] Lockless (and preemptless) fastpaths for slub
  2011-02-25 18:21   ` Mathieu Desnoyers
@ 2011-02-25 20:46     ` Christoph Lameter
  2011-02-25 20:56       ` Mathieu Desnoyers
  0 siblings, 1 reply; 19+ messages in thread
From: Christoph Lameter @ 2011-02-25 20:46 UTC (permalink / raw)
  To: Mathieu Desnoyers
  Cc: Tejun Heo, akpm, Pekka Enberg, linux-kernel, Eric Dumazet,
	H. Peter Anvin

On Fri, 25 Feb 2011, Mathieu Desnoyers wrote:

> > Index: linux-2.6/include/linux/slub_def.h
> > ===================================================================
> > --- linux-2.6.orig/include/linux/slub_def.h	2011-02-25 10:45:49.000000000 -0600
> > +++ linux-2.6/include/linux/slub_def.h	2011-02-25 10:46:19.000000000 -0600
> > @@ -35,7 +35,10 @@ enum stat_item {
> >  	NR_SLUB_STAT_ITEMS };
> >
> >  struct kmem_cache_cpu {
> > -	void **freelist;	/* Pointer to first free per cpu object */
> > +	void **freelist;	/* Pointer to next available object */
> > +#ifdef CONFIG_CMPXCHG_LOCAL
> > +	unsigned long tid;	/* Globally unique transaction id */
> > +#endif
>
> > There seems to be no strong guarantee that freelist is double-word aligned here.

The struct kmem_cache_cpu allocation via alloc_percpu() specifies double-word
alignment. See the remainder of the code you quoted:

> > +#ifdef CONFIG_CMPXCHG_LOCAL
> > +	/*
> > +	 * Must align to double word boundary for the double cmpxchg instructions
> > +	 * to work.
> > +	 */
> > +	s->cpu_slab = __alloc_percpu(sizeof(struct kmem_cache_cpu), 2 * sizeof(void *));
> > +#else
> > +	/* Regular alignment is sufficient */
> >  	s->cpu_slab = alloc_percpu(struct kmem_cache_cpu);
> > +#endif
> > +
> > +	if (!s->cpu_slab)
> > +		return 0;
> >
> > -	return s->cpu_slab != NULL;
> > +	init_kmem_cache_cpus(s);
> > +
> > +	return 1;
> >  }


* Re: [cpuops cmpxchg double V3 3/5] Generic support for this_cpu_cmpxchg_double
  2011-02-25 20:44     ` Christoph Lameter
@ 2011-02-25 20:53       ` Steven Rostedt
  2011-02-25 20:58         ` Christoph Lameter
  0 siblings, 1 reply; 19+ messages in thread
From: Steven Rostedt @ 2011-02-25 20:53 UTC (permalink / raw)
  To: Christoph Lameter
  Cc: Tejun Heo, akpm, Pekka Enberg, linux-kernel, Eric Dumazet,
	H. Peter Anvin, Mathieu Desnoyers

On Fri, 2011-02-25 at 14:44 -0600, Christoph Lameter wrote:
> On Fri, 25 Feb 2011, Steven Rostedt wrote:
> 
> > > +	VM_BUG_ON((unsigned long)(&pcp1) % (2 * sizeof(pcp1)));		\
> > > +	VM_BUG_ON((unsigned long)(&pcp2) != (unsigned long)(&pcp1) + sizeof(pcp1));\
> > > +	VM_BUG_ON(sizeof(pcp1) != sizeof(pcp2));			\
> >
> > Since this is a macro, and it looks like all these are constants (sizeof
> > and addresses), couldn't you just do a BUILD_BUG_ON() instead?
> 
> The addresses are not constant.

I was thinking that if these are per_cpu then they would be global and
thus constant. But those are done at link time, which is too late.

OK, nevermind ;)

-- Steve


* Re: [cpuops cmpxchg double V3 4/5] Lockless (and preemptless) fastpaths for slub
  2011-02-25 20:46     ` Christoph Lameter
@ 2011-02-25 20:56       ` Mathieu Desnoyers
  0 siblings, 0 replies; 19+ messages in thread
From: Mathieu Desnoyers @ 2011-02-25 20:56 UTC (permalink / raw)
  To: Christoph Lameter
  Cc: Tejun Heo, akpm, Pekka Enberg, linux-kernel, Eric Dumazet,
	H. Peter Anvin

* Christoph Lameter (cl@linux.com) wrote:
> On Fri, 25 Feb 2011, Mathieu Desnoyers wrote:
> 
> > > Index: linux-2.6/include/linux/slub_def.h
> > > ===================================================================
> > > --- linux-2.6.orig/include/linux/slub_def.h	2011-02-25 10:45:49.000000000 -0600
> > > +++ linux-2.6/include/linux/slub_def.h	2011-02-25 10:46:19.000000000 -0600
> > > @@ -35,7 +35,10 @@ enum stat_item {
> > >  	NR_SLUB_STAT_ITEMS };
> > >
> > >  struct kmem_cache_cpu {
> > > -	void **freelist;	/* Pointer to first free per cpu object */
> > > +	void **freelist;	/* Pointer to next available object */
> > > +#ifdef CONFIG_CMPXCHG_LOCAL
> > > +	unsigned long tid;	/* Globally unique transaction id */
> > > +#endif
> >
> > There seems to be no strong guarantee that freelist is double-word aligned here.
> 
> The struct kmem_cache_cpu allocation via alloc_percpu() specifies double-word
> alignment. See the remainder of the code you quoted:

So adding a comment on top of the struct kmem_cache_cpu declaration might be
appropriate too, just in case it is ever defined elsewhere.

Thanks,

Mathieu

> 
> > > +#ifdef CONFIG_CMPXCHG_LOCAL
> > > +	/*
> > > +	 * Must align to double word boundary for the double cmpxchg instructions
> > > +	 * to work.
> > > +	 */
> > > +	s->cpu_slab = __alloc_percpu(sizeof(struct kmem_cache_cpu), 2 * sizeof(void *));
> > > +#else
> > > +	/* Regular alignment is sufficient */
> > >  	s->cpu_slab = alloc_percpu(struct kmem_cache_cpu);
> > > +#endif
> > > +
> > > +	if (!s->cpu_slab)
> > > +		return 0;
> > >
> > > -	return s->cpu_slab != NULL;
> > > +	init_kmem_cache_cpus(s);
> > > +
> > > +	return 1;
> > >  }

-- 
Mathieu Desnoyers
Operating System Efficiency R&D Consultant
EfficiOS Inc.
http://www.efficios.com


* Re: [cpuops cmpxchg double V3 3/5] Generic support for this_cpu_cmpxchg_double
  2011-02-25 20:53       ` Steven Rostedt
@ 2011-02-25 20:58         ` Christoph Lameter
  2011-02-25 21:01           ` Steven Rostedt
  0 siblings, 1 reply; 19+ messages in thread
From: Christoph Lameter @ 2011-02-25 20:58 UTC (permalink / raw)
  To: Steven Rostedt
  Cc: Tejun Heo, akpm, Pekka Enberg, linux-kernel, Eric Dumazet,
	H. Peter Anvin, Mathieu Desnoyers

On Fri, 25 Feb 2011, Steven Rostedt wrote:

> I was thinking that if these are per_cpu then they would be global and
> thus constant. But those are done at link time, which is too late.

Well, per_cpu data can also be dynamically allocated via alloc_percpu,
which is the use case here.



* Re: [cpuops cmpxchg double V3 3/5] Generic support for this_cpu_cmpxchg_double
  2011-02-25 20:58         ` Christoph Lameter
@ 2011-02-25 21:01           ` Steven Rostedt
  0 siblings, 0 replies; 19+ messages in thread
From: Steven Rostedt @ 2011-02-25 21:01 UTC (permalink / raw)
  To: Christoph Lameter
  Cc: Tejun Heo, akpm, Pekka Enberg, linux-kernel, Eric Dumazet,
	H. Peter Anvin, Mathieu Desnoyers

On Fri, 2011-02-25 at 14:58 -0600, Christoph Lameter wrote:
> On Fri, 25 Feb 2011, Steven Rostedt wrote:
> 
> > I was thinking that if these are per_cpu then they would be global and
> > thus constant. But those are done at link time, which is too late.
> 
> Well per_cpu data can also be dynamically allocated via alloc_percpu.
> Which is the use case here.

Good point, I forgot about that. Oh well, looks like runtime checking is
all we can do.

-- Steve


* [PATCH] percpu: Generic support for this_cpu_cmpxchg_double() this_cpu_cmpxchg_double
  2011-02-25 17:38 ` [cpuops cmpxchg double V3 3/5] Generic support for this_cpu_cmpxchg_double Christoph Lameter
  2011-02-25 18:25   ` Mathieu Desnoyers
  2011-02-25 20:28   ` Steven Rostedt
@ 2011-02-28 10:22   ` Tejun Heo
  2 siblings, 0 replies; 19+ messages in thread
From: Tejun Heo @ 2011-02-28 10:22 UTC (permalink / raw)
  To: Christoph Lameter
  Cc: akpm, Pekka Enberg, linux-kernel, Eric Dumazet, H. Peter Anvin,
	Mathieu Desnoyers

From 7c3343392172ba98d9d90a83edcc4c2e80897009 Mon Sep 17 00:00:00 2001
From: Christoph Lameter <cl@linux.com>
Date: Mon, 28 Feb 2011 11:02:24 +0100

Introduce this_cpu_cmpxchg_double().  this_cpu_cmpxchg_double() compares
two consecutive words against two old values and replaces both words if
both comparisons match.

	bool this_cpu_cmpxchg_double(pcp1, pcp2,
		old_word1, old_word2, new_word1, new_word2)

this_cpu_cmpxchg_double does not return the old values (difficult since
there are two words) but a boolean indicating whether the operation was
successful.

The first percpu variable must be double-word aligned!

-tj: Updated to return bool instead of int, converted size check to
     BUILD_BUG_ON() instead of VM_BUG_ON() and other cosmetic changes.

Signed-off-by: Christoph Lameter <cl@linux.com>
Signed-off-by: Tejun Heo <tj@kernel.org>
---
Applied to percpu:for-2.6.39 and pushed out to percpu:for-next.  Thank
you.

 include/linux/percpu.h |  128 ++++++++++++++++++++++++++++++++++++++++++++++++
 1 files changed, 128 insertions(+), 0 deletions(-)

diff --git a/include/linux/percpu.h b/include/linux/percpu.h
index 27c3c6f..3a5c444 100644
--- a/include/linux/percpu.h
+++ b/include/linux/percpu.h
@@ -255,6 +255,30 @@ extern void __bad_size_call_parameter(void);
 	pscr2_ret__;							\
 })
 
+/*
+ * Special handling for cmpxchg_double.  cmpxchg_double is passed two
+ * percpu variables.  The first has to be aligned to a double word
+ * boundary and the second has to follow directly thereafter.
+ */
+#define __pcpu_double_call_return_bool(stem, pcp1, pcp2, ...)		\
+({									\
+	bool pdcrb_ret__;						\
+	__verify_pcpu_ptr(&pcp1);					\
+	BUILD_BUG_ON(sizeof(pcp1) != sizeof(pcp2));			\
+	VM_BUG_ON((unsigned long)(&pcp1) % (2 * sizeof(pcp1)));		\
+	VM_BUG_ON((unsigned long)(&pcp2) !=				\
+		  (unsigned long)(&pcp1) + sizeof(pcp1));		\
+	switch(sizeof(pcp1)) {						\
+	case 1: pdcrb_ret__ = stem##1(pcp1, pcp2, __VA_ARGS__); break;	\
+	case 2: pdcrb_ret__ = stem##2(pcp1, pcp2, __VA_ARGS__); break;	\
+	case 4: pdcrb_ret__ = stem##4(pcp1, pcp2, __VA_ARGS__); break;	\
+	case 8: pdcrb_ret__ = stem##8(pcp1, pcp2, __VA_ARGS__); break;	\
+	default:							\
+		__bad_size_call_parameter(); break;			\
+	}								\
+	pdcrb_ret__;							\
+})
+
 #define __pcpu_size_call(stem, variable, ...)				\
 do {									\
 	__verify_pcpu_ptr(&(variable));					\
@@ -501,6 +525,45 @@ do {									\
 #endif
 
 /*
+ * cmpxchg_double replaces two adjacent scalars at once.  The first
+ * two parameters are per cpu variables which have to be of the same
+ * size.  A truth value is returned to indicate success or failure
+ * (since a double register result is difficult to handle).  There is
+ * very limited hardware support for these operations, so only certain
+ * sizes may work.
+ */
+#define _this_cpu_generic_cmpxchg_double(pcp1, pcp2, oval1, oval2, nval1, nval2)	\
+({									\
+	int ret__;							\
+	preempt_disable();						\
+	ret__ = __this_cpu_generic_cmpxchg_double(pcp1, pcp2,		\
+			oval1, oval2, nval1, nval2);			\
+	preempt_enable();						\
+	ret__;								\
+})
+
+#ifndef this_cpu_cmpxchg_double
+# ifndef this_cpu_cmpxchg_double_1
+#  define this_cpu_cmpxchg_double_1(pcp1, pcp2, oval1, oval2, nval1, nval2)	\
+	_this_cpu_generic_cmpxchg_double(pcp1, pcp2, oval1, oval2, nval1, nval2)
+# endif
+# ifndef this_cpu_cmpxchg_double_2
+#  define this_cpu_cmpxchg_double_2(pcp1, pcp2, oval1, oval2, nval1, nval2)	\
+	_this_cpu_generic_cmpxchg_double(pcp1, pcp2, oval1, oval2, nval1, nval2)
+# endif
+# ifndef this_cpu_cmpxchg_double_4
+#  define this_cpu_cmpxchg_double_4(pcp1, pcp2, oval1, oval2, nval1, nval2)	\
+	_this_cpu_generic_cmpxchg_double(pcp1, pcp2, oval1, oval2, nval1, nval2)
+# endif
+# ifndef this_cpu_cmpxchg_double_8
+#  define this_cpu_cmpxchg_double_8(pcp1, pcp2, oval1, oval2, nval1, nval2)	\
+	_this_cpu_generic_cmpxchg_double(pcp1, pcp2, oval1, oval2, nval1, nval2)
+# endif
+# define this_cpu_cmpxchg_double(pcp1, pcp2, oval1, oval2, nval1, nval2)	\
+	__pcpu_double_call_return_bool(this_cpu_cmpxchg_double_, (pcp1), (pcp2), (oval1), (oval2), (nval1), (nval2))
+#endif
+
+/*
  * Generic percpu operations that do not require preemption handling.
  * Either we do not care about races or the caller has the
  * responsibility of handling preemptions issues. Arch code can still
@@ -703,6 +766,39 @@ do {									\
 	__pcpu_size_call_return2(__this_cpu_cmpxchg_, pcp, oval, nval)
 #endif
 
+#define __this_cpu_generic_cmpxchg_double(pcp1, pcp2, oval1, oval2, nval1, nval2)	\
+({									\
+	int __ret = 0;							\
+	if (__this_cpu_read(pcp1) == (oval1) &&				\
+			 __this_cpu_read(pcp2)  == (oval2)) {		\
+		__this_cpu_write(pcp1, (nval1));			\
+		__this_cpu_write(pcp2, (nval2));			\
+		__ret = 1;						\
+	}								\
+	(__ret);							\
+})
+
+#ifndef __this_cpu_cmpxchg_double
+# ifndef __this_cpu_cmpxchg_double_1
+#  define __this_cpu_cmpxchg_double_1(pcp1, pcp2, oval1, oval2, nval1, nval2)	\
+	__this_cpu_generic_cmpxchg_double(pcp1, pcp2, oval1, oval2, nval1, nval2)
+# endif
+# ifndef __this_cpu_cmpxchg_double_2
+#  define __this_cpu_cmpxchg_double_2(pcp1, pcp2, oval1, oval2, nval1, nval2)	\
+	__this_cpu_generic_cmpxchg_double(pcp1, pcp2, oval1, oval2, nval1, nval2)
+# endif
+# ifndef __this_cpu_cmpxchg_double_4
+#  define __this_cpu_cmpxchg_double_4(pcp1, pcp2, oval1, oval2, nval1, nval2)	\
+	__this_cpu_generic_cmpxchg_double(pcp1, pcp2, oval1, oval2, nval1, nval2)
+# endif
+# ifndef __this_cpu_cmpxchg_double_8
+#  define __this_cpu_cmpxchg_double_8(pcp1, pcp2, oval1, oval2, nval1, nval2)	\
+	__this_cpu_generic_cmpxchg_double(pcp1, pcp2, oval1, oval2, nval1, nval2)
+# endif
+# define __this_cpu_cmpxchg_double(pcp1, pcp2, oval1, oval2, nval1, nval2)	\
+	__pcpu_double_call_return_bool(__this_cpu_cmpxchg_double_, (pcp1), (pcp2), (oval1), (oval2), (nval1), (nval2))
+#endif
+
 /*
  * IRQ safe versions of the per cpu RMW operations. Note that these operations
  * are *not* safe against modification of the same variable from another
@@ -823,4 +919,36 @@ do {									\
 	__pcpu_size_call_return2(irqsafe_cpu_cmpxchg_, (pcp), oval, nval)
 #endif
 
+#define irqsafe_generic_cpu_cmpxchg_double(pcp1, pcp2, oval1, oval2, nval1, nval2)	\
+({									\
+	int ret__;							\
+	unsigned long flags;						\
+	local_irq_save(flags);						\
+	ret__ = __this_cpu_generic_cmpxchg_double(pcp1, pcp2,		\
+			oval1, oval2, nval1, nval2);			\
+	local_irq_restore(flags);					\
+	ret__;								\
+})
+
+#ifndef irqsafe_cpu_cmpxchg_double
+# ifndef irqsafe_cpu_cmpxchg_double_1
+#  define irqsafe_cpu_cmpxchg_double_1(pcp1, pcp2, oval1, oval2, nval1, nval2)	\
+	irqsafe_generic_cpu_cmpxchg_double(pcp1, pcp2, oval1, oval2, nval1, nval2)
+# endif
+# ifndef irqsafe_cpu_cmpxchg_double_2
+#  define irqsafe_cpu_cmpxchg_double_2(pcp1, pcp2, oval1, oval2, nval1, nval2)	\
+	irqsafe_generic_cpu_cmpxchg_double(pcp1, pcp2, oval1, oval2, nval1, nval2)
+# endif
+# ifndef irqsafe_cpu_cmpxchg_double_4
+#  define irqsafe_cpu_cmpxchg_double_4(pcp1, pcp2, oval1, oval2, nval1, nval2)	\
+	irqsafe_generic_cpu_cmpxchg_double(pcp1, pcp2, oval1, oval2, nval1, nval2)
+# endif
+# ifndef irqsafe_cpu_cmpxchg_double_8
+#  define irqsafe_cpu_cmpxchg_double_8(pcp1, pcp2, oval1, oval2, nval1, nval2)	\
+	irqsafe_generic_cpu_cmpxchg_double(pcp1, pcp2, oval1, oval2, nval1, nval2)
+# endif
+# define irqsafe_cpu_cmpxchg_double(pcp1, pcp2, oval1, oval2, nval1, nval2)	\
+	__pcpu_double_call_return_int(irqsafe_cpu_cmpxchg_double_, (pcp1), (pcp2), (oval1), (oval2), (nval1), (nval2))
+#endif
+
 #endif /* __LINUX_PERCPU_H */
-- 
1.7.1



* [PATCH] percpu, x86: Add arch-specific this_cpu_cmpxchg_double() support
  2011-02-25 17:38 ` [cpuops cmpxchg double V3 5/5] x86: this_cpu_cmpxchg_double() support Christoph Lameter
@ 2011-02-28 10:23   ` Tejun Heo
  0 siblings, 0 replies; 19+ messages in thread
From: Tejun Heo @ 2011-02-28 10:23 UTC (permalink / raw)
  To: Christoph Lameter
  Cc: akpm, Pekka Enberg, linux-kernel, Eric Dumazet, H. Peter Anvin,
	Mathieu Desnoyers

From b9ec40af0e18fb7d02106be148036c2ea490fdf9 Mon Sep 17 00:00:00 2001
From: Christoph Lameter <cl@linux.com>
Date: Mon, 28 Feb 2011 11:02:24 +0100

Support this_cpu_cmpxchg_double() using the cmpxchg16b and cmpxchg8b
instructions.

-tj: s/percpu_cmpxchg16b/percpu_cmpxchg16b_double/ for consistency and
     other cosmetic changes.

Signed-off-by: Christoph Lameter <cl@linux.com>
Signed-off-by: Tejun Heo <tj@kernel.org>
---
Applied to percpu:for-2.6.39 and pushed out to :for-next.  Thank you.

 arch/x86/include/asm/percpu.h |   48 +++++++++++++++++++++++++++++++++
 arch/x86/lib/Makefile         |    1 +
 arch/x86/lib/cmpxchg16b_emu.S |   59 +++++++++++++++++++++++++++++++++++++++++
 3 files changed, 108 insertions(+), 0 deletions(-)
 create mode 100644 arch/x86/lib/cmpxchg16b_emu.S

diff --git a/arch/x86/include/asm/percpu.h b/arch/x86/include/asm/percpu.h
index 3788f46..260ac7a 100644
--- a/arch/x86/include/asm/percpu.h
+++ b/arch/x86/include/asm/percpu.h
@@ -451,6 +451,26 @@ do {									\
 #define irqsafe_cpu_cmpxchg_4(pcp, oval, nval)	percpu_cmpxchg_op(pcp, oval, nval)
 #endif /* !CONFIG_M386 */
 
+#ifdef CONFIG_X86_CMPXCHG64
+#define percpu_cmpxchg8b_double(pcp1, o1, o2, n1, n2)			\
+({									\
+	char __ret;							\
+	typeof(o1) __o1 = o1;						\
+	typeof(o1) __n1 = n1;						\
+	typeof(o2) __o2 = o2;						\
+	typeof(o2) __n2 = n2;						\
+	typeof(o2) __dummy = n2;					\
+	asm volatile("cmpxchg8b "__percpu_arg(1)"\n\tsetz %0\n\t"	\
+		    : "=a"(__ret), "=m" (pcp1), "=d"(__dummy)		\
+		    :  "b"(__n1), "c"(__n2), "a"(__o1), "d"(__o2));	\
+	__ret;								\
+})
+
+#define __this_cpu_cmpxchg_double_4(pcp1, pcp2, o1, o2, n1, n2)		percpu_cmpxchg8b_double(pcp1, o1, o2, n1, n2)
+#define this_cpu_cmpxchg_double_4(pcp1, pcp2, o1, o2, n1, n2)		percpu_cmpxchg8b_double(pcp1, o1, o2, n1, n2)
+#define irqsafe_cpu_cmpxchg_double_4(pcp1, pcp2, o1, o2, n1, n2)	percpu_cmpxchg8b_double(pcp1, o1, o2, n1, n2)
+#endif /* CONFIG_X86_CMPXCHG64 */
+
 /*
  * Per cpu atomic 64 bit operations are only available under 64 bit.
  * 32 bit must fall back to generic operations.
@@ -480,6 +500,34 @@ do {									\
 #define irqsafe_cpu_xor_8(pcp, val)	percpu_to_op("xor", (pcp), val)
 #define irqsafe_cpu_xchg_8(pcp, nval)	percpu_xchg_op(pcp, nval)
 #define irqsafe_cpu_cmpxchg_8(pcp, oval, nval)	percpu_cmpxchg_op(pcp, oval, nval)
+
+/*
+ * Pretty complex macro to generate the cmpxchg16b instruction.  The
+ * instruction is not supported on early AMD64 processors so we must be
+ * able to emulate it in software.  The address used in the cmpxchg16b
+ * instruction must be aligned to a 16-byte boundary.
+ */
+#define percpu_cmpxchg16b_double(pcp1, o1, o2, n1, n2)			\
+({									\
+	char __ret;							\
+	typeof(o1) __o1 = o1;						\
+	typeof(o1) __n1 = n1;						\
+	typeof(o2) __o2 = o2;						\
+	typeof(o2) __n2 = n2;						\
+	typeof(o2) __dummy;						\
+	alternative_io("call this_cpu_cmpxchg16b_emu\n\t" P6_NOP4,	\
+		       "cmpxchg16b %%gs:(%%rsi)\n\tsetz %0\n\t",	\
+		       X86_FEATURE_CX16,				\
+		       ASM_OUTPUT2("=a"(__ret), "=d"(__dummy)),		\
+		       "S" (&pcp1), "b"(__n1), "c"(__n2),		\
+		       "a"(__o1), "d"(__o2));				\
+	__ret;								\
+})
+
+#define __this_cpu_cmpxchg_double_8(pcp1, pcp2, o1, o2, n1, n2)		percpu_cmpxchg16b_double(pcp1, o1, o2, n1, n2)
+#define this_cpu_cmpxchg_double_8(pcp1, pcp2, o1, o2, n1, n2)		percpu_cmpxchg16b_double(pcp1, o1, o2, n1, n2)
+#define irqsafe_cpu_cmpxchg_double_8(pcp1, pcp2, o1, o2, n1, n2)	percpu_cmpxchg16b_double(pcp1, o1, o2, n1, n2)
+
 #endif
 
 /* This is not atomic against other CPUs -- CPU preemption needs to be off */
diff --git a/arch/x86/lib/Makefile b/arch/x86/lib/Makefile
index e10cf07..f2479f1 100644
--- a/arch/x86/lib/Makefile
+++ b/arch/x86/lib/Makefile
@@ -42,4 +42,5 @@ else
         lib-y += memmove_64.o memset_64.o
         lib-y += copy_user_64.o rwlock_64.o copy_user_nocache_64.o
 	lib-$(CONFIG_RWSEM_XCHGADD_ALGORITHM) += rwsem_64.o
+	lib-y += cmpxchg16b_emu.o
 endif
diff --git a/arch/x86/lib/cmpxchg16b_emu.S b/arch/x86/lib/cmpxchg16b_emu.S
new file mode 100644
index 0000000..3e8b08a
--- /dev/null
+++ b/arch/x86/lib/cmpxchg16b_emu.S
@@ -0,0 +1,59 @@
+/*
+ *	This program is free software; you can redistribute it and/or
+ *	modify it under the terms of the GNU General Public License
+ *	as published by the Free Software Foundation; version 2
+ *	of the License.
+ *
+ */
+#include <linux/linkage.h>
+#include <asm/alternative-asm.h>
+#include <asm/frame.h>
+#include <asm/dwarf2.h>
+
+.text
+
+/*
+ * Inputs:
+ * %rsi : memory location to compare
+ * %rax : low 64 bits of old value
+ * %rdx : high 64 bits of old value
+ * %rbx : low 64 bits of new value
+ * %rcx : high 64 bits of new value
+ * %al  : Operation successful
+ */
+ENTRY(this_cpu_cmpxchg16b_emu)
+CFI_STARTPROC
+
+#
+# Emulate 'cmpxchg16b %gs:(%rsi)' except we return the result in %al not
+# via the ZF.  Caller will access %al to get result.
+#
+# Note that this is only useful for a cpuops operation.  Meaning that we
+# do *not* have a fully atomic operation but just an operation that is
+# *atomic* on a single cpu (as provided by the this_cpu_xx class of
+# macros).
+#
+this_cpu_cmpxchg16b_emu:
+	pushf
+	cli
+
+	cmpq %gs:(%rsi), %rax
+	jne not_same
+	cmpq %gs:8(%rsi), %rdx
+	jne not_same
+
+	movq %rbx, %gs:(%rsi)
+	movq %rcx, %gs:8(%rsi)
+
+	popf
+	mov $1, %al
+	ret
+
+ not_same:
+	popf
+	xor %al,%al
+	ret
+
+CFI_ENDPROC
+
+ENDPROC(this_cpu_cmpxchg16b_emu)
-- 
1.7.1



* Re: [cpuops cmpxchg double V3 0/5] this_cpu_cmpxchg_double support
  2011-02-25 17:38 [cpuops cmpxchg double V3 0/5] this_cpu_cmpxchg_double support Christoph Lameter
                   ` (4 preceding siblings ...)
  2011-02-25 17:38 ` [cpuops cmpxchg double V3 5/5] x86: this_cpu_cmpxchg_double() support Christoph Lameter
@ 2011-02-28 10:36 ` Tejun Heo
  5 siblings, 0 replies; 19+ messages in thread
From: Tejun Heo @ 2011-02-28 10:36 UTC (permalink / raw)
  To: Christoph Lameter
  Cc: akpm, Pekka Enberg, linux-kernel, Eric Dumazet, H. Peter Anvin,
	Mathieu Desnoyers

On Fri, Feb 25, 2011 at 11:38:50AM -0600, Christoph Lameter wrote:
> This patch series introduces this_cpu_cmpxchg_double().
> 
> x86 cpus support the cmpxchg16b and cmpxchg8b instructions, which are capable
> of swapping two words instead of one during a cmpxchg.
> Two words allow more state to be swapped in a single atomic instruction.
> 
> this_cpu_cmpxchg_double() is used in the slub allocator to avoid
> interrupt disable/enable in both the alloc and free fastpaths.
> Using the new operation significantly speeds up the fastpaths.

Pekka, Christoph, I applied the third and fifth patches to the percpu
tree.  Please feel free to pull from the following branch and apply
slub changes on top of it.

  git://git.kernel.org/pub/scm/linux/kernel/git/tj/percpu.git for-2.6.39

HEAD is b9ec40af0e18fb7d02106be148036c2ea490fdf9.  As git.korg seems a
bit slow to sync these days, it may be better to pull from
master.korg.

  ssh://master.kernel.org/pub/scm/linux/kernel/git/tj/percpu.git for-2.6.39

Thank you.

-- 
tejun


end of thread, other threads:[~2011-02-28 10:36 UTC | newest]

Thread overview: 19+ messages
2011-02-25 17:38 [cpuops cmpxchg double V3 0/5] this_cpu_cmpxchg_double support Christoph Lameter
2011-02-25 17:38 ` [cpuops cmpxchg double V3 1/5] slub: min_partial needs to be in first cacheline Christoph Lameter
2011-02-25 17:38 ` [cpuops cmpxchg double V3 2/5] slub: Get rid of slab_free_hook_irq() Christoph Lameter
2011-02-25 18:23   ` Mathieu Desnoyers
2011-02-25 17:38 ` [cpuops cmpxchg double V3 3/5] Generic support for this_cpu_cmpxchg_double Christoph Lameter
2011-02-25 18:25   ` Mathieu Desnoyers
2011-02-25 20:28   ` Steven Rostedt
2011-02-25 20:44     ` Christoph Lameter
2011-02-25 20:53       ` Steven Rostedt
2011-02-25 20:58         ` Christoph Lameter
2011-02-25 21:01           ` Steven Rostedt
2011-02-28 10:22   ` [PATCH] percpu: Generic support for this_cpu_cmpxchg_double() this_cpu_cmpxchg_double Tejun Heo
2011-02-25 17:38 ` [cpuops cmpxchg double V3 4/5] Lockless (and preemptless) fastpaths for slub Christoph Lameter
2011-02-25 18:21   ` Mathieu Desnoyers
2011-02-25 20:46     ` Christoph Lameter
2011-02-25 20:56       ` Mathieu Desnoyers
2011-02-25 17:38 ` [cpuops cmpxchg double V3 5/5] x86: this_cpu_cmpxchg_double() support Christoph Lameter
2011-02-28 10:23   ` [PATCH] percpu, x86: Add arch-specific " Tejun Heo
2011-02-28 10:36 ` [cpuops cmpxchg double V3 0/5] this_cpu_cmpxchg_double support Tejun Heo
