* Re: [patch 08/10] SLUB: Optional fast path using cmpxchg_local
       [not found] ` <20071028033300.240703208@sgi.com>
@ 2007-10-30 18:49   ` Andrew Morton
  2007-10-30 18:58     ` Christoph Lameter
  0 siblings, 1 reply; 4+ messages in thread
From: Andrew Morton @ 2007-10-30 18:49 UTC (permalink / raw)
  To: Christoph Lameter; +Cc: matthew, linux-kernel, linux-mm, penberg, linux-arch

On Sat, 27 Oct 2007 20:32:04 -0700
Christoph Lameter <clameter@sgi.com> wrote:

> Provide an alternate implementation of the SLUB fast paths for alloc
> and free using cmpxchg_local. The cmpxchg_local fast path is selected
> for arches that have CONFIG_FAST_CMPXCHG_LOCAL set. An arch should only
> set CONFIG_FAST_CMPXCHG_LOCAL if the cmpxchg_local is faster than an
> interrupt enable/disable sequence. This is known to be true for both
> x86 platforms so set FAST_CMPXCHG_LOCAL for both arches.
> 
> Not all arches can support fast cmpxchg operations. Typically the
> architecture must have an optimized cmpxchg instruction. The
> cmpxchg fast path makes no sense on platforms whose cmpxchg is
> slower than interrupt enable/disable (like f.e. IA64).
> 
> The advantages of a cmpxchg_local based fast path are:
> 
> 1. Lower cycle count (30%-60% faster)
> 
> 2. There is no need to disable and enable interrupts on the fast path.
>    Currently interrupts have to be disabled and enabled on every
>    slab operation. This is likely saving a significant percentage
>    of interrupt off / on sequences in the kernel.
> 
> 3. The disposal of freed slabs can occur with interrupts enabled.
> 
> The alternate path is realized using #ifdefs. Several attempts to do the
> same with macros and inline functions resulted in a mess (in particular due
> to the strange way that local_irq_save() handles its argument and due
> to the need to define macros/functions that sometimes disable interrupts
> and sometimes do something else). The macro-based approaches also made it
> difficult to preserve the optimizations for the non-cmpxchg paths.
> 
> #ifdef seems to be the way to go here to have a readable source.
> 
> 
> ---
>  arch/x86/Kconfig.i386   |    4 ++
>  arch/x86/Kconfig.x86_64 |    4 ++
>  mm/slub.c               |   71 ++++++++++++++++++++++++++++++++++++++++++++++--

Let's cc linux-arch: presumably other architectures can implement cpu-local
cmpxchg and would see some benefit from doing so.

The semantics are "atomic wrt interrupts on this cpu, not atomic wrt other
cpus", yes?

Do you have a feel for how useful it would be for arch maintainers to implement
this?  IOW, is it worth their time?

> 
> Index: linux-2.6/mm/slub.c
> ===================================================================
> --- linux-2.6.orig/mm/slub.c	2007-10-27 10:39:07.583665939 -0700
> +++ linux-2.6/mm/slub.c	2007-10-27 10:40:19.710415861 -0700
> @@ -1496,7 +1496,12 @@ static void *__slab_alloc(struct kmem_ca
>  {
>  	void **object;
>  	struct page *new;
> +#ifdef CONFIG_FAST_CMPXCHG_LOCAL
> +	unsigned long flags;
>  
> +	local_irq_save(flags);
> +	preempt_enable_no_resched();
> +#endif
>  	if (!c->page)
>  		goto new_slab;
>  
> @@ -1518,6 +1523,10 @@ load_freelist:
>  unlock_out:
>  	slab_unlock(c->page);
>  out:
> +#ifdef CONFIG_FAST_CMPXCHG_LOCAL
> +	preempt_disable();
> +	local_irq_restore(flags);
> +#endif
>  	return object;
>  
>  another_slab:
> @@ -1592,9 +1601,26 @@ static void __always_inline *slab_alloc(
>  		gfp_t gfpflags, int node, void *addr)
>  {
>  	void **object;
> -	unsigned long flags;
>  	struct kmem_cache_cpu *c;
>  
> +#ifdef CONFIG_FAST_CMPXCHG_LOCAL
> +	c = get_cpu_slab(s, get_cpu());
> +	do {
> +		object = c->freelist;
> +		if (unlikely(is_end(object) || !node_match(c, node))) {
> +			object = __slab_alloc(s, gfpflags, node, addr, c);
> +			if (unlikely(!object)) {
> +				put_cpu();
> +				goto out;
> +			}
> +			break;
> +		}
> +	} while (cmpxchg_local(&c->freelist, object, object[c->offset])
> +								!= object);
> +	put_cpu();
> +#else
> +	unsigned long flags;
> +
>  	local_irq_save(flags);
>  	c = get_cpu_slab(s, smp_processor_id());
>  	if (unlikely((is_end(c->freelist)) || !node_match(c, node))) {
> @@ -1609,6 +1635,7 @@ static void __always_inline *slab_alloc(
>  		c->freelist = object[c->offset];
>  	}
>  	local_irq_restore(flags);
> +#endif
>  
>  	if (unlikely((gfpflags & __GFP_ZERO)))
>  		memset(object, 0, c->objsize);
> @@ -1644,6 +1671,11 @@ static void __slab_free(struct kmem_cach
>  	void *prior;
>  	void **object = (void *)x;
>  
> +#ifdef CONFIG_FAST_CMPXCHG_LOCAL
> +	unsigned long flags;
> +
> +	local_irq_save(flags);
> +#endif
>  	slab_lock(page);
>  
>  	if (unlikely(SlabDebug(page)))
> @@ -1669,6 +1701,9 @@ checks_ok:
>  
>  out_unlock:
>  	slab_unlock(page);
> +#ifdef CONFIG_FAST_CMPXCHG_LOCAL
> +	local_irq_restore(flags);
> +#endif
>  	return;
>  
>  slab_empty:
> @@ -1679,6 +1714,9 @@ slab_empty:
>  		remove_partial(s, page);
>  
>  	slab_unlock(page);
> +#ifdef CONFIG_FAST_CMPXCHG_LOCAL
> +	local_irq_restore(flags);
> +#endif
>  	discard_slab(s, page);
>  	return;
>  
> @@ -1703,9 +1741,37 @@ static void __always_inline slab_free(st
>  			struct page *page, void *x, void *addr)
>  {
>  	void **object = (void *)x;
> -	unsigned long flags;
>  	struct kmem_cache_cpu *c;
>  
> +#ifdef CONFIG_FAST_CMPXCHG_LOCAL
> +	void **freelist;
> +
> +	c = get_cpu_slab(s, get_cpu());
> +	debug_check_no_locks_freed(object, s->objsize);
> +	do {
> +		freelist = c->freelist;
> +		barrier();
> +		/*
> +		 * If the compiler would reorder the retrieval of c->page to
> +		 * come before c->freelist then an interrupt could
> +		 * change the cpu slab before we retrieve c->freelist. We
> +		 * could be matching on a page no longer active and put the
> +		 * object onto the freelist of the wrong slab.
> +		 *
> +		 * On the other hand: If we already have the freelist pointer
> +		 * then any change of cpu_slab will cause the cmpxchg to fail
> +		 * since the freelist pointers are unique per slab.
> +		 */
> +		if (unlikely(page != c->page || c->node < 0)) {
> +			__slab_free(s, page, x, addr, c->offset);
> +			break;
> +		}
> +		object[c->offset] = freelist;
> +	} while (cmpxchg_local(&c->freelist, freelist, object) != freelist);
> +	put_cpu();
> +#else
> +	unsigned long flags;
> +
>  	local_irq_save(flags);
>  	debug_check_no_locks_freed(object, s->objsize);
>  	c = get_cpu_slab(s, smp_processor_id());
> @@ -1716,6 +1782,7 @@ static void __always_inline slab_free(st
>  		__slab_free(s, page, x, addr, c->offset);
>  
>  	local_irq_restore(flags);
> +#endif
>  }
>  
>  void kmem_cache_free(struct kmem_cache *s, void *x)
> Index: linux-2.6/arch/x86/Kconfig.i386
> ===================================================================
> --- linux-2.6.orig/arch/x86/Kconfig.i386	2007-10-27 10:38:33.630415778 -0700
> +++ linux-2.6/arch/x86/Kconfig.i386	2007-10-27 10:40:19.710415861 -0700
> @@ -51,6 +51,10 @@ config X86
>  	bool
>  	default y
>  
> +config FAST_CMPXCHG_LOCAL
> +	bool
> +	default y
> +
>  config MMU
>  	bool
>  	default y
> Index: linux-2.6/arch/x86/Kconfig.x86_64
> ===================================================================
> --- linux-2.6.orig/arch/x86/Kconfig.x86_64	2007-10-27 10:38:33.630415778 -0700
> +++ linux-2.6/arch/x86/Kconfig.x86_64	2007-10-27 10:40:19.710415861 -0700
> @@ -97,6 +97,10 @@ config X86_CMPXCHG
>  	bool
>  	default y
>  
> +config FAST_CMPXCHG_LOCAL
> +	bool
> +	default y
> +
>  config EARLY_PRINTK
>  	bool
>  	default y
> 
> -- 


* Re: [patch 08/10] SLUB: Optional fast path using cmpxchg_local
  2007-10-30 18:49   ` [patch 08/10] SLUB: Optional fast path using cmpxchg_local Andrew Morton
@ 2007-10-30 18:58     ` Christoph Lameter
  2007-10-30 19:12       ` Mathieu Desnoyers
  2007-10-31  1:52       ` [PATCH] local_t Documentation update 2 Mathieu Desnoyers
  0 siblings, 2 replies; 4+ messages in thread
From: Christoph Lameter @ 2007-10-30 18:58 UTC (permalink / raw)
  To: Andrew Morton
  Cc: matthew, linux-kernel, linux-mm, penberg, Mathieu Desnoyers,
	linux-arch

On Tue, 30 Oct 2007, Andrew Morton wrote:

> Let's cc linux-arch: presumably other architectures can implement cpu-local
> cmpxchg and would see some benefit from doing so.

Mathieu had a whole series of cmpxchg_local patches. Ccing him too. I
think he has some numbers for other architectures.
 
> The semantics are "atomic wrt interrupts on this cpu, not atomic wrt other
> cpus", yes?

Right.
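
To make that concrete, here is a minimal sketch (not from the patch; the
names are made up) of the kind of update cmpxchg_local permits: safe against
an interrupt handler on the same CPU touching the same per-cpu data, but not
against another CPU. It assumes cmpxchg_local() is available for the
architecture (the header providing it varies by kernel version and arch).

/*
 * Minimal sketch, not part of the patch: a per-cpu counter updated with
 * cmpxchg_local().  The update is safe against interrupt handlers on the
 * owning CPU, but the variable must never be written by another CPU.
 */
static DEFINE_PER_CPU(unsigned long, hypothetical_counter);

static void hypothetical_count(unsigned long delta)
{
	unsigned long *p, old;

	p = &get_cpu_var(hypothetical_counter);	/* disables preemption */
	do {
		old = *p;
		/* An interrupt may update *p here; the cmpxchg then fails */
	} while (cmpxchg_local(p, old, old + delta) != old);
	put_cpu_var(hypothetical_counter);
}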

> Do you have a feel for how useful it would be for arch maintainers to implement
> this?  IOW, is it worth their time?

That depends on the efficiency of a cmpxchg_local vs. the interrupt
enable/disable sequence on a particular arch. On x86 it cuts the cost of
the fast path by about 50%, so it roughly doubles its speed. On other
architectures the cmpxchg is so slow that it is not worth it (e.g. ia64).


* Re: [patch 08/10] SLUB: Optional fast path using cmpxchg_local
  2007-10-30 18:58     ` Christoph Lameter
@ 2007-10-30 19:12       ` Mathieu Desnoyers
  2007-10-31  1:52       ` [PATCH] local_t Documentation update 2 Mathieu Desnoyers
  1 sibling, 0 replies; 4+ messages in thread
From: Mathieu Desnoyers @ 2007-10-30 19:12 UTC (permalink / raw)
  To: Christoph Lameter
  Cc: Andrew Morton, matthew, linux-kernel, linux-mm, penberg,
	linux-arch

* Christoph Lameter (clameter@sgi.com) wrote:
> On Tue, 30 Oct 2007, Andrew Morton wrote:
> 
> > Let's cc linux-arch: presumably other architectures can implement cpu-local
> > cmpxchg and would see some benefit from doing so.
> 
> Matheiu had a whole series of cmpxchg_local patches. Ccing him too. I 
> think he has some numbers for other architectures.
>  

Well, I tested it on x86 and AMD64 only. For slub:

Using cmpxchg_local shows a performance improvement of the fast path
ranging from a 66% speedup on a Pentium 4 to a 14% speedup on AMD64.

It really depends on how fast cmpxchg_local is vs disabling interrupts.
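
For anyone who wants a feel for the relative cost on their own architecture,
a rough, hypothetical test module along these lines can be used. This is not
the harness the numbers above came from; it only illustrates the comparison,
and it assumes cmpxchg_local() exists for the arch. The numbers are noisy
(interrupts remain enabled during the loops), so average several runs.

#include <linux/module.h>
#include <linux/kernel.h>
#include <linux/irqflags.h>
#include <linux/timex.h>	/* get_cycles() */

#define LOOPS 1000000

static int __init cmpxchg_local_cost_init(void)
{
	unsigned long flags, word = 0;
	cycles_t t0, t1, t2;
	int i;

	t0 = get_cycles();
	for (i = 0; i < LOOPS; i++) {
		local_irq_save(flags);
		word++;
		local_irq_restore(flags);
	}
	t1 = get_cycles();
	for (i = 0; i < LOOPS; i++)
		(void)cmpxchg_local(&word, word, word + 1);
	t2 = get_cycles();

	printk(KERN_INFO "irq save/restore: %llu cycles, cmpxchg_local: %llu cycles\n",
	       (unsigned long long)(t1 - t0), (unsigned long long)(t2 - t1));
	return -EAGAIN;		/* fail the load; nothing to keep resident */
}
module_init(cmpxchg_local_cost_init);

MODULE_LICENSE("GPL");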


> > The semantics are "atomic wrt interrupts on this cpu, not atomic wrt other
> > cpus", yes?
> 
> Right.
> 
> > Do you have a feel for how useful it would be for arch maintainers to implement
> > this?  IOW, is it worth their time?
> 
> That depends on the efficiency of a cmpxchg_local vs. the interrupt
> enable/disable sequence on a particular arch. On x86 it cuts the cost of
> the fast path by about 50%, so it roughly doubles its speed. On other
> architectures the cmpxchg is so slow that it is not worth it (e.g. ia64).

As Christoph pointed out, we even saw a small slowdown on ia64 because
it has no concept of atomicity wrt only one CPU. Emulating this with
irq disable has been tried, but the supplementary memory barriers alone
hurt performance a bit. We tried to come up with clever macros that
switch between irq disable and cmpxchg_local depending on the
architecture, but all the results were awkward.

I guess it's time for me to repost my patchset. I use interrupt disable
to emulate cmpxchg_local on architectures that lack atomic ops; a rough
sketch of that fallback appears after the series list below.

# cmpxchg_local and cmpxchg64_local standardization
add-cmpxchg-local-to-generic-for-up.patch
i386-cmpxchg64-80386-80486-fallback.patch
add-cmpxchg64-to-alpha.patch
add-cmpxchg64-to-mips.patch
add-cmpxchg64-to-powerpc.patch
add-cmpxchg64-to-x86_64.patch
#
add-cmpxchg-local-to-arm.patch
add-cmpxchg-local-to-avr32.patch
add-cmpxchg-local-to-blackfin.patch
add-cmpxchg-local-to-cris.patch
add-cmpxchg-local-to-frv.patch
add-cmpxchg-local-to-h8300.patch
add-cmpxchg-local-to-ia64.patch
add-cmpxchg-local-to-m32r.patch
fix-m32r-__xchg.patch
fix-m32r-include-sched-h-in-smpboot.patch
local_t_m32r_optimized.patch
add-cmpxchg-local-to-m68k.patch
add-cmpxchg-local-to-m68knommu.patch
add-cmpxchg-local-to-parisc.patch
add-cmpxchg-local-to-ppc.patch
add-cmpxchg-local-to-s390.patch
add-cmpxchg-local-to-sh.patch
add-cmpxchg-local-to-sh64.patch
add-cmpxchg-local-to-sparc.patch
add-cmpxchg-local-to-sparc64.patch
add-cmpxchg-local-to-v850.patch
add-cmpxchg-local-to-xtensa.patch
#
slub-use-cmpxchg-local.patch
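
For reference, the irq-disable fallback mentioned above amounts to roughly
the following (a simplified sketch of the unsigned long case only; the
actual generic patch handles the different operand sizes):

static inline unsigned long __cmpxchg_local_generic(volatile unsigned long *ptr,
		unsigned long old, unsigned long new)
{
	unsigned long flags, prev;

	/*
	 * Only atomic with respect to interrupts on this CPU: another CPU
	 * can still race with us, so this may only be used on per-cpu data.
	 */
	local_irq_save(flags);
	prev = *ptr;
	if (prev == old)
		*ptr = new;
	local_irq_restore(flags);
	return prev;
}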

Mathieu

-- 
Mathieu Desnoyers
Computer Engineering Ph.D. Student, Ecole Polytechnique de Montreal
OpenPGP key fingerprint: 8CD5 52C3 8E3C 4140 715F  BA06 3F25 A8FE 3BAE 9A68


* [PATCH] local_t Documentation update 2
  2007-10-30 18:58     ` Christoph Lameter
  2007-10-30 19:12       ` Mathieu Desnoyers
@ 2007-10-31  1:52       ` Mathieu Desnoyers
  1 sibling, 0 replies; 4+ messages in thread
From: Mathieu Desnoyers @ 2007-10-31  1:52 UTC (permalink / raw)
  To: Andrew Morton
  Cc: matthew, linux-kernel, linux-mm, penberg, linux-arch,
	Christoph Lameter

local_t Documentation update 2

(This patch seems to have fallen off the grid, but it still provides
useful information. It applies to 2.6.23-mm1.)

Grant Grundler was asking for more detail about correct usage of local atomic
operations and suggested adding the resulting summary to local_ops.txt.

"Please add a bit more detail. If DaveM is correct (he normally is), then
there must be limits on how the local_t can be used in the kernel process
and interrupt contexts. I'd like those rules spelled out very clearly
since it's easy to get wrong and tracking down such a bug is quite painful."

Signed-off-by: Mathieu Desnoyers <mathieu.desnoyers@polymtl.ca>
Signed-off-by: Grant Grundler <grundler@parisc-linux.org>
---
 Documentation/local_ops.txt |   23 +++++++++++++++++++++++
 1 file changed, 23 insertions(+)

Index: linux-2.6-lttng/Documentation/local_ops.txt
===================================================================
--- linux-2.6-lttng.orig/Documentation/local_ops.txt	2007-09-04 11:53:23.000000000 -0400
+++ linux-2.6-lttng/Documentation/local_ops.txt	2007-09-04 12:19:31.000000000 -0400
@@ -68,6 +68,29 @@ typedef struct { atomic_long_t a; } loca
   variable can be read when reading some _other_ cpu's variables.
 
 
+* Rules to follow when using local atomic operations
+
+- Variables touched by local ops must be per cpu variables.
+- _Only_ the CPU owner of these variables must write to them.
+- This CPU can use local ops from any context (process, irq, softirq, nmi, ...)
+  to update its local_t variables.
+- Preemption (or interrupts) must be disabled when using local ops in
+  process context to   make sure the process won't be migrated to a
+  different CPU between getting the per-cpu variable and doing the
+  actual local op.
+- When using local ops in interrupt context, no special care must be
+  taken on a mainline kernel, since they will run on the local CPU with
+  preemption already disabled. I suggest, however, to explicitly
+  disable preemption anyway to make sure it will still work correctly on
+  -rt kernels.
+- Reading the local cpu variable will provide the current copy of the
+  variable.
+- Reads of these variables can be done from any CPU, because updates to
+  "long", aligned, variables are always atomic. Since no memory
+  synchronization is done by the writer CPU, an outdated copy of the
+  variable can be read when reading some _other_ cpu's variables.
+
+
 * How to use local atomic operations
 
 #include <linux/percpu.h>
-- 
Mathieu Desnoyers
Computer Engineering Ph.D. Student, Ecole Polytechnique de Montreal
OpenPGP key fingerprint: 8CD5 52C3 8E3C 4140 715F  BA06 3F25 A8FE 3BAE 9A68
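
As a closing illustration of the rules documented above, a per-cpu local_t
counter could be used roughly like this (hypothetical names; this example is
not part of the patch, but it follows the same pattern as the "How to use"
section of local_ops.txt):

#include <linux/percpu.h>
#include <asm/local.h>

static DEFINE_PER_CPU(local_t, hypothetical_events) = LOCAL_INIT(0);

/* Only the owning CPU writes; preemption is disabled across the update. */
static void count_event(void)
{
	local_inc(&get_cpu_var(hypothetical_events));
	put_cpu_var(hypothetical_events);
}

/* Any CPU may read, but values from other CPUs may be slightly stale. */
static long sum_events(void)
{
	long sum = 0;
	int cpu;

	for_each_online_cpu(cpu)
		sum += local_read(&per_cpu(hypothetical_events, cpu));
	return sum;
}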

