* [this_cpu_xx V5 00/19] Introduce per cpu atomic operations and avoid per cpu address arithmetic
@ 2009-10-06 23:36 cl
  2009-10-06 23:36 ` [this_cpu_xx V5 01/19] Introduce this_cpu_ptr() and generic this_cpu_* operations cl
                   ` (17 more replies)
  0 siblings, 18 replies; 57+ messages in thread
From: cl @ 2009-10-06 23:36 UTC (permalink / raw)
  To: akpm; +Cc: linux-kernel, Tejun Heo, Mel Gorman, Pekka Enberg

V4->V5:
- Avoid setup_per_cpu_area() modifications and fold the remainder of the
  patch into the page allocator patch.
- Irq disable / per cpu ptr fixes for page allocator patch.

V3->V4:
- Fix various macro definitions.
- Provide experimental percpu based fastpath that does not disable
  interrupts for SLUB.

V2->V3:
- Available via git tree against latest upstream from
	 git://git.kernel.org/pub/scm/linux/kernel/git/christoph/percpu.git linus
- Rework SLUB per cpu operations. Get rid of dynamic DMA slab creation
  for CONFIG_ZONE_DMA
- Create fallback framework so that 64 bit ops on 32 bit platforms
  can fallback to the use of preempt or interrupt disable. 64 bit
  platforms can use 64 bit atomic per cpu ops.

V1->V2:
- Various minor fixes
- Add SLUB conversion
- Add Page allocator conversion
- Patch against the git tree of today

The patchset introduces various operations to allow efficient access
to per cpu variables for the current processor. Currently there is
no way in the core to calculate the address of the instance
of a per cpu variable without a table lookup. So we see a lot of

	per_cpu_ptr(x, smp_processor_id())

The patchset introduces a way to calculate the address using the offset
that is available in arch specific ways (register or special memory
locations) using

	this_cpu_ptr(x)

In addition, macros are provided that can operate on per cpu
variables in a per cpu atomic way. With these, scalars in structures
allocated with the new percpu allocator can be modified without disabling
preemption or interrupts. This works by generating a single instruction that
does both the relocation of the address to the proper percpu area and
the RMW action.

F.e.

	this_cpu_add(x->var, 20)

can be used to generate an instruction that uses a segment register for the
relocation of the per cpu address into the per cpu area of the current processor
and then increments the variable by 20. The instruction cannot be interrupted
and therefore the modification is atomic vs the cpu (it either happens completely
or not at all). Rescheduling or an interrupt can only happen before or after the
instruction.
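
As an illustration (a hypothetical sketch, not taken from the patchset; the
struct and variable names are made up), a counter embedded in a structure
obtained from the new percpu allocator could then be updated without any
preempt or interrupt disabling:

	/* Hypothetical example -- names are illustrative only */
	struct my_stats {
		unsigned long packets;
		unsigned long bytes;
	};

	struct my_stats *stats = alloc_percpu(struct my_stats);

	/* Single per cpu atomic RMW on this cpu's instance */
	this_cpu_add(stats->bytes, 20);
	this_cpu_inc(stats->packets);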

Per cpu atomicity does not provide protection from concurrent modifications by
other processors. In general per cpu data is modified only from the processor
that the per cpu area is associated with. So per cpu atomicity provides a fast
and effective means of dealing with concurrency. It may allow the development of
better fastpaths for allocators and other important subsystems.

The per cpu atomic RMW operations can be used to avoid having to dimension
NR_CPUS-sized pointer arrays in the allocators (patches for the page allocator
and SLUB are provided) and to avoid pointer lookups in the hot paths of the
allocators, thereby decreasing the latency of critical OS paths. The macros
could be used to revise the critical paths in the allocators to no longer need
to disable interrupts (not included).

Per cpu atomic RMW operations are useful to decrease the overhead of counter
maintenance in the kernel. A this_cpu_inc(), for example, can generate a single
instruction that needs no registers on x86. Preempt disable/enable pairs can
be avoided in many places.
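
For example (an illustrative sketch; the per cpu variable name is made up),
a typical counter update and the equivalent this_cpu form look like this:

	/* Hypothetical per cpu event counter */
	DEFINE_PER_CPU(unsigned long, nr_events);

	/* Classic form: disables and reenables preemption */
	get_cpu_var(nr_events)++;
	put_cpu_var(nr_events);

	/* this_cpu form: a single per cpu atomic instruction on x86 */
	this_cpu_inc(per_cpu_var(nr_events));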

The patchset reduces code size and increases the speed of operations on
dynamically allocated per cpu based statistics. A set of patches modifies
the fastpaths of the SLUB allocator, reducing code size and cache footprint
through the per cpu atomic operations.

The patchset depends on all arches supporting the new per cpu allocator.
IA64 still uses the old percpu allocator. Tejun has patches to fix up IA64,
and they were approved by Tony Luck, but the IA64 patches have not been
merged yet.


^ permalink raw reply	[flat|nested] 57+ messages in thread

* [this_cpu_xx V5 01/19] Introduce this_cpu_ptr() and generic this_cpu_* operations
  2009-10-06 23:36 [this_cpu_xx V5 00/19] Introduce per cpu atomic operations and avoid per cpu address arithmetic cl
@ 2009-10-06 23:36 ` cl
  2009-10-06 23:52   ` Tejun Heo
  2009-10-06 23:36 ` [this_cpu_xx V5 02/19] this_cpu: X86 optimized this_cpu operations cl
                   ` (16 subsequent siblings)
  17 siblings, 1 reply; 57+ messages in thread
From: cl @ 2009-10-06 23:36 UTC (permalink / raw)
  To: akpm
  Cc: linux-kernel, David Howells, Tejun Heo, Ingo Molnar,
	Rusty Russell, Eric Dumazet, Mel Gorman, Pekka Enberg

[-- Attachment #1: this_cpu_ptr_intro --]
[-- Type: text/plain, Size: 19850 bytes --]

This patch introduces two things: First this_cpu_ptr and then per cpu
atomic operations.

this_cpu_ptr
------------

A common operation when dealing with cpu data is to get the instance of the
cpu data associated with the currently executing processor. This can be
optimized by

this_cpu_ptr(xx) = per_cpu_ptr(xx, smp_processor_id()).

The problem with per_cpu_ptr(x, smp_processor_id()) is that it requires
an array lookup to find the offset for the cpu. Processors typically
have the offset for the current cpu area in some kind of (arch dependent)
efficiently accessible register or memory location.

We can use that instead of doing the array lookup to speed up the
determination of the address of the percpu variable. This is particularly
significant because these lookups occur in performance critical paths
of the core kernel. this_cpu_ptr() thus avoids these extra memory accesses.
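
In terms of the generic helpers added below, the difference boils down to
which offset is fed to SHIFT_PERCPU_PTR() (simplified sketch of the
asm-generic definitions):

	/* per_cpu_ptr(): offset obtained by an array lookup indexed by cpu */
	SHIFT_PERCPU_PTR(ptr, per_cpu_offset(smp_processor_id()))

	/* this_cpu_ptr(): offset of the current cpu, available in a register
	 * or an efficiently accessible memory location provided by the arch */
	SHIFT_PERCPU_PTR(ptr, my_cpu_offset)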

this_cpu_ptr comes in two flavors. The preemption context matters since we
are referring to the currently executing processor. In many cases we must
ensure that the processor does not change while a code segment is executed.

__this_cpu_ptr 	-> Do not check for preemption context
this_cpu_ptr	-> Check preemption context

The parameter to these operations is a per cpu pointer. This can be the
address of a statically defined per cpu variable (&per_cpu_var(xxx)) or
the address of a per cpu variable allocated with the per cpu allocator.
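
For instance (sketch only; the type and variable names are made up), both
forms yield a per cpu pointer that these operations can relocate:

	/* Hypothetical examples */
	DEFINE_PER_CPU(struct my_data, my_data);
	struct my_data *p = this_cpu_ptr(&per_cpu_var(my_data));	/* static */

	struct my_data *base = alloc_percpu(struct my_data);
	struct my_data *q = this_cpu_ptr(base);				/* dynamic */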

per cpu atomic operations: this_cpu_*(var, val)
-----------------------------------------------
this_cpu_* operations (like this_cpu_add(struct->y, value)) operate on
arbitrary scalars that are members of structures allocated with the new
per cpu allocator. They can also operate on static per_cpu variables
if they are passed through per_cpu_var() (see the patch that uses this_cpu_*
operations for VM statistics).

These operations are guaranteed to be atomic vs preemption when modifying
the scalar. The calculation of the per cpu offset is also guaranteed to
be atomic at the same time. This means that a this_cpu_* operation can be
safely used to modify a per cpu variable in a context where interrupts are
enabled and preemption is allowed. Many architectures can perform such
a per cpu atomic operation with a single instruction.

Note that the atomicity here is different from regular atomic operations.
Atomicity is only guaranteed for data accessed from the currently executing
processor. Modifications from other processors are still possible. There
must be other guarantees that the per cpu data is not modified from another
processor when using these instructions. The per cpu atomicity is created
by the fact that the processor either executes an instruction or not.
Embedded in the instruction is the relocation of the per cpu address to
the area reserved for the current processor and the RMW action. Therefore
interrupts or preemption cannot occur in the midst of this processing.

Generic fallback functions are used if an arch does not define optimized
this_cpu operations. The functions also come in the two flavors used
for this_cpu_ptr().

The first parameter is a scalar that is a member of a structure allocated
through the per cpu allocator or a per cpu variable (use per_cpu_var(xxx)). The
operations are similar to what percpu_add() and friends do.

this_cpu_read(scalar)
this_cpu_write(scalar, value)
this_cpu_add(scalar, value)
this_cpu_sub(scalar, value)
this_cpu_inc(scalar)
this_cpu_dec(scalar)
this_cpu_and(scalar, value)
this_cpu_or(scalar, value)
this_cpu_xor(scalar, value)
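
Where an arch does not provide an optimized implementation, each of these
expands to the generic fallback added below; this_cpu_add(), for example,
then behaves roughly like this sketch:

	/* Rough equivalent of the generic this_cpu_add(pcp, val) fallback */
	preempt_disable();
	*__this_cpu_ptr(&(pcp)) += (val);
	preempt_enable();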

Arch code can override the generic functions and provide optimized atomic
per cpu operations. These atomic operations must provide both the relocation
(x86 does it through a segment override) and the operation on the data in a
single instruction. Otherwise preempt needs to be disabled and there is no
gain from providing arch implementations.

A third variant, prefixed with irqsafe_, is provided. These variants are safe
against hardware interrupts on the *same* processor (all per cpu atomic
primitives are *always* *only* providing safety for code running on the
*same* processor!). The modification needs to be implemented by the hardware
in such a way that it is a single RMW instruction that is processed either
entirely before or entirely after an interrupt.
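
As an illustration (hypothetical variable name), a statistic that is also
updated from an interrupt handler on the same cpu can use the irqsafe_
variant instead of an explicit local_irq_save()/local_irq_restore() pair:

	/* Hypothetical counter also bumped from interrupt context */
	DEFINE_PER_CPU(unsigned long, irq_events);

	/* Safe against an interrupt on this cpu touching the same counter */
	irqsafe_cpu_inc(per_cpu_var(irq_events));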

cc: David Howells <dhowells@redhat.com>
cc: Tejun Heo <tj@kernel.org>
cc: Ingo Molnar <mingo@elte.hu>
cc: Rusty Russell <rusty@rustcorp.com.au>
cc: Eric Dumazet <dada1@cosmosbay.com>
Acked-by: Tejun Heo <tj@kernel.org>
Signed-off-by: Christoph Lameter <cl@linux-foundation.org>

---
 include/asm-generic/percpu.h |    5 
 include/linux/percpu.h       |  400 +++++++++++++++++++++++++++++++++++++++++++
 2 files changed, 405 insertions(+)

Index: linux-2.6/include/linux/percpu.h
===================================================================
--- linux-2.6.orig/include/linux/percpu.h	2009-10-01 16:14:48.000000000 -0500
+++ linux-2.6/include/linux/percpu.h	2009-10-02 12:05:24.000000000 -0500
@@ -243,4 +243,404 @@ do {									\
 # define percpu_xor(var, val)		__percpu_generic_to_op(var, (val), ^=)
 #endif
 
+/*
+ * Branching function to split up a function into a set of functions that
+ * are called for different scalar sizes of the objects handled.
+ */
+
+extern void __bad_size_call_parameter(void);
+
+#define __size_call_return(stem, variable)				\
+({	typeof(variable) ret__;						\
+	switch(sizeof(variable)) {					\
+	case 1: ret__ = stem##1(variable);break;			\
+	case 2: ret__ = stem##2(variable);break;			\
+	case 4: ret__ = stem##4(variable);break;			\
+	case 8: ret__ = stem##8(variable);break;			\
+	default:							\
+		__bad_size_call_parameter();break;			\
+	}								\
+	ret__;								\
+})
+
+#define __size_call(stem, variable, ...)				\
+do {									\
+	switch(sizeof(variable)) {					\
+		case 1: stem##1(variable, __VA_ARGS__);break;		\
+		case 2: stem##2(variable, __VA_ARGS__);break;		\
+		case 4: stem##4(variable, __VA_ARGS__);break;		\
+		case 8: stem##8(variable, __VA_ARGS__);break;		\
+		default: 						\
+			__bad_size_call_parameter();break;		\
+	}								\
+} while (0)
+
+/*
+ * Optimized manipulation for memory allocated through the per cpu
+ * allocator or for addresses of per cpu variables (can be determined
+ * using per_cpu_var(xx)).
+ *
+ * These operations guarantee exclusivity of access for other operations
+ * on the *same* processor. The assumption is that per cpu data is only
+ * accessed by a single processor instance (the current one).
+ *
+ * The first group is used for accesses that must be done in a
+ * preemption safe way since we know that the context is not preempt
+ * safe. Interrupts may occur. If the interrupt modifies the variable
+ * too then RMW actions will not be reliable.
+ *
+ * The arch code can provide optimized functions in two ways:
+ *
+ * 1. Override the function completely. F.e. define this_cpu_add().
+ *    The arch must then ensure that the various scalar formats passed
+ *    are handled correctly.
+ *
+ * 2. Provide functions for certain scalar sizes. F.e. provide
+ *    this_cpu_add_2() to provide per cpu atomic operations for 2 byte
+ *    sized RMW actions. If arch code does not provide operations for
+ *    a scalar size then the fallback in the generic code will be
+ *    used.
+ */
+
+#define _this_cpu_generic_read(pcp)					\
+({	typeof(pcp) ret__;						\
+	preempt_disable();						\
+	ret__ = *this_cpu_ptr(&(pcp));					\
+	preempt_enable();						\
+	ret__;								\
+})
+
+#ifndef this_cpu_read
+# ifndef this_cpu_read_1
+#  define this_cpu_read_1(pcp)	_this_cpu_generic_read(pcp)
+# endif
+# ifndef this_cpu_read_2
+#  define this_cpu_read_2(pcp)	_this_cpu_generic_read(pcp)
+# endif
+# ifndef this_cpu_read_4
+#  define this_cpu_read_4(pcp)	_this_cpu_generic_read(pcp)
+# endif
+# ifndef this_cpu_read_8
+#  define this_cpu_read_8(pcp)	_this_cpu_generic_read(pcp)
+# endif
+# define this_cpu_read(pcp)	__size_call_return(this_cpu_read_, (pcp))
+#endif
+
+#define _this_cpu_generic_to_op(pcp, val, op)				\
+do {									\
+	preempt_disable();						\
+	*__this_cpu_ptr(&pcp) op val;					\
+	preempt_enable();						\
+} while (0)
+
+#ifndef this_cpu_write
+# ifndef this_cpu_write_1
+#  define this_cpu_write_1(pcp, val)	_this_cpu_generic_to_op((pcp), (val), =)
+# endif
+# ifndef this_cpu_write_2
+#  define this_cpu_write_2(pcp, val)	_this_cpu_generic_to_op((pcp), (val), =)
+# endif
+# ifndef this_cpu_write_4
+#  define this_cpu_write_4(pcp, val)	_this_cpu_generic_to_op((pcp), (val), =)
+# endif
+# ifndef this_cpu_write_8
+#  define this_cpu_write_8(pcp, val)	_this_cpu_generic_to_op((pcp), (val), =)
+# endif
+# define this_cpu_write(pcp, val)	__size_call(this_cpu_write_, (pcp), (val))
+#endif
+
+#ifndef this_cpu_add
+# ifndef this_cpu_add_1
+#  define this_cpu_add_1(pcp, val)	_this_cpu_generic_to_op((pcp), (val), +=)
+# endif
+# ifndef this_cpu_add_2
+#  define this_cpu_add_2(pcp, val)	_this_cpu_generic_to_op((pcp), (val), +=)
+# endif
+# ifndef this_cpu_add_4
+#  define this_cpu_add_4(pcp, val)	_this_cpu_generic_to_op((pcp), (val), +=)
+# endif
+# ifndef this_cpu_add_8
+#  define this_cpu_add_8(pcp, val)	_this_cpu_generic_to_op((pcp), (val), +=)
+# endif
+# define this_cpu_add(pcp, val)		__size_call(this_cpu_add_, (pcp), (val))
+#endif
+
+#ifndef this_cpu_sub
+# define this_cpu_sub(pcp, val)		this_cpu_add((pcp), -(val))
+#endif
+
+#ifndef this_cpu_inc
+# define this_cpu_inc(pcp)		this_cpu_add((pcp), 1)
+#endif
+
+#ifndef this_cpu_dec
+# define this_cpu_dec(pcp)		this_cpu_sub((pcp), 1)
+#endif
+
+#ifndef this_cpu_and
+# ifndef this_cpu_and_1
+#  define this_cpu_and_1(pcp, val)	_this_cpu_generic_to_op((pcp), (val), &=)
+# endif
+# ifndef this_cpu_and_2
+#  define this_cpu_and_2(pcp, val)	_this_cpu_generic_to_op((pcp), (val), &=)
+# endif
+# ifndef this_cpu_and_4
+#  define this_cpu_and_4(pcp, val)	_this_cpu_generic_to_op((pcp), (val), &=)
+# endif
+# ifndef this_cpu_and_8
+#  define this_cpu_and_8(pcp, val)	_this_cpu_generic_to_op((pcp), (val), &=)
+# endif
+# define this_cpu_and(pcp, val)		__size_call(this_cpu_and_, (pcp), (val))
+#endif
+
+#ifndef this_cpu_or
+# ifndef this_cpu_or_1
+#  define this_cpu_or_1(pcp, val)	_this_cpu_generic_to_op((pcp), (val), |=)
+# endif
+# ifndef this_cpu_or_2
+#  define this_cpu_or_2(pcp, val)	_this_cpu_generic_to_op((pcp), (val), |=)
+# endif
+# ifndef this_cpu_or_4
+#  define this_cpu_or_4(pcp, val)	_this_cpu_generic_to_op((pcp), (val), |=)
+# endif
+# ifndef this_cpu_or_8
+#  define this_cpu_or_8(pcp, val)	_this_cpu_generic_to_op((pcp), (val), |=)
+# endif
+# define this_cpu_or(pcp, val)		__size_call(this_cpu_or_, (pcp), (val))
+#endif
+
+#ifndef this_cpu_xor
+# ifndef this_cpu_xor_1
+#  define this_cpu_xor_1(pcp, val)	_this_cpu_generic_to_op((pcp), (val), ^=)
+# endif
+# ifndef this_cpu_xor_2
+#  define this_cpu_xor_2(pcp, val)	_this_cpu_generic_to_op((pcp), (val), ^=)
+# endif
+# ifndef this_cpu_xor_4
+#  define this_cpu_xor_4(pcp, val)	_this_cpu_generic_to_op((pcp), (val), ^=)
+# endif
+# ifndef this_cpu_xor_8
+#  define this_cpu_xor_8(pcp, val)	_this_cpu_generic_to_op((pcp), (val), ^=)
+# endif
+# define this_cpu_xor(pcp, val)		__size_call(this_cpu_xor_, (pcp), (val))
+#endif
+
+/*
+ * Generic percpu operations that do not require preemption handling.
+ * Either we do not care about races or the caller has the
+ * responsibility of handling preemption issues. Arch code can still
+ * override these instructions since the arch per cpu code may be more
+ * efficient and may actually get race freeness for free (that is the
+ * case for x86 for example).
+ *
+ * If there is no other protection through preempt disable and/or
+ * disabling interrupts then one of these RMW operations can show unexpected
+ * behavior because the execution thread was rescheduled on another processor
+ * or an interrupt occurred and the same percpu variable was modified from
+ * the interrupt context.
+ */
+#ifndef __this_cpu_read
+# ifndef __this_cpu_read_1
+#  define __this_cpu_read_1(pcp)	(*__this_cpu_ptr(&(pcp)))
+# endif
+# ifndef __this_cpu_read_2
+#  define __this_cpu_read_2(pcp)	(*__this_cpu_ptr(&(pcp)))
+# endif
+# ifndef __this_cpu_read_4
+#  define __this_cpu_read_4(pcp)	(*__this_cpu_ptr(&(pcp)))
+# endif
+# ifndef __this_cpu_read_8
+#  define __this_cpu_read_8(pcp)	(*__this_cpu_ptr(&(pcp)))
+# endif
+# define __this_cpu_read(pcp)	__size_call_return(__this_cpu_read_, (pcp))
+#endif
+
+#define __this_cpu_generic_to_op(pcp, val, op)				\
+do {									\
+	*__this_cpu_ptr(&(pcp)) op val;					\
+} while (0)
+
+#ifndef __this_cpu_write
+# ifndef __this_cpu_write_1
+#  define __this_cpu_write_1(pcp, val)	__this_cpu_generic_to_op((pcp), (val), =)
+# endif
+# ifndef __this_cpu_write_2
+#  define __this_cpu_write_2(pcp, val)	__this_cpu_generic_to_op((pcp), (val), =)
+# endif
+# ifndef __this_cpu_write_4
+#  define __this_cpu_write_4(pcp, val)	__this_cpu_generic_to_op((pcp), (val), =)
+# endif
+# ifndef __this_cpu_write_8
+#  define __this_cpu_write_8(pcp, val)	__this_cpu_generic_to_op((pcp), (val), =)
+# endif
+# define __this_cpu_write(pcp, val)	__size_call(__this_cpu_write_, (pcp), (val))
+#endif
+
+#ifndef __this_cpu_add
+# ifndef __this_cpu_add_1
+#  define __this_cpu_add_1(pcp, val)	__this_cpu_generic_to_op((pcp), (val), +=)
+# endif
+# ifndef __this_cpu_add_2
+#  define __this_cpu_add_2(pcp, val)	__this_cpu_generic_to_op((pcp), (val), +=)
+# endif
+# ifndef __this_cpu_add_4
+#  define __this_cpu_add_4(pcp, val)	__this_cpu_generic_to_op((pcp), (val), +=)
+# endif
+# ifndef __this_cpu_add_8
+#  define __this_cpu_add_8(pcp, val)	__this_cpu_generic_to_op((pcp), (val), +=)
+# endif
+# define __this_cpu_add(pcp, val)	__size_call(__this_cpu_add_, (pcp), (val))
+#endif
+
+#ifndef __this_cpu_sub
+# define __this_cpu_sub(pcp, val)	__this_cpu_add((pcp), -(val))
+#endif
+
+#ifndef __this_cpu_inc
+# define __this_cpu_inc(pcp)		__this_cpu_add((pcp), 1)
+#endif
+
+#ifndef __this_cpu_dec
+# define __this_cpu_dec(pcp)		__this_cpu_sub((pcp), 1)
+#endif
+
+#ifndef __this_cpu_and
+# ifndef __this_cpu_and_1
+#  define __this_cpu_and_1(pcp, val)	__this_cpu_generic_to_op((pcp), (val), &=)
+# endif
+# ifndef __this_cpu_and_2
+#  define __this_cpu_and_2(pcp, val)	__this_cpu_generic_to_op((pcp), (val), &=)
+# endif
+# ifndef __this_cpu_and_4
+#  define __this_cpu_and_4(pcp, val)	__this_cpu_generic_to_op((pcp), (val), &=)
+# endif
+# ifndef __this_cpu_and_8
+#  define __this_cpu_and_8(pcp, val)	__this_cpu_generic_to_op((pcp), (val), &=)
+# endif
+# define __this_cpu_and(pcp, val)	__size_call(__this_cpu_and_, (pcp), (val))
+#endif
+
+#ifndef __this_cpu_or
+# ifndef __this_cpu_or_1
+#  define __this_cpu_or_1(pcp, val)	__this_cpu_generic_to_op((pcp), (val), |=)
+# endif
+# ifndef __this_cpu_or_2
+#  define __this_cpu_or_2(pcp, val)	__this_cpu_generic_to_op((pcp), (val), |=)
+# endif
+# ifndef __this_cpu_or_4
+#  define __this_cpu_or_4(pcp, val)	__this_cpu_generic_to_op((pcp), (val), |=)
+# endif
+# ifndef __this_cpu_or_8
+#  define __this_cpu_or_8(pcp, val)	__this_cpu_generic_to_op((pcp), (val), |=)
+# endif
+# define __this_cpu_or(pcp, val)	__size_call(__this_cpu_or_, (pcp), (val))
+#endif
+
+#ifndef __this_cpu_xor
+# ifndef __this_cpu_xor_1
+#  define __this_cpu_xor_1(pcp, val)	__this_cpu_generic_to_op((pcp), (val), ^=)
+# endif
+# ifndef __this_cpu_xor_2
+#  define __this_cpu_xor_2(pcp, val)	__this_cpu_generic_to_op((pcp), (val), ^=)
+# endif
+# ifndef __this_cpu_xor_4
+#  define __this_cpu_xor_4(pcp, val)	__this_cpu_generic_to_op((pcp), (val), ^=)
+# endif
+# ifndef __this_cpu_xor_8
+#  define __this_cpu_xor_8(pcp, val)	__this_cpu_generic_to_op((pcp), (val), ^=)
+# endif
+# define __this_cpu_xor(pcp, val)	__size_call(__this_cpu_xor_, (pcp), (val))
+#endif
+
+/*
+ * IRQ safe versions of the per cpu RMW operations. Note that these operations
+ * are *not* safe against modification of the same variable from another
+ * processor (which one gets when using regular atomic operations).
+ * They are guaranteed to be atomic vs. local interrupts and
+ * preemption only.
+ */
+#define irqsafe_cpu_generic_to_op(pcp, val, op)				\
+do {									\
+	unsigned long flags;						\
+	local_irq_save(flags);						\
+	*__this_cpu_ptr(&(pcp)) op val;					\
+	local_irq_restore(flags);					\
+} while (0)
+
+#ifndef irqsafe_cpu_add
+# ifndef irqsafe_cpu_add_1
+#  define irqsafe_cpu_add_1(pcp, val) irqsafe_cpu_generic_to_op((pcp), (val), +=)
+# endif
+# ifndef irqsafe_cpu_add_2
+#  define irqsafe_cpu_add_2(pcp, val) irqsafe_cpu_generic_to_op((pcp), (val), +=)
+# endif
+# ifndef irqsafe_cpu_add_4
+#  define irqsafe_cpu_add_4(pcp, val) irqsafe_cpu_generic_to_op((pcp), (val), +=)
+# endif
+# ifndef irqsafe_cpu_add_8
+#  define irqsafe_cpu_add_8(pcp, val) irqsafe_cpu_generic_to_op((pcp), (val), +=)
+# endif
+# define irqsafe_cpu_add(pcp, val) __size_call(irqsafe_cpu_add_, (pcp), (val))
+#endif
+
+#ifndef irqsafe_cpu_sub
+# define irqsafe_cpu_sub(pcp, val)	irqsafe_cpu_add((pcp), -(val))
+#endif
+
+#ifndef irqsafe_cpu_inc
+# define irqsafe_cpu_inc(pcp)	irqsafe_cpu_add((pcp), 1)
+#endif
+
+#ifndef irqsafe_cpu_dec
+# define irqsafe_cpu_dec(pcp)	irqsafe_cpu_sub((pcp), 1)
+#endif
+
+#ifndef irqsafe_cpu_and
+# ifndef irqsafe_cpu_and_1
+#  define irqsafe_cpu_and_1(pcp, val) irqsafe_cpu_generic_to_op((pcp), (val), &=)
+# endif
+# ifndef irqsafe_cpu_and_2
+#  define irqsafe_cpu_and_2(pcp, val) irqsafe_cpu_generic_to_op((pcp), (val), &=)
+# endif
+# ifndef irqsafe_cpu_and_4
+#  define irqsafe_cpu_and_4(pcp, val) irqsafe_cpu_generic_to_op((pcp), (val), &=)
+# endif
+# ifndef irqsafe_cpu_and_8
+#  define irqsafe_cpu_and_8(pcp, val) irqsafe_cpu_generic_to_op((pcp), (val), &=)
+# endif
+# define irqsafe_cpu_and(pcp, val) __size_call(irqsafe_cpu_and_, (pcp), (val))
+#endif
+
+#ifndef irqsafe_cpu_or
+# ifndef irqsafe_cpu_or_1
+#  define irqsafe_cpu_or_1(pcp, val) irqsafe_cpu_generic_to_op((pcp), (val), |=)
+# endif
+# ifndef irqsafe_cpu_or_2
+#  define irqsafe_cpu_or_2(pcp, val) irqsafe_cpu_generic_to_op((pcp), (val), |=)
+# endif
+# ifndef irqsafe_cpu_or_4
+#  define irqsafe_cpu_or_4(pcp, val) irqsafe_cpu_generic_to_op((pcp), (val), |=)
+# endif
+# ifndef irqsafe_cpu_or_8
+#  define irqsafe_cpu_or_8(pcp, val) irqsafe_cpu_generic_to_op((pcp), (val), |=)
+# endif
+# define irqsafe_cpu_or(pcp, val) __size_call(irqsafe_cpu_or_, (pcp), (val))
+#endif
+
+#ifndef irqsafe_cpu_xor
+# ifndef irqsafe_cpu_xor_1
+#  define irqsafe_cpu_xor_1(pcp, val) irqsafe_cpu_generic_to_op((pcp), (val), ^=)
+# endif
+# ifndef irqsafe_cpu_xor_2
+#  define irqsafe_cpu_xor_2(pcp, val) irqsafe_cpu_generic_to_op((pcp), (val), ^=)
+# endif
+# ifndef irqsafe_cpu_xor_4
+#  define irqsafe_cpu_xor_4(pcp, val) irqsafe_cpu_generic_to_op((pcp), (val), ^=)
+# endif
+# ifndef irqsafe_cpu_xor_8
+#  define irqsafe_cpu_xor_8(pcp, val) irqsafe_cpu_generic_to_op((pcp), (val), ^=)
+# endif
+# define irqsafe_cpu_xor(pcp, val) __size_call(irqsafe_cpu_xor_, (pcp), (val))
+#endif
+
 #endif /* __LINUX_PERCPU_H */
Index: linux-2.6/include/asm-generic/percpu.h
===================================================================
--- linux-2.6.orig/include/asm-generic/percpu.h	2009-10-01 16:14:48.000000000 -0500
+++ linux-2.6/include/asm-generic/percpu.h	2009-10-02 13:11:42.000000000 -0500
@@ -56,6 +56,9 @@ extern unsigned long __per_cpu_offset[NR
 #define __raw_get_cpu_var(var) \
 	(*SHIFT_PERCPU_PTR(&per_cpu_var(var), __my_cpu_offset))
 
+#define this_cpu_ptr(ptr) SHIFT_PERCPU_PTR(ptr, my_cpu_offset)
+#define __this_cpu_ptr(ptr) SHIFT_PERCPU_PTR(ptr, __my_cpu_offset)
+
 
 #ifdef CONFIG_HAVE_SETUP_PER_CPU_AREA
 extern void setup_per_cpu_areas(void);
@@ -66,6 +69,8 @@ extern void setup_per_cpu_areas(void);
 #define per_cpu(var, cpu)			(*((void)(cpu), &per_cpu_var(var)))
 #define __get_cpu_var(var)			per_cpu_var(var)
 #define __raw_get_cpu_var(var)			per_cpu_var(var)
+#define this_cpu_ptr(ptr)			per_cpu_ptr(ptr, 0)
+#define __this_cpu_ptr(ptr)			this_cpu_ptr(ptr)
 
 #endif	/* SMP */
 

-- 

^ permalink raw reply	[flat|nested] 57+ messages in thread

* [this_cpu_xx V5 02/19] this_cpu: X86 optimized this_cpu operations
  2009-10-06 23:36 [this_cpu_xx V5 00/19] Introduce per cpu atomic operations and avoid per cpu address arithmetic cl
  2009-10-06 23:36 ` [this_cpu_xx V5 01/19] Introduce this_cpu_ptr() and generic this_cpu_* operations cl
@ 2009-10-06 23:36 ` cl
  2009-10-06 23:36 ` [this_cpu_xx V5 03/19] Use this_cpu operations for SNMP statistics cl
                   ` (15 subsequent siblings)
  17 siblings, 0 replies; 57+ messages in thread
From: cl @ 2009-10-06 23:36 UTC (permalink / raw)
  To: akpm; +Cc: linux-kernel, Tejun Heo, Ingo Molnar, Mel Gorman, Pekka Enberg

[-- Attachment #1: this_cpu_x86_ops --]
[-- Type: text/plain, Size: 5914 bytes --]

Basically the existing percpu ops can be used for the this_cpu variants, which
also allow operations on dynamically allocated percpu data. However, we do not
pass a reference to a percpu variable in; instead, a dynamically or statically
allocated percpu variable is provided.

The preempt, non-preempt and irqsafe operations generate the same code.
It will always be possible to provide the required per cpu atomicity in a
single RMW instruction with a segment override on x86.

64 bit this_cpu operations are not supported on 32 bit.
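
For a statically defined per cpu variable the resulting code looks roughly
like the following sketch (illustrative only; the exact operand form is
produced by percpu_to_op(), and the symbol name is made up):

	# this_cpu_inc() / this_cpu_add(..., 1) on 64 bit:
	# one instruction, relocated via the %gs segment prefix
	addq $1, %gs:per_cpu__my_counter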

Acked-by: Tejun Heo <tj@kernel.org>
Acked-by: Ingo Molnar <mingo@elte.hu>

Signed-off-by: Christoph Lameter <cl@linux-foundation.org>

---
 arch/x86/include/asm/percpu.h |   78 ++++++++++++++++++++++++++++++++++++++++++
 1 file changed, 78 insertions(+)

Index: linux-2.6/arch/x86/include/asm/percpu.h
===================================================================
--- linux-2.6.orig/arch/x86/include/asm/percpu.h	2009-10-01 09:08:37.000000000 -0500
+++ linux-2.6/arch/x86/include/asm/percpu.h	2009-10-01 09:29:43.000000000 -0500
@@ -153,6 +153,84 @@ do {							\
 #define percpu_or(var, val)	percpu_to_op("or", per_cpu__##var, val)
 #define percpu_xor(var, val)	percpu_to_op("xor", per_cpu__##var, val)
 
+#define __this_cpu_read_1(pcp)		percpu_from_op("mov", (pcp), "m"(pcp))
+#define __this_cpu_read_2(pcp)		percpu_from_op("mov", (pcp), "m"(pcp))
+#define __this_cpu_read_4(pcp)		percpu_from_op("mov", (pcp), "m"(pcp))
+
+#define __this_cpu_write_1(pcp, val)	percpu_to_op("mov", (pcp), val)
+#define __this_cpu_write_2(pcp, val)	percpu_to_op("mov", (pcp), val)
+#define __this_cpu_write_4(pcp, val)	percpu_to_op("mov", (pcp), val)
+#define __this_cpu_add_1(pcp, val)	percpu_to_op("add", (pcp), val)
+#define __this_cpu_add_2(pcp, val)	percpu_to_op("add", (pcp), val)
+#define __this_cpu_add_4(pcp, val)	percpu_to_op("add", (pcp), val)
+#define __this_cpu_and_1(pcp, val)	percpu_to_op("and", (pcp), val)
+#define __this_cpu_and_2(pcp, val)	percpu_to_op("and", (pcp), val)
+#define __this_cpu_and_4(pcp, val)	percpu_to_op("and", (pcp), val)
+#define __this_cpu_or_1(pcp, val)	percpu_to_op("or", (pcp), val)
+#define __this_cpu_or_2(pcp, val)	percpu_to_op("or", (pcp), val)
+#define __this_cpu_or_4(pcp, val)	percpu_to_op("or", (pcp), val)
+#define __this_cpu_xor_1(pcp, val)	percpu_to_op("xor", (pcp), val)
+#define __this_cpu_xor_2(pcp, val)	percpu_to_op("xor", (pcp), val)
+#define __this_cpu_xor_4(pcp, val)	percpu_to_op("xor", (pcp), val)
+
+#define this_cpu_read_1(pcp)		percpu_from_op("mov", (pcp), "m"(pcp))
+#define this_cpu_read_2(pcp)		percpu_from_op("mov", (pcp), "m"(pcp))
+#define this_cpu_read_4(pcp)		percpu_from_op("mov", (pcp), "m"(pcp))
+#define this_cpu_write_1(pcp, val)	percpu_to_op("mov", (pcp), val)
+#define this_cpu_write_2(pcp, val)	percpu_to_op("mov", (pcp), val)
+#define this_cpu_write_4(pcp, val)	percpu_to_op("mov", (pcp), val)
+#define this_cpu_add_1(pcp, val)	percpu_to_op("add", (pcp), val)
+#define this_cpu_add_2(pcp, val)	percpu_to_op("add", (pcp), val)
+#define this_cpu_add_4(pcp, val)	percpu_to_op("add", (pcp), val)
+#define this_cpu_and_1(pcp, val)	percpu_to_op("and", (pcp), val)
+#define this_cpu_and_2(pcp, val)	percpu_to_op("and", (pcp), val)
+#define this_cpu_and_4(pcp, val)	percpu_to_op("and", (pcp), val)
+#define this_cpu_or_1(pcp, val)		percpu_to_op("or", (pcp), val)
+#define this_cpu_or_2(pcp, val)		percpu_to_op("or", (pcp), val)
+#define this_cpu_or_4(pcp, val)		percpu_to_op("or", (pcp), val)
+#define this_cpu_xor_1(pcp, val)	percpu_to_op("xor", (pcp), val)
+#define this_cpu_xor_2(pcp, val)	percpu_to_op("xor", (pcp), val)
+#define this_cpu_xor_4(pcp, val)	percpu_to_op("xor", (pcp), val)
+
+#define irqsafe_cpu_add_1(pcp, val)	percpu_to_op("add", (pcp), val)
+#define irqsafe_cpu_add_2(pcp, val)	percpu_to_op("add", (pcp), val)
+#define irqsafe_cpu_add_4(pcp, val)	percpu_to_op("add", (pcp), val)
+#define irqsafe_cpu_and_1(pcp, val)	percpu_to_op("and", (pcp), val)
+#define irqsafe_cpu_and_2(pcp, val)	percpu_to_op("and", (pcp), val)
+#define irqsafe_cpu_and_4(pcp, val)	percpu_to_op("and", (pcp), val)
+#define irqsafe_cpu_or_1(pcp, val)	percpu_to_op("or", (pcp), val)
+#define irqsafe_cpu_or_2(pcp, val)	percpu_to_op("or", (pcp), val)
+#define irqsafe_cpu_or_4(pcp, val)	percpu_to_op("or", (pcp), val)
+#define irqsafe_cpu_xor_1(pcp, val)	percpu_to_op("xor", (pcp), val)
+#define irqsafe_cpu_xor_2(pcp, val)	percpu_to_op("xor", (pcp), val)
+#define irqsafe_cpu_xor_4(pcp, val)	percpu_to_op("xor", (pcp), val)
+
+/*
+ * Per cpu atomic 64 bit operations are only available under 64 bit.
+ * 32 bit must fall back to generic operations.
+ */
+#ifdef CONFIG_X86_64
+#define __this_cpu_read_8(pcp)		percpu_from_op("mov", (pcp), "m"(pcp))
+#define __this_cpu_write_8(pcp, val)	percpu_to_op("mov", (pcp), val)
+#define __this_cpu_add_8(pcp, val)	percpu_to_op("add", (pcp), val)
+#define __this_cpu_and_8(pcp, val)	percpu_to_op("and", (pcp), val)
+#define __this_cpu_or_8(pcp, val)	percpu_to_op("or", (pcp), val)
+#define __this_cpu_xor_8(pcp, val)	percpu_to_op("xor", (pcp), val)
+
+#define this_cpu_read_8(pcp)		percpu_from_op("mov", (pcp), "m"(pcp))
+#define this_cpu_write_8(pcp, val)	percpu_to_op("mov", (pcp), val)
+#define this_cpu_add_8(pcp, val)	percpu_to_op("add", (pcp), val)
+#define this_cpu_and_8(pcp, val)	percpu_to_op("and", (pcp), val)
+#define this_cpu_or_8(pcp, val)		percpu_to_op("or", (pcp), val)
+#define this_cpu_xor_8(pcp, val)	percpu_to_op("xor", (pcp), val)
+
+#define irqsafe_cpu_add_8(pcp, val)	percpu_to_op("add", (pcp), val)
+#define irqsafe_cpu_and_8(pcp, val)	percpu_to_op("and", (pcp), val)
+#define irqsafe_cpu_or_8(pcp, val)	percpu_to_op("or", (pcp), val)
+#define irqsafe_cpu_xor_8(pcp, val)	percpu_to_op("xor", (pcp), val)
+
+#endif
+
 /* This is not atomic against other CPUs -- CPU preemption needs to be off */
 #define x86_test_and_clear_bit_percpu(bit, var)				\
 ({									\

-- 

^ permalink raw reply	[flat|nested] 57+ messages in thread

* [this_cpu_xx V5 03/19] Use this_cpu operations for SNMP statistics
  2009-10-06 23:36 [this_cpu_xx V5 00/19] Introduce per cpu atomic operations and avoid per cpu address arithmetic cl
  2009-10-06 23:36 ` [this_cpu_xx V5 01/19] Introduce this_cpu_ptr() and generic this_cpu_* operations cl
  2009-10-06 23:36 ` [this_cpu_xx V5 02/19] this_cpu: X86 optimized this_cpu operations cl
@ 2009-10-06 23:36 ` cl
  2009-10-06 23:36 ` [this_cpu_xx V5 04/19] Use this_cpu operations for NFS statistics cl
                   ` (14 subsequent siblings)
  17 siblings, 0 replies; 57+ messages in thread
From: cl @ 2009-10-06 23:36 UTC (permalink / raw)
  To: akpm; +Cc: linux-kernel, Tejun Heo, Mel Gorman, Pekka Enberg

[-- Attachment #1: this_cpu_snmp --]
[-- Type: text/plain, Size: 2921 bytes --]

SNMP statistic macros can be significantly simplified.
This will also reduce code size if the arch supports these operations
in hardware.

Acked-by: Tejun Heo <tj@kernel.org>
Signed-off-by: Christoph Lameter <cl@linux-foundation.org>

---
 include/net/snmp.h |   50 ++++++++++++++++++--------------------------------
 1 file changed, 18 insertions(+), 32 deletions(-)

Index: linux-2.6/include/net/snmp.h
===================================================================
--- linux-2.6.orig/include/net/snmp.h	2009-09-30 11:37:26.000000000 -0500
+++ linux-2.6/include/net/snmp.h	2009-09-30 12:57:48.000000000 -0500
@@ -136,45 +136,31 @@ struct linux_xfrm_mib {
 #define SNMP_STAT_BHPTR(name)	(name[0])
 #define SNMP_STAT_USRPTR(name)	(name[1])
 
-#define SNMP_INC_STATS_BH(mib, field) 	\
-	(per_cpu_ptr(mib[0], raw_smp_processor_id())->mibs[field]++)
-#define SNMP_INC_STATS_USER(mib, field) \
-	do { \
-		per_cpu_ptr(mib[1], get_cpu())->mibs[field]++; \
-		put_cpu(); \
-	} while (0)
-#define SNMP_INC_STATS(mib, field) 	\
-	do { \
-		per_cpu_ptr(mib[!in_softirq()], get_cpu())->mibs[field]++; \
-		put_cpu(); \
-	} while (0)
-#define SNMP_DEC_STATS(mib, field) 	\
-	do { \
-		per_cpu_ptr(mib[!in_softirq()], get_cpu())->mibs[field]--; \
-		put_cpu(); \
-	} while (0)
-#define SNMP_ADD_STATS(mib, field, addend) 	\
-	do { \
-		per_cpu_ptr(mib[!in_softirq()], get_cpu())->mibs[field] += addend; \
-		put_cpu(); \
-	} while (0)
-#define SNMP_ADD_STATS_BH(mib, field, addend) 	\
-	(per_cpu_ptr(mib[0], raw_smp_processor_id())->mibs[field] += addend)
-#define SNMP_ADD_STATS_USER(mib, field, addend) 	\
-	do { \
-		per_cpu_ptr(mib[1], get_cpu())->mibs[field] += addend; \
-		put_cpu(); \
-	} while (0)
+#define SNMP_INC_STATS_BH(mib, field)	\
+			__this_cpu_inc(mib[0]->mibs[field])
+#define SNMP_INC_STATS_USER(mib, field)	\
+			this_cpu_inc(mib[1]->mibs[field])
+#define SNMP_INC_STATS(mib, field)	\
+			this_cpu_inc(mib[!in_softirq()]->mibs[field])
+#define SNMP_DEC_STATS(mib, field)	\
+			this_cpu_dec(mib[!in_softirq()]->mibs[field])
+#define SNMP_ADD_STATS_BH(mib, field, addend)	\
+			__this_cpu_add(mib[0]->mibs[field], addend)
+#define SNMP_ADD_STATS_USER(mib, field, addend)	\
+			this_cpu_add(mib[1]->mibs[field], addend)
 #define SNMP_UPD_PO_STATS(mib, basefield, addend)	\
 	do { \
-		__typeof__(mib[0]) ptr = per_cpu_ptr(mib[!in_softirq()], get_cpu());\
+		__typeof__(mib[0]) ptr; \
+		preempt_disable(); \
+		ptr = this_cpu_ptr((mib)[!in_softirq()]); \
 		ptr->mibs[basefield##PKTS]++; \
 		ptr->mibs[basefield##OCTETS] += addend;\
-		put_cpu(); \
+		preempt_enable(); \
 	} while (0)
 #define SNMP_UPD_PO_STATS_BH(mib, basefield, addend)	\
 	do { \
-		__typeof__(mib[0]) ptr = per_cpu_ptr(mib[!in_softirq()], raw_smp_processor_id());\
+		__typeof__(mib[0]) ptr = \
+			__this_cpu_ptr((mib)[!in_softirq()]); \
 		ptr->mibs[basefield##PKTS]++; \
 		ptr->mibs[basefield##OCTETS] += addend;\
 	} while (0)

-- 

^ permalink raw reply	[flat|nested] 57+ messages in thread

* [this_cpu_xx V5 04/19] Use this_cpu operations for NFS statistics
  2009-10-06 23:36 [this_cpu_xx V5 00/19] Introduce per cpu atomic operations and avoid per cpu address arithmetic cl
                   ` (2 preceding siblings ...)
  2009-10-06 23:36 ` [this_cpu_xx V5 03/19] Use this_cpu operations for SNMP statistics cl
@ 2009-10-06 23:36 ` cl
  2009-10-06 23:36 ` [this_cpu_xx V5 05/19] use this_cpu ops for network statistics cl
                   ` (13 subsequent siblings)
  17 siblings, 0 replies; 57+ messages in thread
From: cl @ 2009-10-06 23:36 UTC (permalink / raw)
  To: akpm; +Cc: linux-kernel, Tejun Heo, Trond Myklebust, Mel Gorman,
	Pekka Enberg

[-- Attachment #1: this_cpu_nfs --]
[-- Type: text/plain, Size: 1785 bytes --]

Simplify NFS statistics and allow the use of optimized
arch instructions.

Acked-by: Tejun Heo <tj@kernel.org>
CC: Trond Myklebust <trond.myklebust@fys.uio.no>
Signed-off-by: Christoph Lameter <cl@linux-foundation.org>

---
 fs/nfs/iostat.h |   24 +++---------------------
 1 file changed, 3 insertions(+), 21 deletions(-)

Index: linux-2.6/fs/nfs/iostat.h
===================================================================
--- linux-2.6.orig/fs/nfs/iostat.h	2009-09-29 11:57:01.000000000 -0500
+++ linux-2.6/fs/nfs/iostat.h	2009-09-29 12:26:42.000000000 -0500
@@ -25,13 +25,7 @@ struct nfs_iostats {
 static inline void nfs_inc_server_stats(const struct nfs_server *server,
 					enum nfs_stat_eventcounters stat)
 {
-	struct nfs_iostats *iostats;
-	int cpu;
-
-	cpu = get_cpu();
-	iostats = per_cpu_ptr(server->io_stats, cpu);
-	iostats->events[stat]++;
-	put_cpu();
+	this_cpu_inc(server->io_stats->events[stat]);
 }
 
 static inline void nfs_inc_stats(const struct inode *inode,
@@ -44,13 +38,7 @@ static inline void nfs_add_server_stats(
 					enum nfs_stat_bytecounters stat,
 					unsigned long addend)
 {
-	struct nfs_iostats *iostats;
-	int cpu;
-
-	cpu = get_cpu();
-	iostats = per_cpu_ptr(server->io_stats, cpu);
-	iostats->bytes[stat] += addend;
-	put_cpu();
+	this_cpu_add(server->io_stats->bytes[stat], addend);
 }
 
 static inline void nfs_add_stats(const struct inode *inode,
@@ -65,13 +53,7 @@ static inline void nfs_add_fscache_stats
 					 enum nfs_stat_fscachecounters stat,
 					 unsigned long addend)
 {
-	struct nfs_iostats *iostats;
-	int cpu;
-
-	cpu = get_cpu();
-	iostats = per_cpu_ptr(NFS_SERVER(inode)->io_stats, cpu);
-	iostats->fscache[stat] += addend;
-	put_cpu();
+	this_cpu_add(NFS_SERVER(inode)->io_stats->fscache[stat], addend);
 }
 #endif
 

-- 

^ permalink raw reply	[flat|nested] 57+ messages in thread

* [this_cpu_xx V5 05/19] use this_cpu ops for network statistics
  2009-10-06 23:36 [this_cpu_xx V5 00/19] Introduce per cpu atomic operations and avoid per cpu address arithmetic cl
                   ` (3 preceding siblings ...)
  2009-10-06 23:36 ` [this_cpu_xx V5 04/19] Use this_cpu operations for NFS statistics cl
@ 2009-10-06 23:36 ` cl
  2009-10-06 23:37 ` [this_cpu_xx V5 06/19] this_cpu_ptr: Straight transformations cl
                   ` (12 subsequent siblings)
  17 siblings, 0 replies; 57+ messages in thread
From: cl @ 2009-10-06 23:36 UTC (permalink / raw)
  To: akpm; +Cc: linux-kernel, Tejun Heo, David Miller, Mel Gorman, Pekka Enberg

[-- Attachment #1: this_cpu_net --]
[-- Type: text/plain, Size: 1760 bytes --]

Acked-by: Tejun Heo <tj@kernel.org>
Acked-by: David Miller <davem@davemloft.net>
Signed-off-by: Christoph Lameter <cl@linux-foundation.org>

---
 include/net/neighbour.h              |    7 +------
 include/net/netfilter/nf_conntrack.h |    4 ++--
 2 files changed, 3 insertions(+), 8 deletions(-)

Index: linux-2.6/include/net/neighbour.h
===================================================================
--- linux-2.6.orig/include/net/neighbour.h	2009-09-30 18:32:31.000000000 -0500
+++ linux-2.6/include/net/neighbour.h	2009-09-30 18:32:55.000000000 -0500
@@ -90,12 +90,7 @@ struct neigh_statistics
 	unsigned long unres_discards;	/* number of unresolved drops */
 };
 
-#define NEIGH_CACHE_STAT_INC(tbl, field)				\
-	do {								\
-		preempt_disable();					\
-		(per_cpu_ptr((tbl)->stats, smp_processor_id())->field)++; \
-		preempt_enable();					\
-	} while (0)
+#define NEIGH_CACHE_STAT_INC(tbl, field) this_cpu_inc((tbl)->stats->field)
 
 struct neighbour
 {
Index: linux-2.6/include/net/netfilter/nf_conntrack.h
===================================================================
--- linux-2.6.orig/include/net/netfilter/nf_conntrack.h	2009-09-30 18:32:57.000000000 -0500
+++ linux-2.6/include/net/netfilter/nf_conntrack.h	2009-09-30 18:34:13.000000000 -0500
@@ -295,11 +295,11 @@ extern unsigned int nf_conntrack_htable_
 extern unsigned int nf_conntrack_max;
 
 #define NF_CT_STAT_INC(net, count)	\
-	(per_cpu_ptr((net)->ct.stat, raw_smp_processor_id())->count++)
+	__this_cpu_inc((net)->ct.stat->count)
 #define NF_CT_STAT_INC_ATOMIC(net, count)		\
 do {							\
 	local_bh_disable();				\
-	per_cpu_ptr((net)->ct.stat, raw_smp_processor_id())->count++;	\
+	__this_cpu_inc((net)->ct.stat->count);		\
 	local_bh_enable();				\
 } while (0)
 

-- 

^ permalink raw reply	[flat|nested] 57+ messages in thread

* [this_cpu_xx V5 06/19] this_cpu_ptr: Straight transformations
  2009-10-06 23:36 [this_cpu_xx V5 00/19] Introduce per cpu atomic operations and avoid per cpu address arithmetic cl
                   ` (4 preceding siblings ...)
  2009-10-06 23:36 ` [this_cpu_xx V5 05/19] use this_cpu ops for network statistics cl
@ 2009-10-06 23:37 ` cl
  2009-10-06 23:37 ` [this_cpu_xx V5 07/19] this_cpu_ptr: Eliminate get/put_cpu cl
                   ` (11 subsequent siblings)
  17 siblings, 0 replies; 57+ messages in thread
From: cl @ 2009-10-06 23:37 UTC (permalink / raw)
  To: akpm
  Cc: linux-kernel, David Howells, Tejun Heo, Ingo Molnar,
	Rusty Russell, Eric Dumazet, Mel Gorman, Pekka Enberg

[-- Attachment #1: this_cpu_ptr_straight_transforms --]
[-- Type: text/plain, Size: 3566 bytes --]

Use this_cpu_ptr and __this_cpu_ptr in locations where straight
transformations are possible because per_cpu_ptr is used with
either smp_processor_id() or raw_smp_processor_id().

cc: David Howells <dhowells@redhat.com>
Acked-by: Tejun Heo <tj@kernel.org>
cc: Ingo Molnar <mingo@elte.hu>
cc: Rusty Russell <rusty@rustcorp.com.au>
cc: Eric Dumazet <dada1@cosmosbay.com>
Signed-off-by: Christoph Lameter <cl@linux-foundation.org>

---
 drivers/infiniband/hw/ehca/ehca_irq.c |    3 +--
 drivers/net/chelsio/sge.c             |    5 ++---
 drivers/net/loopback.c                |    2 +-
 fs/ext4/mballoc.c                     |    2 +-
 4 files changed, 5 insertions(+), 7 deletions(-)

Index: linux-2.6/drivers/net/chelsio/sge.c
===================================================================
--- linux-2.6.orig/drivers/net/chelsio/sge.c	2009-09-29 09:31:40.000000000 -0500
+++ linux-2.6/drivers/net/chelsio/sge.c	2009-09-29 11:39:20.000000000 -0500
@@ -1378,7 +1378,7 @@ static void sge_rx(struct sge *sge, stru
 	}
 	__skb_pull(skb, sizeof(*p));
 
-	st = per_cpu_ptr(sge->port_stats[p->iff], smp_processor_id());
+	st = this_cpu_ptr(sge->port_stats[p->iff]);
 
 	skb->protocol = eth_type_trans(skb, adapter->port[p->iff].dev);
 	if ((adapter->flags & RX_CSUM_ENABLED) && p->csum == 0xffff &&
@@ -1780,8 +1780,7 @@ netdev_tx_t t1_start_xmit(struct sk_buff
 {
 	struct adapter *adapter = dev->ml_priv;
 	struct sge *sge = adapter->sge;
-	struct sge_port_stats *st = per_cpu_ptr(sge->port_stats[dev->if_port],
-						smp_processor_id());
+	struct sge_port_stats *st = this_cpu_ptr(sge->port_stats[dev->if_port]);
 	struct cpl_tx_pkt *cpl;
 	struct sk_buff *orig_skb = skb;
 	int ret;
Index: linux-2.6/drivers/net/loopback.c
===================================================================
--- linux-2.6.orig/drivers/net/loopback.c	2009-09-29 09:31:40.000000000 -0500
+++ linux-2.6/drivers/net/loopback.c	2009-09-29 11:39:20.000000000 -0500
@@ -81,7 +81,7 @@ static netdev_tx_t loopback_xmit(struct 
 
 	/* it's OK to use per_cpu_ptr() because BHs are off */
 	pcpu_lstats = dev->ml_priv;
-	lb_stats = per_cpu_ptr(pcpu_lstats, smp_processor_id());
+	lb_stats = this_cpu_ptr(pcpu_lstats);
 
 	len = skb->len;
 	if (likely(netif_rx(skb) == NET_RX_SUCCESS)) {
Index: linux-2.6/fs/ext4/mballoc.c
===================================================================
--- linux-2.6.orig/fs/ext4/mballoc.c	2009-09-29 09:31:40.000000000 -0500
+++ linux-2.6/fs/ext4/mballoc.c	2009-09-29 11:39:20.000000000 -0500
@@ -4210,7 +4210,7 @@ static void ext4_mb_group_or_file(struct
 	 * per cpu locality group is to reduce the contention between block
 	 * request from multiple CPUs.
 	 */
-	ac->ac_lg = per_cpu_ptr(sbi->s_locality_groups, raw_smp_processor_id());
+	ac->ac_lg = __this_cpu_ptr(sbi->s_locality_groups);
 
 	/* we're going to use group allocation */
 	ac->ac_flags |= EXT4_MB_HINT_GROUP_ALLOC;
Index: linux-2.6/drivers/infiniband/hw/ehca/ehca_irq.c
===================================================================
--- linux-2.6.orig/drivers/infiniband/hw/ehca/ehca_irq.c	2009-09-29 09:31:40.000000000 -0500
+++ linux-2.6/drivers/infiniband/hw/ehca/ehca_irq.c	2009-09-29 11:39:20.000000000 -0500
@@ -826,8 +826,7 @@ static void __cpuinit take_over_work(str
 		cq = list_entry(cct->cq_list.next, struct ehca_cq, entry);
 
 		list_del(&cq->entry);
-		__queue_comp_task(cq, per_cpu_ptr(pool->cpu_comp_tasks,
-						  smp_processor_id()));
+		__queue_comp_task(cq, this_cpu_ptr(pool->cpu_comp_tasks));
 	}
 
 	spin_unlock_irqrestore(&cct->task_lock, flags_cct);

-- 

^ permalink raw reply	[flat|nested] 57+ messages in thread

* [this_cpu_xx V5 07/19] this_cpu_ptr: Eliminate get/put_cpu
  2009-10-06 23:36 [this_cpu_xx V5 00/19] Introduce per cpu atomic operations and avoid per cpu address arithmetic cl
                   ` (5 preceding siblings ...)
  2009-10-06 23:37 ` [this_cpu_xx V5 06/19] this_cpu_ptr: Straight transformations cl
@ 2009-10-06 23:37 ` cl
  2009-10-06 23:37 ` [this_cpu_xx V5 09/19] Use this_cpu_ptr in crypto subsystem cl
                   ` (10 subsequent siblings)
  17 siblings, 0 replies; 57+ messages in thread
From: cl @ 2009-10-06 23:37 UTC (permalink / raw)
  To: akpm
  Cc: linux-kernel, Maciej Sosnowski, Dan Williams, Tejun Heo,
	Eric Biederman, Stephen Hemminger, David L Stevens, Mel Gorman,
	Pekka Enberg

[-- Attachment #1: this_cpu_ptr_eliminate_get_put_cpu --]
[-- Type: text/plain, Size: 4488 bytes --]

There are cases where we can use this_cpu_ptr, and as a result of
using this_cpu_ptr() we no longer need to determine the number of the
currently executing cpu.

In those places no get/put_cpu combination is needed anymore.
The local cpu variable can be eliminated.

Preemption still needs to be disabled and enabled since the
modifications of the per cpu variables are not atomic. There may
be multiple per cpu variables modified and those must all
be modified from the same processor.

Acked-by: Maciej Sosnowski <maciej.sosnowski@intel.com>
Acked-by: Dan Williams <dan.j.williams@intel.com>
Acked-by: Tejun Heo <tj@kernel.org>
cc: Eric Biederman <ebiederm@aristanetworks.com>
cc: Stephen Hemminger <shemminger@vyatta.com>
cc: David L Stevens <dlstevens@us.ibm.com>
Signed-off-by: Christoph Lameter <cl@linux-foundation.org>

---
 drivers/dma/dmaengine.c |   36 +++++++++++++-----------------------
 drivers/net/veth.c      |    7 +++----
 2 files changed, 16 insertions(+), 27 deletions(-)

Index: linux-2.6/drivers/dma/dmaengine.c
===================================================================
--- linux-2.6.orig/drivers/dma/dmaengine.c	2009-09-28 10:08:09.000000000 -0500
+++ linux-2.6/drivers/dma/dmaengine.c	2009-09-29 09:01:54.000000000 -0500
@@ -326,14 +326,7 @@ arch_initcall(dma_channel_table_init);
  */
 struct dma_chan *dma_find_channel(enum dma_transaction_type tx_type)
 {
-	struct dma_chan *chan;
-	int cpu;
-
-	cpu = get_cpu();
-	chan = per_cpu_ptr(channel_table[tx_type], cpu)->chan;
-	put_cpu();
-
-	return chan;
+	return this_cpu_read(channel_table[tx_type]->chan);
 }
 EXPORT_SYMBOL(dma_find_channel);
 
@@ -847,7 +840,6 @@ dma_async_memcpy_buf_to_buf(struct dma_c
 	struct dma_async_tx_descriptor *tx;
 	dma_addr_t dma_dest, dma_src;
 	dma_cookie_t cookie;
-	int cpu;
 	unsigned long flags;
 
 	dma_src = dma_map_single(dev->dev, src, len, DMA_TO_DEVICE);
@@ -866,10 +858,10 @@ dma_async_memcpy_buf_to_buf(struct dma_c
 	tx->callback = NULL;
 	cookie = tx->tx_submit(tx);
 
-	cpu = get_cpu();
-	per_cpu_ptr(chan->local, cpu)->bytes_transferred += len;
-	per_cpu_ptr(chan->local, cpu)->memcpy_count++;
-	put_cpu();
+	preempt_disable();
+	__this_cpu_add(chan->local->bytes_transferred, len);
+	__this_cpu_inc(chan->local->memcpy_count);
+	preempt_enable();
 
 	return cookie;
 }
@@ -896,7 +888,6 @@ dma_async_memcpy_buf_to_pg(struct dma_ch
 	struct dma_async_tx_descriptor *tx;
 	dma_addr_t dma_dest, dma_src;
 	dma_cookie_t cookie;
-	int cpu;
 	unsigned long flags;
 
 	dma_src = dma_map_single(dev->dev, kdata, len, DMA_TO_DEVICE);
@@ -913,10 +904,10 @@ dma_async_memcpy_buf_to_pg(struct dma_ch
 	tx->callback = NULL;
 	cookie = tx->tx_submit(tx);
 
-	cpu = get_cpu();
-	per_cpu_ptr(chan->local, cpu)->bytes_transferred += len;
-	per_cpu_ptr(chan->local, cpu)->memcpy_count++;
-	put_cpu();
+	preempt_disable();
+	__this_cpu_add(chan->local->bytes_transferred, len);
+	__this_cpu_inc(chan->local->memcpy_count);
+	preempt_enable();
 
 	return cookie;
 }
@@ -945,7 +936,6 @@ dma_async_memcpy_pg_to_pg(struct dma_cha
 	struct dma_async_tx_descriptor *tx;
 	dma_addr_t dma_dest, dma_src;
 	dma_cookie_t cookie;
-	int cpu;
 	unsigned long flags;
 
 	dma_src = dma_map_page(dev->dev, src_pg, src_off, len, DMA_TO_DEVICE);
@@ -963,10 +953,10 @@ dma_async_memcpy_pg_to_pg(struct dma_cha
 	tx->callback = NULL;
 	cookie = tx->tx_submit(tx);
 
-	cpu = get_cpu();
-	per_cpu_ptr(chan->local, cpu)->bytes_transferred += len;
-	per_cpu_ptr(chan->local, cpu)->memcpy_count++;
-	put_cpu();
+	preempt_disable();
+	__this_cpu_add(chan->local->bytes_transferred, len);
+	__this_cpu_inc(chan->local->memcpy_count);
+	preempt_enable();
 
 	return cookie;
 }
Index: linux-2.6/drivers/net/veth.c
===================================================================
--- linux-2.6.orig/drivers/net/veth.c	2009-09-17 17:54:16.000000000 -0500
+++ linux-2.6/drivers/net/veth.c	2009-09-29 09:01:54.000000000 -0500
@@ -153,7 +153,7 @@ static netdev_tx_t veth_xmit(struct sk_b
 	struct net_device *rcv = NULL;
 	struct veth_priv *priv, *rcv_priv;
 	struct veth_net_stats *stats, *rcv_stats;
-	int length, cpu;
+	int length;
 
 	skb_orphan(skb);
 
@@ -161,9 +161,8 @@ static netdev_tx_t veth_xmit(struct sk_b
 	rcv = priv->peer;
 	rcv_priv = netdev_priv(rcv);
 
-	cpu = smp_processor_id();
-	stats = per_cpu_ptr(priv->stats, cpu);
-	rcv_stats = per_cpu_ptr(rcv_priv->stats, cpu);
+	stats = this_cpu_ptr(priv->stats);
+	rcv_stats = this_cpu_ptr(rcv_priv->stats);
 
 	if (!(rcv->flags & IFF_UP))
 		goto tx_drop;

-- 

^ permalink raw reply	[flat|nested] 57+ messages in thread

* [this_cpu_xx V5 09/19] Use this_cpu_ptr in crypto subsystem
  2009-10-06 23:36 [this_cpu_xx V5 00/19] Introduce per cpu atomic operations and avoid per cpu address arithmetic cl
                   ` (6 preceding siblings ...)
  2009-10-06 23:37 ` [this_cpu_xx V5 07/19] this_cpu_ptr: Eliminate get/put_cpu cl
@ 2009-10-06 23:37 ` cl
  2009-10-06 23:37 ` [this_cpu_xx V5 10/19] Use this_cpu ops for VM statistics cl
                   ` (9 subsequent siblings)
  17 siblings, 0 replies; 57+ messages in thread
From: cl @ 2009-10-06 23:37 UTC (permalink / raw)
  To: akpm; +Cc: linux-kernel, Tejun Heo, Huang Ying, Mel Gorman, Pekka Enberg

[-- Attachment #1: this_cpu_ptr_crypto --]
[-- Type: text/plain, Size: 948 bytes --]

Just a slight optimization that removes one array lookup.
The processor number is needed for other things as well so the
get/put_cpu cannot be removed.

Acked-by: Tejun Heo <tj@kernel.org>
Cc: Huang Ying <ying.huang@intel.com>
Signed-off-by: Christoph Lameter <cl@linux-foundation.org>

---
 crypto/cryptd.c |    2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

Index: linux-2.6/crypto/cryptd.c
===================================================================
--- linux-2.6.orig/crypto/cryptd.c	2009-09-14 08:47:15.000000000 -0500
+++ linux-2.6/crypto/cryptd.c	2009-09-15 13:47:11.000000000 -0500
@@ -99,7 +99,7 @@ static int cryptd_enqueue_request(struct
 	struct cryptd_cpu_queue *cpu_queue;
 
 	cpu = get_cpu();
-	cpu_queue = per_cpu_ptr(queue->cpu_queue, cpu);
+	cpu_queue = this_cpu_ptr(queue->cpu_queue);
 	err = crypto_enqueue_request(&cpu_queue->queue, request);
 	queue_work_on(cpu, kcrypto_wq, &cpu_queue->work);
 	put_cpu();

-- 

^ permalink raw reply	[flat|nested] 57+ messages in thread

* [this_cpu_xx V5 10/19] Use this_cpu ops for VM statistics.
  2009-10-06 23:36 [this_cpu_xx V5 00/19] Introduce per cpu atomic operations and avoid per cpu address arithmetic cl
                   ` (7 preceding siblings ...)
  2009-10-06 23:37 ` [this_cpu_xx V5 09/19] Use this_cpu_ptr in crypto subsystem cl
@ 2009-10-06 23:37 ` cl
  2009-10-06 23:37 ` [this_cpu_xx V5 11/19] RCU: Use this_cpu operations cl
                   ` (8 subsequent siblings)
  17 siblings, 0 replies; 57+ messages in thread
From: cl @ 2009-10-06 23:37 UTC (permalink / raw)
  To: akpm; +Cc: linux-kernel, Tejun Heo, Mel Gorman, Pekka Enberg

[-- Attachment #1: this_cpu_vmstats --]
[-- Type: text/plain, Size: 1525 bytes --]

Using per cpu atomics for the vm statistics reduces their overhead.
And in the case of x86 we are guaranteed that they will never race even
in the lax form used for vm statistics.

Acked-by: Tejun Heo <tj@kernel.org>
Signed-off-by: Christoph Lameter <cl@linux-foundation.org>

---
 include/linux/vmstat.h |   10 ++++------
 1 file changed, 4 insertions(+), 6 deletions(-)

Index: linux-2.6/include/linux/vmstat.h
===================================================================
--- linux-2.6.orig/include/linux/vmstat.h	2009-08-14 10:07:18.000000000 -0500
+++ linux-2.6/include/linux/vmstat.h	2009-09-01 14:59:22.000000000 -0500
@@ -76,24 +76,22 @@ DECLARE_PER_CPU(struct vm_event_state, v
 
 static inline void __count_vm_event(enum vm_event_item item)
 {
-	__get_cpu_var(vm_event_states).event[item]++;
+	__this_cpu_inc(per_cpu_var(vm_event_states).event[item]);
 }
 
 static inline void count_vm_event(enum vm_event_item item)
 {
-	get_cpu_var(vm_event_states).event[item]++;
-	put_cpu();
+	this_cpu_inc(per_cpu_var(vm_event_states).event[item]);
 }
 
 static inline void __count_vm_events(enum vm_event_item item, long delta)
 {
-	__get_cpu_var(vm_event_states).event[item] += delta;
+	__this_cpu_add(per_cpu_var(vm_event_states).event[item], delta);
 }
 
 static inline void count_vm_events(enum vm_event_item item, long delta)
 {
-	get_cpu_var(vm_event_states).event[item] += delta;
-	put_cpu();
+	this_cpu_add(per_cpu_var(vm_event_states).event[item], delta);
 }
 
 extern void all_vm_events(unsigned long *);

-- 

^ permalink raw reply	[flat|nested] 57+ messages in thread

* [this_cpu_xx V5 11/19] RCU: Use this_cpu operations
  2009-10-06 23:36 [this_cpu_xx V5 00/19] Introduce per cpu atomic operations and avoid per cpu address arithmetic cl
                   ` (8 preceding siblings ...)
  2009-10-06 23:37 ` [this_cpu_xx V5 10/19] Use this_cpu ops for VM statistics cl
@ 2009-10-06 23:37 ` cl
  2009-10-06 23:37 ` [this_cpu_xx V5 12/19] this_cpu_ops: page allocator conversion cl
                   ` (7 subsequent siblings)
  17 siblings, 0 replies; 57+ messages in thread
From: cl @ 2009-10-06 23:37 UTC (permalink / raw)
  To: akpm; +Cc: linux-kernel, Tejun Heo, Paul E. McKenney, Mel Gorman,
	Pekka Enberg

[-- Attachment #1: this_cpu_rcu --]
[-- Type: text/plain, Size: 1916 bytes --]

RCU does not do dynamic allocations but it increments per cpu variables
a lot. These operations result in a move to a register and then back
to memory. This patch makes it use the inc/dec instructions on x86
that do not need a register.

Acked-by: Tejun Heo <tj@kernel.org>
Acked-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
Signed-off-by: Christoph Lameter <cl@linux-foundation.org>

---
 kernel/rcutorture.c |    8 ++++----
 1 file changed, 4 insertions(+), 4 deletions(-)

Index: linux-2.6/kernel/rcutorture.c
===================================================================
--- linux-2.6.orig/kernel/rcutorture.c	2009-09-28 10:08:10.000000000 -0500
+++ linux-2.6/kernel/rcutorture.c	2009-09-29 09:02:00.000000000 -0500
@@ -731,13 +731,13 @@ static void rcu_torture_timer(unsigned l
 		/* Should not happen, but... */
 		pipe_count = RCU_TORTURE_PIPE_LEN;
 	}
-	++__get_cpu_var(rcu_torture_count)[pipe_count];
+	__this_cpu_inc(per_cpu_var(rcu_torture_count)[pipe_count]);
 	completed = cur_ops->completed() - completed;
 	if (completed > RCU_TORTURE_PIPE_LEN) {
 		/* Should not happen, but... */
 		completed = RCU_TORTURE_PIPE_LEN;
 	}
-	++__get_cpu_var(rcu_torture_batch)[completed];
+	__this_cpu_inc(per_cpu_var(rcu_torture_batch)[completed]);
 	preempt_enable();
 	cur_ops->readunlock(idx);
 }
@@ -786,13 +786,13 @@ rcu_torture_reader(void *arg)
 			/* Should not happen, but... */
 			pipe_count = RCU_TORTURE_PIPE_LEN;
 		}
-		++__get_cpu_var(rcu_torture_count)[pipe_count];
+		__this_cpu_inc(per_cpu_var(rcu_torture_count)[pipe_count]);
 		completed = cur_ops->completed() - completed;
 		if (completed > RCU_TORTURE_PIPE_LEN) {
 			/* Should not happen, but... */
 			completed = RCU_TORTURE_PIPE_LEN;
 		}
-		++__get_cpu_var(rcu_torture_batch)[completed];
+		__this_cpu_inc(per_cpu_var(rcu_torture_batch)[completed]);
 		preempt_enable();
 		cur_ops->readunlock(idx);
 		schedule();

-- 

^ permalink raw reply	[flat|nested] 57+ messages in thread

* [this_cpu_xx V5 12/19] this_cpu_ops: page allocator conversion
  2009-10-06 23:36 [this_cpu_xx V5 00/19] Introduce per cpu atomic operations and avoid per cpu address arithmetic cl
                   ` (9 preceding siblings ...)
  2009-10-06 23:37 ` [this_cpu_xx V5 11/19] RCU: Use this_cpu operations cl
@ 2009-10-06 23:37 ` cl
  2009-10-06 23:37 ` [this_cpu_xx V5 13/19] this_cpu ops: Remove pageset_notifier cl
                   ` (6 subsequent siblings)
  17 siblings, 0 replies; 57+ messages in thread
From: cl @ 2009-10-06 23:37 UTC (permalink / raw)
  To: akpm; +Cc: linux-kernel, Mel Gorman, Tejun Heo, Pekka Enberg

[-- Attachment #1: this_cpu_page_allocator --]
[-- Type: text/plain, Size: 14169 bytes --]

Use the per cpu allocator functionality to avoid per cpu arrays in struct zone.

This drastically reduces the size of struct zone for systems with large
numbers of processors and allows placement of critical variables of struct
zone in one cacheline even on very large systems.

Another effect is that the pagesets of one processor are placed near one
another. If multiple pagesets from different zones fit into one cacheline
then additional cacheline fetches can be avoided on the hot paths when
allocating memory from multiple zones.

Bootstrap becomes simpler if we use the same scheme for UP, SMP, NUMA. #ifdefs
are reduced and we can drop the zone_pcp macro.

Hotplug handling is also simplified since cpu alloc can bring up and
shut down cpu areas for a specific cpu as a whole. So there is no need to
allocate or free individual pagesets.
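
As a sketch of the resulting access pattern (illustrative snippet, assuming
a zone pointer and a cpu number in scope; the field name matches the diff
below):

	/* Instance for a specific cpu, e.g. in stats or hotplug code */
	struct per_cpu_pageset *pset = per_cpu_ptr(zone->pageset, cpu);

	/* Instance for the current cpu on the hot path (interrupts already
	 * disabled by the caller, as in buffered_rmqueue() below) */
	struct per_cpu_pages *pcp = &this_cpu_ptr(zone->pageset)->pcp;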

V4-V5:
- Fix up cases where per_cpu_ptr is called before irq disable
- Integrate the bootstrap logic that was separate before.

Cc: Mel Gorman <mel@csn.ul.ie>
Signed-off-by: Christoph Lameter <cl@linux-foundation.org>

---
 include/linux/mm.h     |    4 -
 include/linux/mmzone.h |   12 ---
 mm/page_alloc.c        |  187 ++++++++++++++++++-------------------------------
 mm/vmstat.c            |   14 ++-
 4 files changed, 81 insertions(+), 136 deletions(-)

Index: linux-2.6/include/linux/mm.h
===================================================================
--- linux-2.6.orig/include/linux/mm.h	2009-10-06 18:08:05.000000000 -0500
+++ linux-2.6/include/linux/mm.h	2009-10-06 18:12:28.000000000 -0500
@@ -1061,11 +1061,7 @@ extern void si_meminfo(struct sysinfo * 
 extern void si_meminfo_node(struct sysinfo *val, int nid);
 extern int after_bootmem;
 
-#ifdef CONFIG_NUMA
 extern void setup_per_cpu_pageset(void);
-#else
-static inline void setup_per_cpu_pageset(void) {}
-#endif
 
 extern void zone_pcp_update(struct zone *zone);
 
Index: linux-2.6/include/linux/mmzone.h
===================================================================
--- linux-2.6.orig/include/linux/mmzone.h	2009-10-06 18:08:05.000000000 -0500
+++ linux-2.6/include/linux/mmzone.h	2009-10-06 18:12:28.000000000 -0500
@@ -184,13 +184,7 @@ struct per_cpu_pageset {
 	s8 stat_threshold;
 	s8 vm_stat_diff[NR_VM_ZONE_STAT_ITEMS];
 #endif
-} ____cacheline_aligned_in_smp;
-
-#ifdef CONFIG_NUMA
-#define zone_pcp(__z, __cpu) ((__z)->pageset[(__cpu)])
-#else
-#define zone_pcp(__z, __cpu) (&(__z)->pageset[(__cpu)])
-#endif
+};
 
 #endif /* !__GENERATING_BOUNDS.H */
 
@@ -306,10 +300,8 @@ struct zone {
 	 */
 	unsigned long		min_unmapped_pages;
 	unsigned long		min_slab_pages;
-	struct per_cpu_pageset	*pageset[NR_CPUS];
-#else
-	struct per_cpu_pageset	pageset[NR_CPUS];
 #endif
+	struct per_cpu_pageset	*pageset;
 	/*
 	 * free areas of different sizes
 	 */
Index: linux-2.6/mm/page_alloc.c
===================================================================
--- linux-2.6.orig/mm/page_alloc.c	2009-10-06 18:08:05.000000000 -0500
+++ linux-2.6/mm/page_alloc.c	2009-10-06 18:17:53.000000000 -0500
@@ -1011,10 +1011,10 @@ static void drain_pages(unsigned int cpu
 		struct per_cpu_pageset *pset;
 		struct per_cpu_pages *pcp;
 
-		pset = zone_pcp(zone, cpu);
+		local_irq_save(flags);
+		pset = per_cpu_ptr(zone->pageset, cpu);
 
 		pcp = &pset->pcp;
-		local_irq_save(flags);
 		free_pcppages_bulk(zone, pcp->count, pcp);
 		pcp->count = 0;
 		local_irq_restore(flags);
@@ -1098,7 +1098,6 @@ static void free_hot_cold_page(struct pa
 	arch_free_page(page, 0);
 	kernel_map_pages(page, 1, 0);
 
-	pcp = &zone_pcp(zone, get_cpu())->pcp;
 	migratetype = get_pageblock_migratetype(page);
 	set_page_private(page, migratetype);
 	local_irq_save(flags);
@@ -1121,6 +1120,7 @@ static void free_hot_cold_page(struct pa
 		migratetype = MIGRATE_MOVABLE;
 	}
 
+	pcp = &this_cpu_ptr(zone->pageset)->pcp;
 	if (cold)
 		list_add_tail(&page->lru, &pcp->lists[migratetype]);
 	else
@@ -1133,7 +1133,6 @@ static void free_hot_cold_page(struct pa
 
 out:
 	local_irq_restore(flags);
-	put_cpu();
 }
 
 void free_hot_page(struct page *page)
@@ -1183,17 +1182,15 @@ struct page *buffered_rmqueue(struct zon
 	unsigned long flags;
 	struct page *page;
 	int cold = !!(gfp_flags & __GFP_COLD);
-	int cpu;
 
 again:
-	cpu  = get_cpu();
 	if (likely(order == 0)) {
 		struct per_cpu_pages *pcp;
 		struct list_head *list;
 
-		pcp = &zone_pcp(zone, cpu)->pcp;
-		list = &pcp->lists[migratetype];
 		local_irq_save(flags);
+		pcp = &this_cpu_ptr(zone->pageset)->pcp;
+		list = &pcp->lists[migratetype];
 		if (list_empty(list)) {
 			pcp->count += rmqueue_bulk(zone, 0,
 					pcp->batch, list,
@@ -1234,7 +1231,6 @@ again:
 	__count_zone_vm_events(PGALLOC, zone, 1 << order);
 	zone_statistics(preferred_zone, zone);
 	local_irq_restore(flags);
-	put_cpu();
 
 	VM_BUG_ON(bad_range(zone, page));
 	if (prep_new_page(page, order, gfp_flags))
@@ -1243,7 +1239,6 @@ again:
 
 failed:
 	local_irq_restore(flags);
-	put_cpu();
 	return NULL;
 }
 
@@ -2172,7 +2167,7 @@ void show_free_areas(void)
 		for_each_online_cpu(cpu) {
 			struct per_cpu_pageset *pageset;
 
-			pageset = zone_pcp(zone, cpu);
+			pageset = per_cpu_ptr(zone->pageset, cpu);
 
 			printk("CPU %4d: hi:%5d, btch:%4d usd:%4d\n",
 			       cpu, pageset->pcp.high,
@@ -2735,10 +2730,29 @@ static void build_zonelist_cache(pg_data
 
 #endif	/* CONFIG_NUMA */
 
+/*
+ * Boot pageset table. One per cpu which is going to be used for all
+ * zones and all nodes. The parameters will be set in such a way
+ * that an item put on a list will immediately be handed over to
+ * the buddy list. This is safe since pageset manipulation is done
+ * with interrupts disabled.
+ *
+ * The boot_pagesets must be kept even after bootup is complete for
+ * unused processors and/or zones. They do play a role for bootstrapping
+ * hotplugged processors.
+ *
+ * zoneinfo_show() and maybe other functions do
+ * not check if the processor is online before following the pageset pointer.
+ * Other parts of the kernel may not check if the zone is available.
+ */
+static void setup_pageset(struct per_cpu_pageset *p, unsigned long batch);
+static DEFINE_PER_CPU(struct per_cpu_pageset, boot_pageset);
+
 /* return values int ....just for stop_machine() */
 static int __build_all_zonelists(void *dummy)
 {
 	int nid;
+	int cpu;
 
 #ifdef CONFIG_NUMA
 	memset(node_load, 0, sizeof(node_load));
@@ -2749,6 +2763,14 @@ static int __build_all_zonelists(void *d
 		build_zonelists(pgdat);
 		build_zonelist_cache(pgdat);
 	}
+
+	/*
+	 * Initialize the boot_pagesets that are going to be used
+	 * for bootstrapping processors.
+	 */
+	for_each_possible_cpu(cpu)
+		setup_pageset(&per_cpu(boot_pageset, cpu), 0);
+
 	return 0;
 }
 
@@ -3087,120 +3109,60 @@ static void setup_pagelist_highmark(stru
 }
 
 
-#ifdef CONFIG_NUMA
-/*
- * Boot pageset table. One per cpu which is going to be used for all
- * zones and all nodes. The parameters will be set in such a way
- * that an item put on a list will immediately be handed over to
- * the buddy list. This is safe since pageset manipulation is done
- * with interrupts disabled.
- *
- * Some NUMA counter updates may also be caught by the boot pagesets.
- *
- * The boot_pagesets must be kept even after bootup is complete for
- * unused processors and/or zones. They do play a role for bootstrapping
- * hotplugged processors.
- *
- * zoneinfo_show() and maybe other functions do
- * not check if the processor is online before following the pageset pointer.
- * Other parts of the kernel may not check if the zone is available.
- */
-static struct per_cpu_pageset boot_pageset[NR_CPUS];
-
-/*
- * Dynamically allocate memory for the
- * per cpu pageset array in struct zone.
- */
-static int __cpuinit process_zones(int cpu)
-{
-	struct zone *zone, *dzone;
-	int node = cpu_to_node(cpu);
-
-	node_set_state(node, N_CPU);	/* this node has a cpu */
-
-	for_each_populated_zone(zone) {
-		zone_pcp(zone, cpu) = kmalloc_node(sizeof(struct per_cpu_pageset),
-					 GFP_KERNEL, node);
-		if (!zone_pcp(zone, cpu))
-			goto bad;
-
-		setup_pageset(zone_pcp(zone, cpu), zone_batchsize(zone));
-
-		if (percpu_pagelist_fraction)
-			setup_pagelist_highmark(zone_pcp(zone, cpu),
-			 	(zone->present_pages / percpu_pagelist_fraction));
-	}
-
-	return 0;
-bad:
-	for_each_zone(dzone) {
-		if (!populated_zone(dzone))
-			continue;
-		if (dzone == zone)
-			break;
-		kfree(zone_pcp(dzone, cpu));
-		zone_pcp(dzone, cpu) = &boot_pageset[cpu];
-	}
-	return -ENOMEM;
-}
-
-static inline void free_zone_pagesets(int cpu)
-{
-	struct zone *zone;
-
-	for_each_zone(zone) {
-		struct per_cpu_pageset *pset = zone_pcp(zone, cpu);
-
-		/* Free per_cpu_pageset if it is slab allocated */
-		if (pset != &boot_pageset[cpu])
-			kfree(pset);
-		zone_pcp(zone, cpu) = &boot_pageset[cpu];
-	}
-}
-
 static int __cpuinit pageset_cpuup_callback(struct notifier_block *nfb,
 		unsigned long action,
 		void *hcpu)
 {
 	int cpu = (long)hcpu;
-	int ret = NOTIFY_OK;
 
 	switch (action) {
 	case CPU_UP_PREPARE:
 	case CPU_UP_PREPARE_FROZEN:
-		if (process_zones(cpu))
-			ret = NOTIFY_BAD;
-		break;
-	case CPU_UP_CANCELED:
-	case CPU_UP_CANCELED_FROZEN:
-	case CPU_DEAD:
-	case CPU_DEAD_FROZEN:
-		free_zone_pagesets(cpu);
+		node_set_state(cpu_to_node(cpu), N_CPU);
 		break;
 	default:
 		break;
 	}
-	return ret;
+	return NOTIFY_OK;
 }
 
 static struct notifier_block __cpuinitdata pageset_notifier =
 	{ &pageset_cpuup_callback, NULL, 0 };
 
+/*
+ * Allocate per cpu pagesets and initialize them.
+ * Before this call only boot pagesets were available.
+ * Boot pagesets will no longer be used by this processor
+ * after setup_per_cpu_pageset().
+ */
 void __init setup_per_cpu_pageset(void)
 {
-	int err;
+	struct zone *zone;
+	int cpu;
+
+	for_each_populated_zone(zone) {
+		zone->pageset = alloc_percpu(struct per_cpu_pageset);
+
+		for_each_possible_cpu(cpu) {
+			struct per_cpu_pageset *pcp = per_cpu_ptr(zone->pageset, cpu);
+
+			setup_pageset(pcp, zone_batchsize(zone));
+
+			if (percpu_pagelist_fraction)
+				setup_pagelist_highmark(pcp,
+					(zone->present_pages /
+						percpu_pagelist_fraction));
+		}
+	}
 
-	/* Initialize per_cpu_pageset for cpu 0.
-	 * A cpuup callback will do this for every cpu
-	 * as it comes online
+	/*
+	 * The boot cpu is always the first active.
+	 * The boot node has a processor
 	 */
-	err = process_zones(smp_processor_id());
-	BUG_ON(err);
+	node_set_state(cpu_to_node(smp_processor_id()), N_CPU);
 	register_cpu_notifier(&pageset_notifier);
 }
 
-#endif
-
 static noinline __init_refok
 int zone_wait_table_init(struct zone *zone, unsigned long zone_size_pages)
 {
@@ -3254,7 +3216,7 @@ static int __zone_pcp_update(void *data)
 		struct per_cpu_pageset *pset;
 		struct per_cpu_pages *pcp;
 
-		pset = zone_pcp(zone, cpu);
+		pset = per_cpu_ptr(zone->pageset, cpu);
 		pcp = &pset->pcp;
 
 		local_irq_save(flags);
@@ -3272,21 +3234,13 @@ void zone_pcp_update(struct zone *zone)
 
 static __meminit void zone_pcp_init(struct zone *zone)
 {
-	int cpu;
-	unsigned long batch = zone_batchsize(zone);
+	/* Use boot pagesets until we have the per cpu allocator up */
+	zone->pageset = &per_cpu_var(boot_pageset);
 
-	for (cpu = 0; cpu < NR_CPUS; cpu++) {
-#ifdef CONFIG_NUMA
-		/* Early boot. Slab allocator not functional yet */
-		zone_pcp(zone, cpu) = &boot_pageset[cpu];
-		setup_pageset(&boot_pageset[cpu],0);
-#else
-		setup_pageset(zone_pcp(zone,cpu), batch);
-#endif
-	}
 	if (zone->present_pages)
-		printk(KERN_DEBUG "  %s zone: %lu pages, LIFO batch:%lu\n",
-			zone->name, zone->present_pages, batch);
+		printk(KERN_DEBUG "  %s zone: %lu pages, LIFO batch:%u\n",
+			zone->name, zone->present_pages,
+					 zone_batchsize(zone));
 }
 
 __meminit int init_currently_empty_zone(struct zone *zone,
@@ -4800,10 +4754,11 @@ int percpu_pagelist_fraction_sysctl_hand
 	if (!write || (ret == -EINVAL))
 		return ret;
 	for_each_populated_zone(zone) {
-		for_each_online_cpu(cpu) {
+		for_each_possible_cpu(cpu) {
 			unsigned long  high;
 			high = zone->present_pages / percpu_pagelist_fraction;
-			setup_pagelist_highmark(zone_pcp(zone, cpu), high);
+			setup_pagelist_highmark(
+				per_cpu_ptr(zone->pageset, cpu), high);
 		}
 	}
 	return 0;
Index: linux-2.6/mm/vmstat.c
===================================================================
--- linux-2.6.orig/mm/vmstat.c	2009-10-06 18:08:05.000000000 -0500
+++ linux-2.6/mm/vmstat.c	2009-10-06 18:12:28.000000000 -0500
@@ -139,7 +139,8 @@ static void refresh_zone_stat_thresholds
 		threshold = calculate_threshold(zone);
 
 		for_each_online_cpu(cpu)
-			zone_pcp(zone, cpu)->stat_threshold = threshold;
+			per_cpu_ptr(zone->pageset, cpu)->stat_threshold
+							= threshold;
 	}
 }
 
@@ -149,7 +150,8 @@ static void refresh_zone_stat_thresholds
 void __mod_zone_page_state(struct zone *zone, enum zone_stat_item item,
 				int delta)
 {
-	struct per_cpu_pageset *pcp = zone_pcp(zone, smp_processor_id());
+	struct per_cpu_pageset *pcp = this_cpu_ptr(zone->pageset);
+
 	s8 *p = pcp->vm_stat_diff + item;
 	long x;
 
@@ -202,7 +204,7 @@ EXPORT_SYMBOL(mod_zone_page_state);
  */
 void __inc_zone_state(struct zone *zone, enum zone_stat_item item)
 {
-	struct per_cpu_pageset *pcp = zone_pcp(zone, smp_processor_id());
+	struct per_cpu_pageset *pcp = this_cpu_ptr(zone->pageset);
 	s8 *p = pcp->vm_stat_diff + item;
 
 	(*p)++;
@@ -223,7 +225,7 @@ EXPORT_SYMBOL(__inc_zone_page_state);
 
 void __dec_zone_state(struct zone *zone, enum zone_stat_item item)
 {
-	struct per_cpu_pageset *pcp = zone_pcp(zone, smp_processor_id());
+	struct per_cpu_pageset *pcp = this_cpu_ptr(zone->pageset);
 	s8 *p = pcp->vm_stat_diff + item;
 
 	(*p)--;
@@ -300,7 +302,7 @@ void refresh_cpu_vm_stats(int cpu)
 	for_each_populated_zone(zone) {
 		struct per_cpu_pageset *p;
 
-		p = zone_pcp(zone, cpu);
+		p = per_cpu_ptr(zone->pageset, cpu);
 
 		for (i = 0; i < NR_VM_ZONE_STAT_ITEMS; i++)
 			if (p->vm_stat_diff[i]) {
@@ -738,7 +740,7 @@ static void zoneinfo_show_print(struct s
 	for_each_online_cpu(i) {
 		struct per_cpu_pageset *pageset;
 
-		pageset = zone_pcp(zone, i);
+		pageset = per_cpu_ptr(zone->pageset, i);
 		seq_printf(m,
 			   "\n    cpu: %i"
 			   "\n              count: %i"

-- 

^ permalink raw reply	[flat|nested] 57+ messages in thread

* [this_cpu_xx V5 13/19] this_cpu ops: Remove pageset_notifier
  2009-10-06 23:36 [this_cpu_xx V5 00/19] Introduce per cpu atomic operations and avoid per cpu address arithmetic cl
                   ` (10 preceding siblings ...)
  2009-10-06 23:37 ` [this_cpu_xx V5 12/19] this_cpu_ops: page allocator conversion cl
@ 2009-10-06 23:37 ` cl
  2009-10-06 23:37 ` [this_cpu_xx V5 14/19] Use this_cpu operations in slub cl
                   ` (5 subsequent siblings)
  17 siblings, 0 replies; 57+ messages in thread
From: cl @ 2009-10-06 23:37 UTC (permalink / raw)
  To: akpm; +Cc: linux-kernel, Tejun Heo, Mel Gorman, Pekka Enberg

[-- Attachment #1: this_cpu_remove_pageset_notifier --]
[-- Type: text/plain, Size: 2015 bytes --]

Remove the pageset notifier since it only marks that a processor
exists on a specific node. Move that code into the vmstat notifier.

Signed-off-by: Christoph Lameter <cl@linux-foundation.org>

---
 mm/page_alloc.c |   28 ----------------------------
 mm/vmstat.c     |    1 +
 2 files changed, 1 insertion(+), 28 deletions(-)

Index: linux-2.6/mm/vmstat.c
===================================================================
--- linux-2.6.orig/mm/vmstat.c	2009-10-06 18:19:17.000000000 -0500
+++ linux-2.6/mm/vmstat.c	2009-10-06 18:19:20.000000000 -0500
@@ -906,6 +906,7 @@ static int __cpuinit vmstat_cpuup_callba
 	case CPU_ONLINE:
 	case CPU_ONLINE_FROZEN:
 		start_cpu_timer(cpu);
+		node_set_state(cpu_to_node(cpu), N_CPU);
 		break;
 	case CPU_DOWN_PREPARE:
 	case CPU_DOWN_PREPARE_FROZEN:
Index: linux-2.6/mm/page_alloc.c
===================================================================
--- linux-2.6.orig/mm/page_alloc.c	2009-10-06 18:19:17.000000000 -0500
+++ linux-2.6/mm/page_alloc.c	2009-10-06 18:19:21.000000000 -0500
@@ -3108,27 +3108,6 @@ static void setup_pagelist_highmark(stru
 		pcp->batch = PAGE_SHIFT * 8;
 }
 
-
-static int __cpuinit pageset_cpuup_callback(struct notifier_block *nfb,
-		unsigned long action,
-		void *hcpu)
-{
-	int cpu = (long)hcpu;
-
-	switch (action) {
-	case CPU_UP_PREPARE:
-	case CPU_UP_PREPARE_FROZEN:
-		node_set_state(cpu_to_node(cpu), N_CPU);
-		break;
-	default:
-		break;
-	}
-	return NOTIFY_OK;
-}
-
-static struct notifier_block __cpuinitdata pageset_notifier =
-	{ &pageset_cpuup_callback, NULL, 0 };
-
 /*
  * Allocate per cpu pagesets and initialize them.
  * Before this call only boot pagesets were available.
@@ -3154,13 +3133,6 @@ void __init setup_per_cpu_pageset(void)
 						percpu_pagelist_fraction));
 		}
 	}
-
-	/*
-	 * The boot cpu is always the first active.
-	 * The boot node has a processor
-	 */
-	node_set_state(cpu_to_node(smp_processor_id()), N_CPU);
-	register_cpu_notifier(&pageset_notifier);
 }
 
 static noinline __init_refok

-- 

^ permalink raw reply	[flat|nested] 57+ messages in thread

* [this_cpu_xx V5 14/19] Use this_cpu operations in slub
  2009-10-06 23:36 [this_cpu_xx V5 00/19] Introduce per cpu atomic operations and avoid per cpu address arithmetic cl
                   ` (11 preceding siblings ...)
  2009-10-06 23:37 ` [this_cpu_xx V5 13/19] this_cpu ops: Remove pageset_notifier cl
@ 2009-10-06 23:37 ` cl
  2009-10-06 23:37 ` [this_cpu_xx V5 15/19] SLUB: Get rid of dynamic DMA kmalloc cache allocation cl
                   ` (4 subsequent siblings)
  17 siblings, 0 replies; 57+ messages in thread
From: cl @ 2009-10-06 23:37 UTC (permalink / raw)
  To: akpm; +Cc: linux-kernel, Pekka Enberg, Tejun Heo, Mel Gorman

[-- Attachment #1: this_cpu_slub_conversion --]
[-- Type: text/plain, Size: 12709 bytes --]

Using per cpu allocations removes the need for the per cpu arrays in the
kmem_cache struct. These arrays could get quite big if we have to support
systems with thousands of cpus. The use of this_cpu_xx operations results in:

1. The size of kmem_cache for SMP configurations shrinks since we only need
   1 pointer instead of NR_CPUS pointers. The same pointer can be used by all
   processors. This reduces the cache footprint of the allocator.

2. We can dynamically size kmem_cache according to the actual nodes in the
   system meaning less memory overhead for configurations that may potentially
   support up to 1k NUMA nodes / 4k cpus.

3. We can remove the fiddling with allocating and releasing of kmem_cache_cpu
   structures when bringing up and shutting down cpus. The cpu alloc logic
   will do it all for us. This removes some portions of the cpu hotplug
   functionality.

4. Fastpath performance increases since per cpu pointer lookups and
   address calculations are avoided (a short sketch follows below).
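
A short sketch of what this means for the fastpath (illustrative only,
assuming a struct kmem_cache *s in scope; the real hunks follow):

	struct kmem_cache_cpu *c;

	/* Before: pointer array in struct kmem_cache, indexed by cpu */
	c = get_cpu_slab(s, smp_processor_id());

	/* After: one per cpu allocation; the current cpu's instance is
	 * reached through the per cpu offset in a single step */
	c = __this_cpu_ptr(s->cpu_slab);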

V2->V3:
- Leave Linus' code ornament alone.

Cc: Pekka Enberg <penberg@cs.helsinki.fi>
Signed-off-by: Christoph Lameter <cl@linux-foundation.org>

---
 include/linux/slub_def.h |    6 -
 mm/slub.c                |  207 ++++++++++-------------------------------------
 2 files changed, 49 insertions(+), 164 deletions(-)

Index: linux-2.6/include/linux/slub_def.h
===================================================================
--- linux-2.6.orig/include/linux/slub_def.h	2009-09-17 17:51:51.000000000 -0500
+++ linux-2.6/include/linux/slub_def.h	2009-09-29 09:02:05.000000000 -0500
@@ -69,6 +69,7 @@ struct kmem_cache_order_objects {
  * Slab cache management.
  */
 struct kmem_cache {
+	struct kmem_cache_cpu *cpu_slab;
 	/* Used for retriving partial slabs etc */
 	unsigned long flags;
 	int size;		/* The size of an object including meta data */
@@ -104,11 +105,6 @@ struct kmem_cache {
 	int remote_node_defrag_ratio;
 	struct kmem_cache_node *node[MAX_NUMNODES];
 #endif
-#ifdef CONFIG_SMP
-	struct kmem_cache_cpu *cpu_slab[NR_CPUS];
-#else
-	struct kmem_cache_cpu cpu_slab;
-#endif
 };
 
 /*
Index: linux-2.6/mm/slub.c
===================================================================
--- linux-2.6.orig/mm/slub.c	2009-09-28 10:08:10.000000000 -0500
+++ linux-2.6/mm/slub.c	2009-09-29 09:02:05.000000000 -0500
@@ -242,15 +242,6 @@ static inline struct kmem_cache_node *ge
 #endif
 }
 
-static inline struct kmem_cache_cpu *get_cpu_slab(struct kmem_cache *s, int cpu)
-{
-#ifdef CONFIG_SMP
-	return s->cpu_slab[cpu];
-#else
-	return &s->cpu_slab;
-#endif
-}
-
 /* Verify that a pointer has an address that is valid within a slab page */
 static inline int check_valid_pointer(struct kmem_cache *s,
 				struct page *page, const void *object)
@@ -1124,7 +1115,7 @@ static struct page *allocate_slab(struct
 		if (!page)
 			return NULL;
 
-		stat(get_cpu_slab(s, raw_smp_processor_id()), ORDER_FALLBACK);
+		stat(this_cpu_ptr(s->cpu_slab), ORDER_FALLBACK);
 	}
 
 	if (kmemcheck_enabled
@@ -1422,7 +1413,7 @@ static struct page *get_partial(struct k
 static void unfreeze_slab(struct kmem_cache *s, struct page *page, int tail)
 {
 	struct kmem_cache_node *n = get_node(s, page_to_nid(page));
-	struct kmem_cache_cpu *c = get_cpu_slab(s, smp_processor_id());
+	struct kmem_cache_cpu *c = this_cpu_ptr(s->cpu_slab);
 
 	__ClearPageSlubFrozen(page);
 	if (page->inuse) {
@@ -1454,7 +1445,7 @@ static void unfreeze_slab(struct kmem_ca
 			slab_unlock(page);
 		} else {
 			slab_unlock(page);
-			stat(get_cpu_slab(s, raw_smp_processor_id()), FREE_SLAB);
+			stat(__this_cpu_ptr(s->cpu_slab), FREE_SLAB);
 			discard_slab(s, page);
 		}
 	}
@@ -1507,7 +1498,7 @@ static inline void flush_slab(struct kme
  */
 static inline void __flush_cpu_slab(struct kmem_cache *s, int cpu)
 {
-	struct kmem_cache_cpu *c = get_cpu_slab(s, cpu);
+	struct kmem_cache_cpu *c = per_cpu_ptr(s->cpu_slab, cpu);
 
 	if (likely(c && c->page))
 		flush_slab(s, c);
@@ -1673,7 +1661,7 @@ new_slab:
 		local_irq_disable();
 
 	if (new) {
-		c = get_cpu_slab(s, smp_processor_id());
+		c = __this_cpu_ptr(s->cpu_slab);
 		stat(c, ALLOC_SLAB);
 		if (c->page)
 			flush_slab(s, c);
@@ -1711,7 +1699,6 @@ static __always_inline void *slab_alloc(
 	void **object;
 	struct kmem_cache_cpu *c;
 	unsigned long flags;
-	unsigned int objsize;
 
 	gfpflags &= gfp_allowed_mask;
 
@@ -1722,24 +1709,23 @@ static __always_inline void *slab_alloc(
 		return NULL;
 
 	local_irq_save(flags);
-	c = get_cpu_slab(s, smp_processor_id());
-	objsize = c->objsize;
-	if (unlikely(!c->freelist || !node_match(c, node)))
+	c = __this_cpu_ptr(s->cpu_slab);
+	object = c->freelist;
+	if (unlikely(!object || !node_match(c, node)))
 
 		object = __slab_alloc(s, gfpflags, node, addr, c);
 
 	else {
-		object = c->freelist;
 		c->freelist = object[c->offset];
 		stat(c, ALLOC_FASTPATH);
 	}
 	local_irq_restore(flags);
 
 	if (unlikely((gfpflags & __GFP_ZERO) && object))
-		memset(object, 0, objsize);
+		memset(object, 0, s->objsize);
 
 	kmemcheck_slab_alloc(s, gfpflags, object, c->objsize);
-	kmemleak_alloc_recursive(object, objsize, 1, s->flags, gfpflags);
+	kmemleak_alloc_recursive(object, c->objsize, 1, s->flags, gfpflags);
 
 	return object;
 }
@@ -1800,7 +1786,7 @@ static void __slab_free(struct kmem_cach
 	void **object = (void *)x;
 	struct kmem_cache_cpu *c;
 
-	c = get_cpu_slab(s, raw_smp_processor_id());
+	c = __this_cpu_ptr(s->cpu_slab);
 	stat(c, FREE_SLOWPATH);
 	slab_lock(page);
 
@@ -1872,7 +1858,7 @@ static __always_inline void slab_free(st
 
 	kmemleak_free_recursive(x, s->flags);
 	local_irq_save(flags);
-	c = get_cpu_slab(s, smp_processor_id());
+	c = __this_cpu_ptr(s->cpu_slab);
 	kmemcheck_slab_free(s, object, c->objsize);
 	debug_check_no_locks_freed(object, c->objsize);
 	if (!(s->flags & SLAB_DEBUG_OBJECTS))
@@ -2095,130 +2081,28 @@ init_kmem_cache_node(struct kmem_cache_n
 #endif
 }
 
-#ifdef CONFIG_SMP
-/*
- * Per cpu array for per cpu structures.
- *
- * The per cpu array places all kmem_cache_cpu structures from one processor
- * close together meaning that it becomes possible that multiple per cpu
- * structures are contained in one cacheline. This may be particularly
- * beneficial for the kmalloc caches.
- *
- * A desktop system typically has around 60-80 slabs. With 100 here we are
- * likely able to get per cpu structures for all caches from the array defined
- * here. We must be able to cover all kmalloc caches during bootstrap.
- *
- * If the per cpu array is exhausted then fall back to kmalloc
- * of individual cachelines. No sharing is possible then.
- */
-#define NR_KMEM_CACHE_CPU 100
-
-static DEFINE_PER_CPU(struct kmem_cache_cpu [NR_KMEM_CACHE_CPU],
-		      kmem_cache_cpu);
-
-static DEFINE_PER_CPU(struct kmem_cache_cpu *, kmem_cache_cpu_free);
-static DECLARE_BITMAP(kmem_cach_cpu_free_init_once, CONFIG_NR_CPUS);
-
-static struct kmem_cache_cpu *alloc_kmem_cache_cpu(struct kmem_cache *s,
-							int cpu, gfp_t flags)
-{
-	struct kmem_cache_cpu *c = per_cpu(kmem_cache_cpu_free, cpu);
-
-	if (c)
-		per_cpu(kmem_cache_cpu_free, cpu) =
-				(void *)c->freelist;
-	else {
-		/* Table overflow: So allocate ourselves */
-		c = kmalloc_node(
-			ALIGN(sizeof(struct kmem_cache_cpu), cache_line_size()),
-			flags, cpu_to_node(cpu));
-		if (!c)
-			return NULL;
-	}
-
-	init_kmem_cache_cpu(s, c);
-	return c;
-}
-
-static void free_kmem_cache_cpu(struct kmem_cache_cpu *c, int cpu)
-{
-	if (c < per_cpu(kmem_cache_cpu, cpu) ||
-			c >= per_cpu(kmem_cache_cpu, cpu) + NR_KMEM_CACHE_CPU) {
-		kfree(c);
-		return;
-	}
-	c->freelist = (void *)per_cpu(kmem_cache_cpu_free, cpu);
-	per_cpu(kmem_cache_cpu_free, cpu) = c;
-}
-
-static void free_kmem_cache_cpus(struct kmem_cache *s)
-{
-	int cpu;
+static DEFINE_PER_CPU(struct kmem_cache_cpu, kmalloc_percpu[SLUB_PAGE_SHIFT]);
 
-	for_each_online_cpu(cpu) {
-		struct kmem_cache_cpu *c = get_cpu_slab(s, cpu);
-
-		if (c) {
-			s->cpu_slab[cpu] = NULL;
-			free_kmem_cache_cpu(c, cpu);
-		}
-	}
-}
-
-static int alloc_kmem_cache_cpus(struct kmem_cache *s, gfp_t flags)
-{
-	int cpu;
-
-	for_each_online_cpu(cpu) {
-		struct kmem_cache_cpu *c = get_cpu_slab(s, cpu);
-
-		if (c)
-			continue;
-
-		c = alloc_kmem_cache_cpu(s, cpu, flags);
-		if (!c) {
-			free_kmem_cache_cpus(s);
-			return 0;
-		}
-		s->cpu_slab[cpu] = c;
-	}
-	return 1;
-}
-
-/*
- * Initialize the per cpu array.
- */
-static void init_alloc_cpu_cpu(int cpu)
-{
-	int i;
-
-	if (cpumask_test_cpu(cpu, to_cpumask(kmem_cach_cpu_free_init_once)))
-		return;
-
-	for (i = NR_KMEM_CACHE_CPU - 1; i >= 0; i--)
-		free_kmem_cache_cpu(&per_cpu(kmem_cache_cpu, cpu)[i], cpu);
-
-	cpumask_set_cpu(cpu, to_cpumask(kmem_cach_cpu_free_init_once));
-}
-
-static void __init init_alloc_cpu(void)
+static inline int alloc_kmem_cache_cpus(struct kmem_cache *s, gfp_t flags)
 {
 	int cpu;
 
-	for_each_online_cpu(cpu)
-		init_alloc_cpu_cpu(cpu);
-  }
+	if (s < kmalloc_caches + SLUB_PAGE_SHIFT && s >= kmalloc_caches)
+		/*
+		 * Boot time creation of the kmalloc array. Use static per cpu data
+		 * since the per cpu allocator is not available yet.
+		 */
+		s->cpu_slab = per_cpu_var(kmalloc_percpu) + (s - kmalloc_caches);
+	else
+		s->cpu_slab =  alloc_percpu(struct kmem_cache_cpu);
 
-#else
-static inline void free_kmem_cache_cpus(struct kmem_cache *s) {}
-static inline void init_alloc_cpu(void) {}
+	if (!s->cpu_slab)
+		return 0;
 
-static inline int alloc_kmem_cache_cpus(struct kmem_cache *s, gfp_t flags)
-{
-	init_kmem_cache_cpu(s, &s->cpu_slab);
+	for_each_possible_cpu(cpu)
+		init_kmem_cache_cpu(s, per_cpu_ptr(s->cpu_slab, cpu));
 	return 1;
 }
-#endif
 
 #ifdef CONFIG_NUMA
 /*
@@ -2609,9 +2493,8 @@ static inline int kmem_cache_close(struc
 	int node;
 
 	flush_all(s);
-
+	free_percpu(s->cpu_slab);
 	/* Attempt to free all objects */
-	free_kmem_cache_cpus(s);
 	for_each_node_state(node, N_NORMAL_MEMORY) {
 		struct kmem_cache_node *n = get_node(s, node);
 
@@ -2760,7 +2643,19 @@ static noinline struct kmem_cache *dma_k
 	realsize = kmalloc_caches[index].objsize;
 	text = kasprintf(flags & ~SLUB_DMA, "kmalloc_dma-%d",
 			 (unsigned int)realsize);
-	s = kmalloc(kmem_size, flags & ~SLUB_DMA);
+
+	if (flags & __GFP_WAIT)
+		s = kmalloc(kmem_size, flags & ~SLUB_DMA);
+	else {
+		int i;
+
+		s = NULL;
+		for (i = 0; i < SLUB_PAGE_SHIFT; i++)
+			if (kmalloc_caches[i].size) {
+				s = kmalloc_caches + i;
+				break;
+			}
+	}
 
 	/*
 	 * Must defer sysfs creation to a workqueue because we don't know
@@ -3176,8 +3071,6 @@ void __init kmem_cache_init(void)
 	int i;
 	int caches = 0;
 
-	init_alloc_cpu();
-
 #ifdef CONFIG_NUMA
 	/*
 	 * Must first have the slab cache available for the allocations of the
@@ -3261,8 +3154,10 @@ void __init kmem_cache_init(void)
 
 #ifdef CONFIG_SMP
 	register_cpu_notifier(&slab_notifier);
-	kmem_size = offsetof(struct kmem_cache, cpu_slab) +
-				nr_cpu_ids * sizeof(struct kmem_cache_cpu *);
+#endif
+#ifdef CONFIG_NUMA
+	kmem_size = offsetof(struct kmem_cache, node) +
+				nr_node_ids * sizeof(struct kmem_cache_node *);
 #else
 	kmem_size = sizeof(struct kmem_cache);
 #endif
@@ -3365,7 +3260,7 @@ struct kmem_cache *kmem_cache_create(con
 		 * per cpu structures
 		 */
 		for_each_online_cpu(cpu)
-			get_cpu_slab(s, cpu)->objsize = s->objsize;
+			per_cpu_ptr(s->cpu_slab, cpu)->objsize = s->objsize;
 
 		s->inuse = max_t(int, s->inuse, ALIGN(size, sizeof(void *)));
 		up_write(&slub_lock);
@@ -3422,11 +3317,9 @@ static int __cpuinit slab_cpuup_callback
 	switch (action) {
 	case CPU_UP_PREPARE:
 	case CPU_UP_PREPARE_FROZEN:
-		init_alloc_cpu_cpu(cpu);
 		down_read(&slub_lock);
 		list_for_each_entry(s, &slab_caches, list)
-			s->cpu_slab[cpu] = alloc_kmem_cache_cpu(s, cpu,
-							GFP_KERNEL);
+			init_kmem_cache_cpu(s, per_cpu_ptr(s->cpu_slab, cpu));
 		up_read(&slub_lock);
 		break;
 
@@ -3436,13 +3329,9 @@ static int __cpuinit slab_cpuup_callback
 	case CPU_DEAD_FROZEN:
 		down_read(&slub_lock);
 		list_for_each_entry(s, &slab_caches, list) {
-			struct kmem_cache_cpu *c = get_cpu_slab(s, cpu);
-
 			local_irq_save(flags);
 			__flush_cpu_slab(s, cpu);
 			local_irq_restore(flags);
-			free_kmem_cache_cpu(c, cpu);
-			s->cpu_slab[cpu] = NULL;
 		}
 		up_read(&slub_lock);
 		break;
@@ -3928,7 +3817,7 @@ static ssize_t show_slab_objects(struct 
 		int cpu;
 
 		for_each_possible_cpu(cpu) {
-			struct kmem_cache_cpu *c = get_cpu_slab(s, cpu);
+			struct kmem_cache_cpu *c = per_cpu_ptr(s->cpu_slab, cpu);
 
 			if (!c || c->node < 0)
 				continue;
@@ -4353,7 +4242,7 @@ static int show_stat(struct kmem_cache *
 		return -ENOMEM;
 
 	for_each_online_cpu(cpu) {
-		unsigned x = get_cpu_slab(s, cpu)->stat[si];
+		unsigned x = per_cpu_ptr(s->cpu_slab, cpu)->stat[si];
 
 		data[cpu] = x;
 		sum += x;

-- 

^ permalink raw reply	[flat|nested] 57+ messages in thread

* [this_cpu_xx V5 15/19] SLUB: Get rid of dynamic DMA kmalloc cache allocation
  2009-10-06 23:36 [this_cpu_xx V5 00/19] Introduce per cpu atomic operations and avoid per cpu address arithmetic cl
                   ` (12 preceding siblings ...)
  2009-10-06 23:37 ` [this_cpu_xx V5 14/19] Use this_cpu operations in slub cl
@ 2009-10-06 23:37 ` cl
  2009-10-06 23:37 ` [this_cpu_xx V5 16/19] this_cpu: Remove slub kmem_cache fields cl
                   ` (3 subsequent siblings)
  17 siblings, 0 replies; 57+ messages in thread
From: cl @ 2009-10-06 23:37 UTC (permalink / raw)
  To: akpm; +Cc: linux-kernel, Tejun Heo, Mel Gorman, Pekka Enberg

[-- Attachment #1: this_cpu_slub_static_dma_kmalloc --]
[-- Type: text/plain, Size: 3687 bytes --]

Dynamic DMA kmalloc cache allocation is troublesome since the
new percpu allocator does not support allocations in atomic contexts.
Reserve some statically allocated kmalloc cache structures instead.
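
The idea, roughly (a sketch of the scheme, not the exact hunks below): with
CONFIG_ZONE_DMA the static kmalloc_caches[] array gets spare slots, and
dma_kmalloc_cache() claims a free slot (size == 0) instead of kmalloc'ing a
new kmem_cache:

	/* Spare slots reserved for potential DMA caches */
	#define KMALLOC_CACHES (2 * SLUB_PAGE_SHIFT - 6)

	for (i = 0; i < KMALLOC_CACHES; i++)
		if (!kmalloc_caches[i].size)
			break;			/* unused static slot */

	BUG_ON(i >= KMALLOC_CACHES);
	s = kmalloc_caches + i;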

Signed-off-by: Christoph Lameter <cl@linux-foundation.org>

---
 include/linux/slub_def.h |   19 +++++++++++--------
 mm/slub.c                |   24 ++++++++++--------------
 2 files changed, 21 insertions(+), 22 deletions(-)

Index: linux-2.6/include/linux/slub_def.h
===================================================================
--- linux-2.6.orig/include/linux/slub_def.h	2009-09-29 11:42:06.000000000 -0500
+++ linux-2.6/include/linux/slub_def.h	2009-09-29 11:43:18.000000000 -0500
@@ -131,11 +131,21 @@ struct kmem_cache {
 
 #define SLUB_PAGE_SHIFT (PAGE_SHIFT + 2)
 
+#ifdef CONFIG_ZONE_DMA
+#define SLUB_DMA __GFP_DMA
+/* Reserve extra caches for potential DMA use */
+#define KMALLOC_CACHES (2 * SLUB_PAGE_SHIFT - 6)
+#else
+/* Disable DMA functionality */
+#define SLUB_DMA (__force gfp_t)0
+#define KMALLOC_CACHES SLUB_PAGE_SHIFT
+#endif
+
 /*
  * We keep the general caches in an array of slab caches that are used for
  * 2^x bytes of allocations.
  */
-extern struct kmem_cache kmalloc_caches[SLUB_PAGE_SHIFT];
+extern struct kmem_cache kmalloc_caches[KMALLOC_CACHES];
 
 /*
  * Sorry that the following has to be that ugly but some versions of GCC
@@ -203,13 +213,6 @@ static __always_inline struct kmem_cache
 	return &kmalloc_caches[index];
 }
 
-#ifdef CONFIG_ZONE_DMA
-#define SLUB_DMA __GFP_DMA
-#else
-/* Disable DMA functionality */
-#define SLUB_DMA (__force gfp_t)0
-#endif
-
 void *kmem_cache_alloc(struct kmem_cache *, gfp_t);
 void *__kmalloc(size_t size, gfp_t flags);
 
Index: linux-2.6/mm/slub.c
===================================================================
--- linux-2.6.orig/mm/slub.c	2009-09-29 11:42:06.000000000 -0500
+++ linux-2.6/mm/slub.c	2009-09-29 11:43:18.000000000 -0500
@@ -2090,7 +2090,7 @@ static inline int alloc_kmem_cache_cpus(
 {
 	int cpu;
 
-	if (s < kmalloc_caches + SLUB_PAGE_SHIFT && s >= kmalloc_caches)
+	if (s < kmalloc_caches + KMALLOC_CACHES && s >= kmalloc_caches)
 		/*
 		 * Boot time creation of the kmalloc array. Use static per cpu data
 		 * since the per cpu allocator is not available yet.
@@ -2537,7 +2537,7 @@ EXPORT_SYMBOL(kmem_cache_destroy);
  *		Kmalloc subsystem
  *******************************************************************/
 
-struct kmem_cache kmalloc_caches[SLUB_PAGE_SHIFT] __cacheline_aligned;
+struct kmem_cache kmalloc_caches[KMALLOC_CACHES] __cacheline_aligned;
 EXPORT_SYMBOL(kmalloc_caches);
 
 static int __init setup_slub_min_order(char *str)
@@ -2627,6 +2627,7 @@ static noinline struct kmem_cache *dma_k
 	char *text;
 	size_t realsize;
 	unsigned long slabflags;
+	int i;
 
 	s = kmalloc_caches_dma[index];
 	if (s)
@@ -2647,18 +2648,13 @@ static noinline struct kmem_cache *dma_k
 	text = kasprintf(flags & ~SLUB_DMA, "kmalloc_dma-%d",
 			 (unsigned int)realsize);
 
-	if (flags & __GFP_WAIT)
-		s = kmalloc(kmem_size, flags & ~SLUB_DMA);
-	else {
-		int i;
+	s = NULL;
+	for (i = 0; i < KMALLOC_CACHES; i++)
+		if (!kmalloc_caches[i].size)
+			break;
 
-		s = NULL;
-		for (i = 0; i < SLUB_PAGE_SHIFT; i++)
-			if (kmalloc_caches[i].size) {
-				s = kmalloc_caches + i;
-				break;
-			}
-	}
+	BUG_ON(i >= KMALLOC_CACHES);
+	s = kmalloc_caches + i;
 
 	/*
 	 * Must defer sysfs creation to a workqueue because we don't know
@@ -2672,7 +2668,7 @@ static noinline struct kmem_cache *dma_k
 
 	if (!s || !text || !kmem_cache_open(s, flags, text,
 			realsize, ARCH_KMALLOC_MINALIGN, slabflags, NULL)) {
-		kfree(s);
+		s->size = 0;
 		kfree(text);
 		goto unlock_out;
 	}

-- 

^ permalink raw reply	[flat|nested] 57+ messages in thread

* [this_cpu_xx V5 16/19] this_cpu: Remove slub kmem_cache fields
  2009-10-06 23:36 [this_cpu_xx V5 00/19] Introduce per cpu atomic operations and avoid per cpu address arithmetic cl
                   ` (13 preceding siblings ...)
  2009-10-06 23:37 ` [this_cpu_xx V5 15/19] SLUB: Get rid of dynamic DMA kmalloc cache allocation cl
@ 2009-10-06 23:37 ` cl
  2009-10-06 23:37 ` [this_cpu_xx V5 17/19] Make slub statistics use this_cpu_inc cl
                   ` (2 subsequent siblings)
  17 siblings, 0 replies; 57+ messages in thread
From: cl @ 2009-10-06 23:37 UTC (permalink / raw)
  To: akpm; +Cc: linux-kernel, Pekka Enberg, Tejun Heo, Mel Gorman

[-- Attachment #1: this_cpu_slub_remove_fields --]
[-- Type: text/plain, Size: 7809 bytes --]

Remove the fields in struct kmem_cache_cpu that were used to cache data from
struct kmem_cache when the two were in different cachelines. The cacheline
that holds the per cpu array pointer now also holds these values. We can cut
the size of struct kmem_cache_cpu down to almost half.

The get_freepointer() and set_freepointer() functions that used to be intended
only for the slow path are now also useful for the hot path, since access to
the size field does not require touching an additional cacheline anymore.
This results in consistent use of these functions for setting the freepointer
of objects throughout SLUB.

Also we initialize all possible kmem_cache_cpu structures when a slab cache is
created. There is no need to initialize them when a processor or node comes
online.
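
For reference, the two helpers now used on both the hot and slow paths are
plain offset-based accessors on struct kmem_cache; their definitions in
mm/slub.c are essentially (shown here as a reminder, this patch only drops
the comment above them):

	static inline void *get_freepointer(struct kmem_cache *s, void *object)
	{
		return *(void **)(object + s->offset);
	}

	static inline void set_freepointer(struct kmem_cache *s, void *object,
						void *fp)
	{
		*(void **)(object + s->offset) = fp;
	}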

Cc: Pekka Enberg <penberg@cs.helsinki.fi>
Signed-off-by: Christoph Lameter <cl@linux-foundation.org>
---
 include/linux/slub_def.h |    2 -
 mm/slub.c                |   81 +++++++++++++----------------------------------
 2 files changed, 24 insertions(+), 59 deletions(-)

Index: linux-2.6/include/linux/slub_def.h
===================================================================
--- linux-2.6.orig/include/linux/slub_def.h	2009-09-29 11:44:03.000000000 -0500
+++ linux-2.6/include/linux/slub_def.h	2009-09-29 11:44:35.000000000 -0500
@@ -38,8 +38,6 @@ struct kmem_cache_cpu {
 	void **freelist;	/* Pointer to first free per cpu object */
 	struct page *page;	/* The slab from which we are allocating */
 	int node;		/* The node of the page (or -1 for debug) */
-	unsigned int offset;	/* Freepointer offset (in word units) */
-	unsigned int objsize;	/* Size of an object (from kmem_cache) */
 #ifdef CONFIG_SLUB_STATS
 	unsigned stat[NR_SLUB_STAT_ITEMS];
 #endif
Index: linux-2.6/mm/slub.c
===================================================================
--- linux-2.6.orig/mm/slub.c	2009-09-29 11:44:03.000000000 -0500
+++ linux-2.6/mm/slub.c	2009-09-29 11:44:35.000000000 -0500
@@ -260,13 +260,6 @@ static inline int check_valid_pointer(st
 	return 1;
 }
 
-/*
- * Slow version of get and set free pointer.
- *
- * This version requires touching the cache lines of kmem_cache which
- * we avoid to do in the fast alloc free paths. There we obtain the offset
- * from the page struct.
- */
 static inline void *get_freepointer(struct kmem_cache *s, void *object)
 {
 	return *(void **)(object + s->offset);
@@ -1473,10 +1466,10 @@ static void deactivate_slab(struct kmem_
 
 		/* Retrieve object from cpu_freelist */
 		object = c->freelist;
-		c->freelist = c->freelist[c->offset];
+		c->freelist = get_freepointer(s, c->freelist);
 
 		/* And put onto the regular freelist */
-		object[c->offset] = page->freelist;
+		set_freepointer(s, object, page->freelist);
 		page->freelist = object;
 		page->inuse--;
 	}
@@ -1635,7 +1628,7 @@ load_freelist:
 	if (unlikely(SLABDEBUG && PageSlubDebug(c->page)))
 		goto debug;
 
-	c->freelist = object[c->offset];
+	c->freelist = get_freepointer(s, object);
 	c->page->inuse = c->page->objects;
 	c->page->freelist = NULL;
 	c->node = page_to_nid(c->page);
@@ -1681,7 +1674,7 @@ debug:
 		goto another_slab;
 
 	c->page->inuse++;
-	c->page->freelist = object[c->offset];
+	c->page->freelist = get_freepointer(s, object);
 	c->node = -1;
 	goto unlock_out;
 }
@@ -1719,7 +1712,7 @@ static __always_inline void *slab_alloc(
 		object = __slab_alloc(s, gfpflags, node, addr, c);
 
 	else {
-		c->freelist = object[c->offset];
+		c->freelist = get_freepointer(s, object);
 		stat(c, ALLOC_FASTPATH);
 	}
 	local_irq_restore(flags);
@@ -1727,8 +1720,8 @@ static __always_inline void *slab_alloc(
 	if (unlikely((gfpflags & __GFP_ZERO) && object))
 		memset(object, 0, s->objsize);
 
-	kmemcheck_slab_alloc(s, gfpflags, object, c->objsize);
-	kmemleak_alloc_recursive(object, c->objsize, 1, s->flags, gfpflags);
+	kmemcheck_slab_alloc(s, gfpflags, object, s->objsize);
+	kmemleak_alloc_recursive(object, s->objsize, 1, s->flags, gfpflags);
 
 	return object;
 }
@@ -1783,7 +1776,7 @@ EXPORT_SYMBOL(kmem_cache_alloc_node_notr
  * handling required then we can return immediately.
  */
 static void __slab_free(struct kmem_cache *s, struct page *page,
-			void *x, unsigned long addr, unsigned int offset)
+			void *x, unsigned long addr)
 {
 	void *prior;
 	void **object = (void *)x;
@@ -1797,7 +1790,8 @@ static void __slab_free(struct kmem_cach
 		goto debug;
 
 checks_ok:
-	prior = object[offset] = page->freelist;
+	prior = page->freelist;
+	set_freepointer(s, object, prior);
 	page->freelist = object;
 	page->inuse--;
 
@@ -1862,16 +1856,16 @@ static __always_inline void slab_free(st
 	kmemleak_free_recursive(x, s->flags);
 	local_irq_save(flags);
 	c = __this_cpu_ptr(s->cpu_slab);
-	kmemcheck_slab_free(s, object, c->objsize);
-	debug_check_no_locks_freed(object, c->objsize);
+	kmemcheck_slab_free(s, object, s->objsize);
+	debug_check_no_locks_freed(object, s->objsize);
 	if (!(s->flags & SLAB_DEBUG_OBJECTS))
-		debug_check_no_obj_freed(object, c->objsize);
+		debug_check_no_obj_freed(object, s->objsize);
 	if (likely(page == c->page && c->node >= 0)) {
-		object[c->offset] = c->freelist;
+		set_freepointer(s, object, c->freelist);
 		c->freelist = object;
 		stat(c, FREE_FASTPATH);
 	} else
-		__slab_free(s, page, x, addr, c->offset);
+		__slab_free(s, page, x, addr);
 
 	local_irq_restore(flags);
 }
@@ -2058,19 +2052,6 @@ static unsigned long calculate_alignment
 	return ALIGN(align, sizeof(void *));
 }
 
-static void init_kmem_cache_cpu(struct kmem_cache *s,
-			struct kmem_cache_cpu *c)
-{
-	c->page = NULL;
-	c->freelist = NULL;
-	c->node = 0;
-	c->offset = s->offset / sizeof(void *);
-	c->objsize = s->objsize;
-#ifdef CONFIG_SLUB_STATS
-	memset(c->stat, 0, NR_SLUB_STAT_ITEMS * sizeof(unsigned));
-#endif
-}
-
 static void
 init_kmem_cache_node(struct kmem_cache_node *n, struct kmem_cache *s)
 {
@@ -2088,8 +2069,6 @@ static DEFINE_PER_CPU(struct kmem_cache_
 
 static inline int alloc_kmem_cache_cpus(struct kmem_cache *s, gfp_t flags)
 {
-	int cpu;
-
 	if (s < kmalloc_caches + KMALLOC_CACHES && s >= kmalloc_caches)
 		/*
 		 * Boot time creation of the kmalloc array. Use static per cpu data
@@ -2102,8 +2081,6 @@ static inline int alloc_kmem_cache_cpus(
 	if (!s->cpu_slab)
 		return 0;
 
-	for_each_possible_cpu(cpu)
-		init_kmem_cache_cpu(s, per_cpu_ptr(s->cpu_slab, cpu));
 	return 1;
 }
 
@@ -2387,8 +2364,16 @@ static int kmem_cache_open(struct kmem_c
 	if (!init_kmem_cache_nodes(s, gfpflags & ~SLUB_DMA))
 		goto error;
 
-	if (alloc_kmem_cache_cpus(s, gfpflags & ~SLUB_DMA))
+	if (!alloc_kmem_cache_cpus(s, gfpflags & ~SLUB_DMA))
+		return 0;
+
+	/*
+	 * gfp_flags would be flags & ~SLUB_DMA but the per cpu
+	 * allocator does not support it.
+	 */
+	if (s->cpu_slab)
 		return 1;
+
 	free_kmem_cache_nodes(s);
 error:
 	if (flags & SLAB_PANIC)
@@ -3245,22 +3230,12 @@ struct kmem_cache *kmem_cache_create(con
 	down_write(&slub_lock);
 	s = find_mergeable(size, align, flags, name, ctor);
 	if (s) {
-		int cpu;
-
 		s->refcount++;
 		/*
 		 * Adjust the object sizes so that we clear
 		 * the complete object on kzalloc.
 		 */
 		s->objsize = max(s->objsize, (int)size);
-
-		/*
-		 * And then we need to update the object size in the
-		 * per cpu structures
-		 */
-		for_each_online_cpu(cpu)
-			per_cpu_ptr(s->cpu_slab, cpu)->objsize = s->objsize;
-
 		s->inuse = max_t(int, s->inuse, ALIGN(size, sizeof(void *)));
 		up_write(&slub_lock);
 
@@ -3314,14 +3289,6 @@ static int __cpuinit slab_cpuup_callback
 	unsigned long flags;
 
 	switch (action) {
-	case CPU_UP_PREPARE:
-	case CPU_UP_PREPARE_FROZEN:
-		down_read(&slub_lock);
-		list_for_each_entry(s, &slab_caches, list)
-			init_kmem_cache_cpu(s, per_cpu_ptr(s->cpu_slab, cpu));
-		up_read(&slub_lock);
-		break;
-
 	case CPU_UP_CANCELED:
 	case CPU_UP_CANCELED_FROZEN:
 	case CPU_DEAD:

-- 

^ permalink raw reply	[flat|nested] 57+ messages in thread

* [this_cpu_xx V5 17/19] Make slub statistics use this_cpu_inc
  2009-10-06 23:36 [this_cpu_xx V5 00/19] Introduce per cpu atomic operations and avoid per cpu address arithmetic cl
                   ` (14 preceding siblings ...)
  2009-10-06 23:37 ` [this_cpu_xx V5 16/19] this_cpu: Remove slub kmem_cache fields cl
@ 2009-10-06 23:37 ` cl
  2009-10-06 23:37 ` [this_cpu_xx V5 18/19] this_cpu: slub aggressive use of this_cpu operations in the hotpaths cl
  2009-10-06 23:37 ` [this_cpu_xx V5 19/19] SLUB: Experimental new fastpath w/o interrupt disable cl
  17 siblings, 0 replies; 57+ messages in thread
From: cl @ 2009-10-06 23:37 UTC (permalink / raw)
  To: akpm; +Cc: linux-kernel, Pekka Enberg, Tejun Heo, Mel Gorman

[-- Attachment #1: this_cpu_slub_cleanup_stat --]
[-- Type: text/plain, Size: 5136 bytes --]

this_cpu_inc() translates into a single instruction on x86 and does not
need any register. So use it in stat(). We also want to avoid the
calculation of the per cpu kmem_cache_cpu structure pointer. So pass
a kmem_cache pointer instead of a kmem_cache_cpu pointer.

Cc: Pekka Enberg <penberg@cs.helsinki.fi>
Signed-off-by: Christoph Lameter <cl@linux-foundation.org>

---
 mm/slub.c |   43 ++++++++++++++++++++-----------------------
 1 file changed, 20 insertions(+), 23 deletions(-)

Index: linux-2.6/mm/slub.c
===================================================================
--- linux-2.6.orig/mm/slub.c	2009-09-29 11:44:35.000000000 -0500
+++ linux-2.6/mm/slub.c	2009-09-29 11:44:49.000000000 -0500
@@ -217,10 +217,10 @@ static inline void sysfs_slab_remove(str
 
 #endif
 
-static inline void stat(struct kmem_cache_cpu *c, enum stat_item si)
+static inline void stat(struct kmem_cache *s, enum stat_item si)
 {
 #ifdef CONFIG_SLUB_STATS
-	c->stat[si]++;
+	__this_cpu_inc(s->cpu_slab->stat[si]);
 #endif
 }
 
@@ -1108,7 +1108,7 @@ static struct page *allocate_slab(struct
 		if (!page)
 			return NULL;
 
-		stat(this_cpu_ptr(s->cpu_slab), ORDER_FALLBACK);
+		stat(s, ORDER_FALLBACK);
 	}
 
 	if (kmemcheck_enabled
@@ -1406,23 +1406,22 @@ static struct page *get_partial(struct k
 static void unfreeze_slab(struct kmem_cache *s, struct page *page, int tail)
 {
 	struct kmem_cache_node *n = get_node(s, page_to_nid(page));
-	struct kmem_cache_cpu *c = this_cpu_ptr(s->cpu_slab);
 
 	__ClearPageSlubFrozen(page);
 	if (page->inuse) {
 
 		if (page->freelist) {
 			add_partial(n, page, tail);
-			stat(c, tail ? DEACTIVATE_TO_TAIL : DEACTIVATE_TO_HEAD);
+			stat(s, tail ? DEACTIVATE_TO_TAIL : DEACTIVATE_TO_HEAD);
 		} else {
-			stat(c, DEACTIVATE_FULL);
+			stat(s, DEACTIVATE_FULL);
 			if (SLABDEBUG && PageSlubDebug(page) &&
 						(s->flags & SLAB_STORE_USER))
 				add_full(n, page);
 		}
 		slab_unlock(page);
 	} else {
-		stat(c, DEACTIVATE_EMPTY);
+		stat(s, DEACTIVATE_EMPTY);
 		if (n->nr_partial < s->min_partial) {
 			/*
 			 * Adding an empty slab to the partial slabs in order
@@ -1438,7 +1437,7 @@ static void unfreeze_slab(struct kmem_ca
 			slab_unlock(page);
 		} else {
 			slab_unlock(page);
-			stat(__this_cpu_ptr(s->cpu_slab), FREE_SLAB);
+			stat(s, FREE_SLAB);
 			discard_slab(s, page);
 		}
 	}
@@ -1453,7 +1452,7 @@ static void deactivate_slab(struct kmem_
 	int tail = 1;
 
 	if (page->freelist)
-		stat(c, DEACTIVATE_REMOTE_FREES);
+		stat(s, DEACTIVATE_REMOTE_FREES);
 	/*
 	 * Merge cpu freelist into slab freelist. Typically we get here
 	 * because both freelists are empty. So this is unlikely
@@ -1479,7 +1478,7 @@ static void deactivate_slab(struct kmem_
 
 static inline void flush_slab(struct kmem_cache *s, struct kmem_cache_cpu *c)
 {
-	stat(c, CPUSLAB_FLUSH);
+	stat(s, CPUSLAB_FLUSH);
 	slab_lock(c->page);
 	deactivate_slab(s, c);
 }
@@ -1619,7 +1618,7 @@ static void *__slab_alloc(struct kmem_ca
 	if (unlikely(!node_match(c, node)))
 		goto another_slab;
 
-	stat(c, ALLOC_REFILL);
+	stat(s, ALLOC_REFILL);
 
 load_freelist:
 	object = c->page->freelist;
@@ -1634,7 +1633,7 @@ load_freelist:
 	c->node = page_to_nid(c->page);
 unlock_out:
 	slab_unlock(c->page);
-	stat(c, ALLOC_SLOWPATH);
+	stat(s, ALLOC_SLOWPATH);
 	return object;
 
 another_slab:
@@ -1644,7 +1643,7 @@ new_slab:
 	new = get_partial(s, gfpflags, node);
 	if (new) {
 		c->page = new;
-		stat(c, ALLOC_FROM_PARTIAL);
+		stat(s, ALLOC_FROM_PARTIAL);
 		goto load_freelist;
 	}
 
@@ -1658,7 +1657,7 @@ new_slab:
 
 	if (new) {
 		c = __this_cpu_ptr(s->cpu_slab);
-		stat(c, ALLOC_SLAB);
+		stat(s, ALLOC_SLAB);
 		if (c->page)
 			flush_slab(s, c);
 		slab_lock(new);
@@ -1713,7 +1712,7 @@ static __always_inline void *slab_alloc(
 
 	else {
 		c->freelist = get_freepointer(s, object);
-		stat(c, ALLOC_FASTPATH);
+		stat(s, ALLOC_FASTPATH);
 	}
 	local_irq_restore(flags);
 
@@ -1780,10 +1779,8 @@ static void __slab_free(struct kmem_cach
 {
 	void *prior;
 	void **object = (void *)x;
-	struct kmem_cache_cpu *c;
 
-	c = __this_cpu_ptr(s->cpu_slab);
-	stat(c, FREE_SLOWPATH);
+	stat(s, FREE_SLOWPATH);
 	slab_lock(page);
 
 	if (unlikely(SLABDEBUG && PageSlubDebug(page)))
@@ -1796,7 +1793,7 @@ checks_ok:
 	page->inuse--;
 
 	if (unlikely(PageSlubFrozen(page))) {
-		stat(c, FREE_FROZEN);
+		stat(s, FREE_FROZEN);
 		goto out_unlock;
 	}
 
@@ -1809,7 +1806,7 @@ checks_ok:
 	 */
 	if (unlikely(!prior)) {
 		add_partial(get_node(s, page_to_nid(page)), page, 1);
-		stat(c, FREE_ADD_PARTIAL);
+		stat(s, FREE_ADD_PARTIAL);
 	}
 
 out_unlock:
@@ -1822,10 +1819,10 @@ slab_empty:
 		 * Slab still on the partial list.
 		 */
 		remove_partial(s, page);
-		stat(c, FREE_REMOVE_PARTIAL);
+		stat(s, FREE_REMOVE_PARTIAL);
 	}
 	slab_unlock(page);
-	stat(c, FREE_SLAB);
+	stat(s, FREE_SLAB);
 	discard_slab(s, page);
 	return;
 
@@ -1863,7 +1860,7 @@ static __always_inline void slab_free(st
 	if (likely(page == c->page && c->node >= 0)) {
 		set_freepointer(s, object, c->freelist);
 		c->freelist = object;
-		stat(c, FREE_FASTPATH);
+		stat(s, FREE_FASTPATH);
 	} else
 		__slab_free(s, page, x, addr);
 

-- 

^ permalink raw reply	[flat|nested] 57+ messages in thread

* [this_cpu_xx V5 18/19] this_cpu: slub aggressive use of this_cpu operations in the hotpaths
  2009-10-06 23:36 [this_cpu_xx V5 00/19] Introduce per cpu atomic operations and avoid per cpu address arithmetic cl
                   ` (15 preceding siblings ...)
  2009-10-06 23:37 ` [this_cpu_xx V5 17/19] Make slub statistics use this_cpu_inc cl
@ 2009-10-06 23:37 ` cl
  2009-10-06 23:37 ` [this_cpu_xx V5 19/19] SLUB: Experimental new fastpath w/o interrupt disable cl
  17 siblings, 0 replies; 57+ messages in thread
From: cl @ 2009-10-06 23:37 UTC (permalink / raw)
  To: akpm; +Cc: linux-kernel, Mathieu Desnoyers, Pekka Enberg, Tejun Heo,
	Mel Gorman

[-- Attachment #1: this_cpu_slub_aggressive_cpu_ops --]
[-- Type: text/plain, Size: 5869 bytes --]

Use this_cpu_* operations in the hotpath to avoid calculations of
kmem_cache_cpu pointer addresses.

On x86 there is a tradeoff: multiple uses of segment prefixes are traded
against one address calculation and the register pressure it causes. Code
size is reduced, so overall it is an advantage.

The use of segment prefixes is necessary if we want to use
Mathieu's scheme for fastpaths that do not require disabling
interrupts.
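
The pattern used throughout the hunks below, for illustration (assuming a
struct kmem_cache *s and an object pointer in scope): individual
kmem_cache_cpu fields are read and written through __this_cpu_read() /
__this_cpu_write(), each of which becomes a single segment-prefixed mov
on x86:

	/* Before: materialize the per cpu pointer, then dereference it */
	struct kmem_cache_cpu *c = __this_cpu_ptr(s->cpu_slab);
	object = c->freelist;
	c->freelist = get_freepointer(s, object);

	/* After: fold the per cpu relocation into each field access */
	object = __this_cpu_read(s->cpu_slab->freelist);
	__this_cpu_write(s->cpu_slab->freelist, get_freepointer(s, object));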

Cc: Mathieu Desnoyers <mathieu.desnoyers@polymtl.ca>
Cc: Pekka Enberg <penberg@cs.helsinki.fi>
Signed-off-by: Christoph Lameter <cl@linux-foundation.org>

---
 mm/slub.c |   80 ++++++++++++++++++++++++++++++--------------------------------
 1 file changed, 39 insertions(+), 41 deletions(-)

Index: linux-2.6/mm/slub.c
===================================================================
--- linux-2.6.orig/mm/slub.c	2009-09-30 15:58:20.000000000 -0500
+++ linux-2.6/mm/slub.c	2009-09-30 16:24:45.000000000 -0500
@@ -1512,10 +1512,10 @@ static void flush_all(struct kmem_cache 
  * Check if the objects in a per cpu structure fit numa
  * locality expectations.
  */
-static inline int node_match(struct kmem_cache_cpu *c, int node)
+static inline int node_match(struct kmem_cache *s, int node)
 {
 #ifdef CONFIG_NUMA
-	if (node != -1 && c->node != node)
+	if (node != -1 && __this_cpu_read(s->cpu_slab->node) != node)
 		return 0;
 #endif
 	return 1;
@@ -1603,46 +1603,46 @@ slab_out_of_memory(struct kmem_cache *s,
  * a call to the page allocator and the setup of a new slab.
  */
 static void *__slab_alloc(struct kmem_cache *s, gfp_t gfpflags, int node,
-			  unsigned long addr, struct kmem_cache_cpu *c)
+			  unsigned long addr)
 {
 	void **object;
-	struct page *new;
+	struct page *page = __this_cpu_read(s->cpu_slab->page);
 
 	/* We handle __GFP_ZERO in the caller */
 	gfpflags &= ~__GFP_ZERO;
 
-	if (!c->page)
+	if (!page)
 		goto new_slab;
 
-	slab_lock(c->page);
-	if (unlikely(!node_match(c, node)))
+	slab_lock(page);
+	if (unlikely(!node_match(s, node)))
 		goto another_slab;
 
 	stat(s, ALLOC_REFILL);
 
 load_freelist:
-	object = c->page->freelist;
+	object = page->freelist;
 	if (unlikely(!object))
 		goto another_slab;
-	if (unlikely(SLABDEBUG && PageSlubDebug(c->page)))
+	if (unlikely(SLABDEBUG && PageSlubDebug(page)))
 		goto debug;
 
-	c->freelist = get_freepointer(s, object);
-	c->page->inuse = c->page->objects;
-	c->page->freelist = NULL;
-	c->node = page_to_nid(c->page);
+	__this_cpu_write(s->cpu_slab->freelist, get_freepointer(s, object));
+	page->inuse = page->objects;
+	page->freelist = NULL;
+	__this_cpu_write(s->cpu_slab->node, page_to_nid(page));
 unlock_out:
-	slab_unlock(c->page);
+	slab_unlock(page);
 	stat(s, ALLOC_SLOWPATH);
 	return object;
 
 another_slab:
-	deactivate_slab(s, c);
+	deactivate_slab(s, __this_cpu_ptr(s->cpu_slab));
 
 new_slab:
-	new = get_partial(s, gfpflags, node);
-	if (new) {
-		c->page = new;
+	page = get_partial(s, gfpflags, node);
+	if (page) {
+		__this_cpu_write(s->cpu_slab->page, page);
 		stat(s, ALLOC_FROM_PARTIAL);
 		goto load_freelist;
 	}
@@ -1650,31 +1650,30 @@ new_slab:
 	if (gfpflags & __GFP_WAIT)
 		local_irq_enable();
 
-	new = new_slab(s, gfpflags, node);
+	page = new_slab(s, gfpflags, node);
 
 	if (gfpflags & __GFP_WAIT)
 		local_irq_disable();
 
-	if (new) {
-		c = __this_cpu_ptr(s->cpu_slab);
+	if (page) {
 		stat(s, ALLOC_SLAB);
-		if (c->page)
-			flush_slab(s, c);
-		slab_lock(new);
-		__SetPageSlubFrozen(new);
-		c->page = new;
+		if (__this_cpu_read(s->cpu_slab->page))
+			flush_slab(s, __this_cpu_ptr(s->cpu_slab));
+		slab_lock(page);
+		__SetPageSlubFrozen(page);
+		__this_cpu_write(s->cpu_slab->page, page);
 		goto load_freelist;
 	}
 	if (!(gfpflags & __GFP_NOWARN) && printk_ratelimit())
 		slab_out_of_memory(s, gfpflags, node);
 	return NULL;
 debug:
-	if (!alloc_debug_processing(s, c->page, object, addr))
+	if (!alloc_debug_processing(s, page, object, addr))
 		goto another_slab;
 
-	c->page->inuse++;
-	c->page->freelist = get_freepointer(s, object);
-	c->node = -1;
+	page->inuse++;
+	page->freelist = get_freepointer(s, object);
+	__this_cpu_write(s->cpu_slab->node, -1);
 	goto unlock_out;
 }
 
@@ -1692,7 +1691,6 @@ static __always_inline void *slab_alloc(
 		gfp_t gfpflags, int node, unsigned long addr)
 {
 	void **object;
-	struct kmem_cache_cpu *c;
 	unsigned long flags;
 
 	gfpflags &= gfp_allowed_mask;
@@ -1704,14 +1702,14 @@ static __always_inline void *slab_alloc(
 		return NULL;
 
 	local_irq_save(flags);
-	c = __this_cpu_ptr(s->cpu_slab);
-	object = c->freelist;
-	if (unlikely(!object || !node_match(c, node)))
+	object = __this_cpu_read(s->cpu_slab->freelist);
+	if (unlikely(!object || !node_match(s, node)))
 
-		object = __slab_alloc(s, gfpflags, node, addr, c);
+		object = __slab_alloc(s, gfpflags, node, addr);
 
 	else {
-		c->freelist = get_freepointer(s, object);
+		__this_cpu_write(s->cpu_slab->freelist,
+			get_freepointer(s, object));
 		stat(s, ALLOC_FASTPATH);
 	}
 	local_irq_restore(flags);
@@ -1847,19 +1845,19 @@ static __always_inline void slab_free(st
 			struct page *page, void *x, unsigned long addr)
 {
 	void **object = (void *)x;
-	struct kmem_cache_cpu *c;
 	unsigned long flags;
 
 	kmemleak_free_recursive(x, s->flags);
 	local_irq_save(flags);
-	c = __this_cpu_ptr(s->cpu_slab);
 	kmemcheck_slab_free(s, object, s->objsize);
 	debug_check_no_locks_freed(object, s->objsize);
 	if (!(s->flags & SLAB_DEBUG_OBJECTS))
 		debug_check_no_obj_freed(object, s->objsize);
-	if (likely(page == c->page && c->node >= 0)) {
-		set_freepointer(s, object, c->freelist);
-		c->freelist = object;
+
+	if (likely(page == __this_cpu_read(s->cpu_slab->page) &&
+			__this_cpu_read(s->cpu_slab->node) >= 0)) {
+		set_freepointer(s, object, __this_cpu_read(s->cpu_slab->freelist));
+		__this_cpu_write(s->cpu_slab->freelist, object);
 		stat(s, FREE_FASTPATH);
 	} else
 		__slab_free(s, page, x, addr);

-- 

^ permalink raw reply	[flat|nested] 57+ messages in thread

* [this_cpu_xx V5 19/19] SLUB: Experimental new fastpath w/o interrupt disable
  2009-10-06 23:36 [this_cpu_xx V5 00/19] Introduce per cpu atomic operations and avoid per cpu address arithmetic cl
                   ` (16 preceding siblings ...)
  2009-10-06 23:37 ` [this_cpu_xx V5 18/19] this_cpu: slub aggressive use of this_cpu operations in the hotpaths cl
@ 2009-10-06 23:37 ` cl
  2009-10-07  2:54   ` Mathieu Desnoyers
  17 siblings, 1 reply; 57+ messages in thread
From: cl @ 2009-10-06 23:37 UTC (permalink / raw)
  To: akpm; +Cc: linux-kernel, Mathieu Desnoyers, Pekka Enberg, Tejun Heo,
	Mel Gorman

[-- Attachment #1: this_cpu_slub_irqless --]
[-- Type: text/plain, Size: 7487 bytes --]

This is a bit of a different tack on things than the last version provided
by Matheiu.

Instead of using a cmpxchg we keep a state variable in the per cpu structure
that is incremented when we enter the hot path. We can then detect that
a thread is in the fastpath and fall back to alternate allocation / free
technique that bypasses fastpath caching.

A disadvantage is that we have to disable preempt. But if preempt is disabled
(like on most kernels that I run) then the hotpath becomes very efficient.
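
(A minimal standalone sketch of the detection idea only -- the names here are
made up and this is not the patch itself. ->active starts at -1; the first
entrant's increment brings it to 0, while a nested entrant, e.g. an allocation
from an interrupt handler on the same cpu, sees a non-zero value after its own
increment and has to bypass the per cpu cache:)

struct cpu_cache_sketch {
	void **freelist;
	int active;			/* -1 idle, 0 in fastpath, > 0 nested */
};

static int fastpath_enter(struct cpu_cache_sketch *c)
{
	c->active++;			/* irqsafe_cpu_inc() in the patch */
	return c->active == 0;		/* non-zero: someone is already in */
}

static void fastpath_exit(struct cpu_cache_sketch *c)
{
	c->active--;			/* irqsafe_cpu_dec() in the patch */
}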

Cc: Mathieu Desnoyers <mathieu.desnoyers@polymtl.ca>
Cc: Pekka Enberg <penberg@cs.helsinki.fi>
Signed-off-by: Christoph Lameter <cl@linux-foundation.org>


---
 include/linux/slub_def.h |    1 
 mm/slub.c                |   91 +++++++++++++++++++++++++++++++++++++----------
 2 files changed, 74 insertions(+), 18 deletions(-)

Index: linux-2.6/include/linux/slub_def.h
===================================================================
--- linux-2.6.orig/include/linux/slub_def.h	2009-10-01 15:53:15.000000000 -0500
+++ linux-2.6/include/linux/slub_def.h	2009-10-01 15:53:15.000000000 -0500
@@ -38,6 +38,7 @@ struct kmem_cache_cpu {
 	void **freelist;	/* Pointer to first free per cpu object */
 	struct page *page;	/* The slab from which we are allocating */
 	int node;		/* The node of the page (or -1 for debug) */
+	int active;		/* Active fastpaths */
 #ifdef CONFIG_SLUB_STATS
 	unsigned stat[NR_SLUB_STAT_ITEMS];
 #endif
Index: linux-2.6/mm/slub.c
===================================================================
--- linux-2.6.orig/mm/slub.c	2009-10-01 15:53:15.000000000 -0500
+++ linux-2.6/mm/slub.c	2009-10-01 15:53:15.000000000 -0500
@@ -1606,7 +1606,14 @@ static void *__slab_alloc(struct kmem_ca
 			  unsigned long addr)
 {
 	void **object;
-	struct page *page = __this_cpu_read(s->cpu_slab->page);
+	struct page *page;
+	unsigned long flags;
+	int hotpath;
+
+	local_irq_save(flags);
+	preempt_enable();	/* Get rid of count */
+	hotpath = __this_cpu_read(s->cpu_slab->active) != 0;
+	page = __this_cpu_read(s->cpu_slab->page);
 
 	/* We handle __GFP_ZERO in the caller */
 	gfpflags &= ~__GFP_ZERO;
@@ -1626,13 +1633,21 @@ load_freelist:
 		goto another_slab;
 	if (unlikely(SLABDEBUG && PageSlubDebug(page)))
 		goto debug;
-
-	__this_cpu_write(s->cpu_slab->freelist, get_freepointer(s, object));
-	page->inuse = page->objects;
-	page->freelist = NULL;
-	__this_cpu_write(s->cpu_slab->node, page_to_nid(page));
+	if (unlikely(hotpath)) {
+		/* Object on second free list available and hotpath busy */
+		page->inuse++;
+		page->freelist = get_freepointer(s, object);
+	} else {
+		/* Prepare new list of objects for hotpath */
+		__this_cpu_write(s->cpu_slab->freelist, get_freepointer(s, object));
+		page->inuse = page->objects;
+		page->freelist = NULL;
+		__this_cpu_write(s->cpu_slab->node, page_to_nid(page));
+	}
 unlock_out:
+	__this_cpu_dec(s->cpu_slab->active);
 	slab_unlock(page);
+	local_irq_restore(flags);
 	stat(s, ALLOC_SLOWPATH);
 	return object;
 
@@ -1642,8 +1657,12 @@ another_slab:
 new_slab:
 	page = get_partial(s, gfpflags, node);
 	if (page) {
-		__this_cpu_write(s->cpu_slab->page, page);
 		stat(s, ALLOC_FROM_PARTIAL);
+
+		if (hotpath)
+			goto hot_lock;
+
+		__this_cpu_write(s->cpu_slab->page, page);
 		goto load_freelist;
 	}
 
@@ -1657,6 +1676,10 @@ new_slab:
 
 	if (page) {
 		stat(s, ALLOC_SLAB);
+
+		if (hotpath)
+			 goto hot_no_lock;
+
 		if (__this_cpu_read(s->cpu_slab->page))
 			flush_slab(s, __this_cpu_ptr(s->cpu_slab));
 		slab_lock(page);
@@ -1664,6 +1687,10 @@ new_slab:
 		__this_cpu_write(s->cpu_slab->page, page);
 		goto load_freelist;
 	}
+
+	__this_cpu_dec(s->cpu_slab->active);
+	local_irq_restore(flags);
+
 	if (!(gfpflags & __GFP_NOWARN) && printk_ratelimit())
 		slab_out_of_memory(s, gfpflags, node);
 	return NULL;
@@ -1675,6 +1702,19 @@ debug:
 	page->freelist = get_freepointer(s, object);
 	__this_cpu_write(s->cpu_slab->node, -1);
 	goto unlock_out;
+
+	/*
+	 * Hotpath is busy and we need to avoid touching
+	 * hotpath variables
+	 */
+hot_no_lock:
+	slab_lock(page);
+hot_lock:
+	__ClearPageSlubFrozen(page);
+	if (get_freepointer(s, page->freelist))
+		/* Cannot put page into the hotpath. Instead back to partial */
+		add_partial(get_node(s, page_to_nid(page)), page, 0);
+	goto load_freelist;
 }
 
 /*
@@ -1691,7 +1731,6 @@ static __always_inline void *slab_alloc(
 		gfp_t gfpflags, int node, unsigned long addr)
 {
 	void **object;
-	unsigned long flags;
 
 	gfpflags &= gfp_allowed_mask;
 
@@ -1701,19 +1740,21 @@ static __always_inline void *slab_alloc(
 	if (should_failslab(s->objsize, gfpflags))
 		return NULL;
 
-	local_irq_save(flags);
+	preempt_disable();
+	irqsafe_cpu_inc(s->cpu_slab->active);
 	object = __this_cpu_read(s->cpu_slab->freelist);
-	if (unlikely(!object || !node_match(s, node)))
+	if (unlikely(!object || !node_match(s, node) ||
+			__this_cpu_read(s->cpu_slab->active)))
 
 		object = __slab_alloc(s, gfpflags, node, addr);
 
 	else {
 		__this_cpu_write(s->cpu_slab->freelist,
 			get_freepointer(s, object));
+		irqsafe_cpu_dec(s->cpu_slab->active);
+		preempt_enable();
 		stat(s, ALLOC_FASTPATH);
 	}
-	local_irq_restore(flags);
-
 	if (unlikely((gfpflags & __GFP_ZERO) && object))
 		memset(object, 0, s->objsize);
 
@@ -1777,6 +1818,11 @@ static void __slab_free(struct kmem_cach
 {
 	void *prior;
 	void **object = (void *)x;
+	unsigned long flags;
+
+	local_irq_save(flags);
+	preempt_enable();	/* Fix up count */
+	__this_cpu_dec(s->cpu_slab->active);
 
 	stat(s, FREE_SLOWPATH);
 	slab_lock(page);
@@ -1809,6 +1855,7 @@ checks_ok:
 
 out_unlock:
 	slab_unlock(page);
+	local_irq_restore(flags);
 	return;
 
 slab_empty:
@@ -1820,6 +1867,7 @@ slab_empty:
 		stat(s, FREE_REMOVE_PARTIAL);
 	}
 	slab_unlock(page);
+	local_irq_restore(flags);
 	stat(s, FREE_SLAB);
 	discard_slab(s, page);
 	return;
@@ -1845,24 +1893,26 @@ static __always_inline void slab_free(st
 			struct page *page, void *x, unsigned long addr)
 {
 	void **object = (void *)x;
-	unsigned long flags;
 
 	kmemleak_free_recursive(x, s->flags);
-	local_irq_save(flags);
 	kmemcheck_slab_free(s, object, s->objsize);
 	debug_check_no_locks_freed(object, s->objsize);
 	if (!(s->flags & SLAB_DEBUG_OBJECTS))
 		debug_check_no_obj_freed(object, s->objsize);
 
+	preempt_disable();
+	irqsafe_cpu_inc(s->cpu_slab->active);
 	if (likely(page == __this_cpu_read(s->cpu_slab->page) &&
-			__this_cpu_read(s->cpu_slab->node) >= 0)) {
-		set_freepointer(s, object, __this_cpu_read(s->cpu_slab->freelist));
+			__this_cpu_read(s->cpu_slab->node) >= 0) &&
+			!__this_cpu_read(s->cpu_slab->active)) {
+		set_freepointer(s, object,
+			__this_cpu_read(s->cpu_slab->freelist));
 		__this_cpu_write(s->cpu_slab->freelist, object);
+		irqsafe_cpu_dec(s->cpu_slab->active);
+		preempt_enable();
 		stat(s, FREE_FASTPATH);
 	} else
 		__slab_free(s, page, x, addr);
-
-	local_irq_restore(flags);
 }
 
 void kmem_cache_free(struct kmem_cache *s, void *x)
@@ -2064,6 +2114,8 @@ static DEFINE_PER_CPU(struct kmem_cache_
 
 static inline int alloc_kmem_cache_cpus(struct kmem_cache *s, gfp_t flags)
 {
+	int cpu;
+
 	if (s < kmalloc_caches + KMALLOC_CACHES && s >= kmalloc_caches)
 		/*
 		 * Boot time creation of the kmalloc array. Use static per cpu data
@@ -2073,6 +2125,9 @@ static inline int alloc_kmem_cache_cpus(
 	else
 		s->cpu_slab =  alloc_percpu(struct kmem_cache_cpu);
 
+	for_each_possible_cpu(cpu)
+		per_cpu_ptr(s->cpu_slab, cpu)->active = -1;
+
 	if (!s->cpu_slab)
 		return 0;
 

-- 

^ permalink raw reply	[flat|nested] 57+ messages in thread

* Re: [this_cpu_xx V5 01/19] Introduce this_cpu_ptr() and generic this_cpu_* operations
  2009-10-06 23:36 ` [this_cpu_xx V5 01/19] Introduce this_cpu_ptr() and generic this_cpu_* operations cl
@ 2009-10-06 23:52   ` Tejun Heo
  2009-10-07 14:23     ` Christoph Lameter
  0 siblings, 1 reply; 57+ messages in thread
From: Tejun Heo @ 2009-10-06 23:52 UTC (permalink / raw)
  To: cl
  Cc: akpm, linux-kernel, David Howells, Ingo Molnar, Rusty Russell,
	Eric Dumazet, Mel Gorman, Pekka Enberg

cl@linux-foundation.org wrote:
> This patch introduces two things: First this_cpu_ptr and then per cpu
> atomic operations.

percpu#for-next is for the most part a stable tree and 1-12 are
already in the tree.  Can you please send the changes based on the
current percpu#for-next?

Thanks.

-- 
tejun

^ permalink raw reply	[flat|nested] 57+ messages in thread

* Re: [this_cpu_xx V5 19/19] SLUB: Experimental new fastpath w/o interrupt disable
  2009-10-06 23:37 ` [this_cpu_xx V5 19/19] SLUB: Experimental new fastpath w/o interrupt disable cl
@ 2009-10-07  2:54   ` Mathieu Desnoyers
  2009-10-07  9:11     ` Peter Zijlstra
  2009-10-07 15:25     ` Christoph Lameter
  0 siblings, 2 replies; 57+ messages in thread
From: Mathieu Desnoyers @ 2009-10-07  2:54 UTC (permalink / raw)
  To: cl; +Cc: akpm, linux-kernel, Pekka Enberg, Tejun Heo, Mel Gorman

* cl@linux-foundation.org (cl@linux-foundation.org) wrote:
> This is a bit of a different tack on things than the last version provided
> by Matheiu.
> 

Hi Christoph,

Btw, my name is "Math-ieu" ;) I'm not offended by you fat-fingering my
name, it's just rather funny. :)

Please see comments below,

> Instead of using a cmpxchg we keep a state variable in the per cpu structure
> that is incremented when we enter the hot path. We can then detect that
> a thread is in the fastpath and fall back to alternate allocation / free
> technique that bypasses fastpath caching.
> 
> A disadvantage is that we have to disable preempt. But if preempt is disabled
> (like on most kernels that I run) then the hotpath becomes very efficient.
> 
> Cc: Mathieu Desnoyers <mathieu.desnoyers@polymtl.ca>
> Cc: Pekka Enberg <penberg@cs.helsinki.fi>
> Signed-off-by: Christoph Lameter <cl@linux-foundation.org>
> 
> 
> ---
>  include/linux/slub_def.h |    1 
>  mm/slub.c                |   91 +++++++++++++++++++++++++++++++++++++----------
>  2 files changed, 74 insertions(+), 18 deletions(-)
> 
> Index: linux-2.6/include/linux/slub_def.h
> ===================================================================
> --- linux-2.6.orig/include/linux/slub_def.h	2009-10-01 15:53:15.000000000 -0500
> +++ linux-2.6/include/linux/slub_def.h	2009-10-01 15:53:15.000000000 -0500
> @@ -38,6 +38,7 @@ struct kmem_cache_cpu {
>  	void **freelist;	/* Pointer to first free per cpu object */
>  	struct page *page;	/* The slab from which we are allocating */
>  	int node;		/* The node of the page (or -1 for debug) */
> +	int active;		/* Active fastpaths */
>  #ifdef CONFIG_SLUB_STATS
>  	unsigned stat[NR_SLUB_STAT_ITEMS];
>  #endif
> Index: linux-2.6/mm/slub.c
> ===================================================================
> --- linux-2.6.orig/mm/slub.c	2009-10-01 15:53:15.000000000 -0500
> +++ linux-2.6/mm/slub.c	2009-10-01 15:53:15.000000000 -0500
> @@ -1606,7 +1606,14 @@ static void *__slab_alloc(struct kmem_ca
>  			  unsigned long addr)
>  {
>  	void **object;
> -	struct page *page = __this_cpu_read(s->cpu_slab->page);
> +	struct page *page;
> +	unsigned long flags;
> +	int hotpath;
> +
> +	local_irq_save(flags);
> +	preempt_enable();	/* Get rid of count */

Ugh ? Is that legit ?

Checking preempt in irq off context might have weird side-effects on
scheduler real-time behavior (at least). You end up quitting a preempt
off section in irq off mode. So the sched check is skipped, and later
you only re-enable interrupts, which depends on having had a timer
interrupt waiting on the interrupt line to ensure scheduler timings.
But if the timer interrupt arrived while you were in the preempt off
section, you're doomed. The scheduler will not be woken up at interrupt
enable.
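
(For reference, this is why the check is lost: preempt_enable() ends up in
preempt_schedule() via preempt_check_resched(), and preempt_schedule() bails
out when interrupts are disabled. Paraphrased from kernel/sched.c of this
era, not an exact quote:)

asmlinkage void __sched preempt_schedule(void)
{
	struct thread_info *ti = current_thread_info();

	/*
	 * Do not preempt if a preempt count is held or if interrupts are
	 * disabled -- exactly the state created by calling preempt_enable()
	 * inside a local_irq_save() section.
	 */
	if (likely(ti->preempt_count || irqs_disabled()))
		return;

	/* ... otherwise reschedule with PREEMPT_ACTIVE set ... */
}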

> +	hotpath = __this_cpu_read(s->cpu_slab->active) != 0;
> +	page = __this_cpu_read(s->cpu_slab->page);
>  
>  	/* We handle __GFP_ZERO in the caller */
>  	gfpflags &= ~__GFP_ZERO;
> @@ -1626,13 +1633,21 @@ load_freelist:
>  		goto another_slab;
>  	if (unlikely(SLABDEBUG && PageSlubDebug(page)))
>  		goto debug;
> -
> -	__this_cpu_write(s->cpu_slab->freelist, get_freepointer(s, object));
> -	page->inuse = page->objects;
> -	page->freelist = NULL;
> -	__this_cpu_write(s->cpu_slab->node, page_to_nid(page));
> +	if (unlikely(hotpath)) {
> +		/* Object on second free list available and hotpath busy */
> +		page->inuse++;
> +		page->freelist = get_freepointer(s, object);
> +	} else {
> +		/* Prepare new list of objects for hotpath */
> +		__this_cpu_write(s->cpu_slab->freelist, get_freepointer(s, object));
> +		page->inuse = page->objects;
> +		page->freelist = NULL;
> +		__this_cpu_write(s->cpu_slab->node, page_to_nid(page));
> +	}
>  unlock_out:
> +	__this_cpu_dec(s->cpu_slab->active);
>  	slab_unlock(page);
> +	local_irq_restore(flags);
>  	stat(s, ALLOC_SLOWPATH);
>  	return object;
>  
> @@ -1642,8 +1657,12 @@ another_slab:
>  new_slab:
>  	page = get_partial(s, gfpflags, node);
>  	if (page) {
> -		__this_cpu_write(s->cpu_slab->page, page);
>  		stat(s, ALLOC_FROM_PARTIAL);
> +
> +		if (hotpath)
> +			goto hot_lock;
> +
> +		__this_cpu_write(s->cpu_slab->page, page);
>  		goto load_freelist;
>  	}
>  
> @@ -1657,6 +1676,10 @@ new_slab:
>  
>  	if (page) {
>  		stat(s, ALLOC_SLAB);
> +
> +		if (hotpath)
> +			 goto hot_no_lock;
> +
>  		if (__this_cpu_read(s->cpu_slab->page))
>  			flush_slab(s, __this_cpu_ptr(s->cpu_slab));
>  		slab_lock(page);
> @@ -1664,6 +1687,10 @@ new_slab:
>  		__this_cpu_write(s->cpu_slab->page, page);
>  		goto load_freelist;
>  	}
> +
> +	__this_cpu_dec(s->cpu_slab->active);
> +	local_irq_restore(flags);
> +
>  	if (!(gfpflags & __GFP_NOWARN) && printk_ratelimit())
>  		slab_out_of_memory(s, gfpflags, node);
>  	return NULL;
> @@ -1675,6 +1702,19 @@ debug:
>  	page->freelist = get_freepointer(s, object);
>  	__this_cpu_write(s->cpu_slab->node, -1);
>  	goto unlock_out;
> +
> +	/*
> +	 * Hotpath is busy and we need to avoid touching
> +	 * hotpath variables
> +	 */
> +hot_no_lock:
> +	slab_lock(page);
> +hot_lock:
> +	__ClearPageSlubFrozen(page);
> +	if (get_freepointer(s, page->freelist))
> +		/* Cannot put page into the hotpath. Instead back to partial */
> +		add_partial(get_node(s, page_to_nid(page)), page, 0);
> +	goto load_freelist;
>  }
>  
>  /*
> @@ -1691,7 +1731,6 @@ static __always_inline void *slab_alloc(
>  		gfp_t gfpflags, int node, unsigned long addr)
>  {
>  	void **object;
> -	unsigned long flags;
>  
>  	gfpflags &= gfp_allowed_mask;
>  
> @@ -1701,19 +1740,21 @@ static __always_inline void *slab_alloc(
>  	if (should_failslab(s->objsize, gfpflags))
>  		return NULL;
>  
> -	local_irq_save(flags);
> +	preempt_disable();
> +	irqsafe_cpu_inc(s->cpu_slab->active);
>  	object = __this_cpu_read(s->cpu_slab->freelist);
> -	if (unlikely(!object || !node_match(s, node)))
> +	if (unlikely(!object || !node_match(s, node) ||
> +			__this_cpu_read(s->cpu_slab->active)))

Interesting approach ! I just wonder about the impact of the
irq off / preempt enable dance.

Mathieu

>  
>  		object = __slab_alloc(s, gfpflags, node, addr);
>  
>  	else {
>  		__this_cpu_write(s->cpu_slab->freelist,
>  			get_freepointer(s, object));
> +		irqsafe_cpu_dec(s->cpu_slab->active);
> +		preempt_enable();
>  		stat(s, ALLOC_FASTPATH);
>  	}
> -	local_irq_restore(flags);
> -
>  	if (unlikely((gfpflags & __GFP_ZERO) && object))
>  		memset(object, 0, s->objsize);
>  
> @@ -1777,6 +1818,11 @@ static void __slab_free(struct kmem_cach
>  {
>  	void *prior;
>  	void **object = (void *)x;
> +	unsigned long flags;
> +
> +	local_irq_save(flags);
> +	preempt_enable();	/* Fix up count */
> +	__this_cpu_dec(s->cpu_slab->active);
>  
>  	stat(s, FREE_SLOWPATH);
>  	slab_lock(page);
> @@ -1809,6 +1855,7 @@ checks_ok:
>  
>  out_unlock:
>  	slab_unlock(page);
> +	local_irq_restore(flags);
>  	return;
>  
>  slab_empty:
> @@ -1820,6 +1867,7 @@ slab_empty:
>  		stat(s, FREE_REMOVE_PARTIAL);
>  	}
>  	slab_unlock(page);
> +	local_irq_restore(flags);
>  	stat(s, FREE_SLAB);
>  	discard_slab(s, page);
>  	return;
> @@ -1845,24 +1893,26 @@ static __always_inline void slab_free(st
>  			struct page *page, void *x, unsigned long addr)
>  {
>  	void **object = (void *)x;
> -	unsigned long flags;
>  
>  	kmemleak_free_recursive(x, s->flags);
> -	local_irq_save(flags);
>  	kmemcheck_slab_free(s, object, s->objsize);
>  	debug_check_no_locks_freed(object, s->objsize);
>  	if (!(s->flags & SLAB_DEBUG_OBJECTS))
>  		debug_check_no_obj_freed(object, s->objsize);
>  
> +	preempt_disable();
> +	irqsafe_cpu_inc(s->cpu_slab->active);
>  	if (likely(page == __this_cpu_read(s->cpu_slab->page) &&
> -			__this_cpu_read(s->cpu_slab->node) >= 0)) {
> -		set_freepointer(s, object, __this_cpu_read(s->cpu_slab->freelist));
> +			__this_cpu_read(s->cpu_slab->node) >= 0) &&
> +			!__this_cpu_read(s->cpu_slab->active)) {
> +		set_freepointer(s, object,
> +			__this_cpu_read(s->cpu_slab->freelist));
>  		__this_cpu_write(s->cpu_slab->freelist, object);
> +		irqsafe_cpu_dec(s->cpu_slab->active);
> +		preempt_enable();
>  		stat(s, FREE_FASTPATH);
>  	} else
>  		__slab_free(s, page, x, addr);
> -
> -	local_irq_restore(flags);
>  }
>  
>  void kmem_cache_free(struct kmem_cache *s, void *x)
> @@ -2064,6 +2114,8 @@ static DEFINE_PER_CPU(struct kmem_cache_
>  
>  static inline int alloc_kmem_cache_cpus(struct kmem_cache *s, gfp_t flags)
>  {
> +	int cpu;
> +
>  	if (s < kmalloc_caches + KMALLOC_CACHES && s >= kmalloc_caches)
>  		/*
>  		 * Boot time creation of the kmalloc array. Use static per cpu data
> @@ -2073,6 +2125,9 @@ static inline int alloc_kmem_cache_cpus(
>  	else
>  		s->cpu_slab =  alloc_percpu(struct kmem_cache_cpu);
>  
> +	for_each_possible_cpu(cpu)
> +		per_cpu_ptr(s->cpu_slab, cpu)->active = -1;
> +
>  	if (!s->cpu_slab)
>  		return 0;
>  
> 
> -- 

-- 
Mathieu Desnoyers
OpenPGP key fingerprint: 8CD5 52C3 8E3C 4140 715F  BA06 3F25 A8FE 3BAE 9A68

^ permalink raw reply	[flat|nested] 57+ messages in thread

* Re: [this_cpu_xx V5 19/19] SLUB: Experimental new fastpath w/o interrupt disable
  2009-10-07  2:54   ` Mathieu Desnoyers
@ 2009-10-07  9:11     ` Peter Zijlstra
  2009-10-07 12:46       ` Mathieu Desnoyers
  2009-10-07 15:25     ` Christoph Lameter
  1 sibling, 1 reply; 57+ messages in thread
From: Peter Zijlstra @ 2009-10-07  9:11 UTC (permalink / raw)
  To: Mathieu Desnoyers
  Cc: cl, akpm, linux-kernel, Pekka Enberg, Tejun Heo, Mel Gorman

On Tue, 2009-10-06 at 22:54 -0400, Mathieu Desnoyers wrote:
> > +     local_irq_save(flags);
> > +     preempt_enable();       /* Get rid of count */
> 
> Ugh ? Is that legit ?

Yeah, it reads rather awkward, and the comment doesn't make it any
better, but what he's doing is:

slab_alloc()
  preempt_disable();
  __slab_alloc()
    local_irq_save(flags);
    preempt_enable();



^ permalink raw reply	[flat|nested] 57+ messages in thread

* Re: [this_cpu_xx V5 19/19] SLUB: Experimental new fastpath w/o interrupt disable
  2009-10-07  9:11     ` Peter Zijlstra
@ 2009-10-07 12:46       ` Mathieu Desnoyers
  2009-10-07 13:01         ` Peter Zijlstra
  2009-10-07 14:42         ` Christoph Lameter
  0 siblings, 2 replies; 57+ messages in thread
From: Mathieu Desnoyers @ 2009-10-07 12:46 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: cl, akpm, linux-kernel, Pekka Enberg, Tejun Heo, Mel Gorman,
	mingo

* Peter Zijlstra (peterz@infradead.org) wrote:
> On Tue, 2009-10-06 at 22:54 -0400, Mathieu Desnoyers wrote:
> > > +     local_irq_save(flags);
> > > +     preempt_enable();       /* Get rid of count */
> > 
> > Ugh ? Is that legit ?
> 
> Yeah, it reads rather awkward, and the comment doesn't make it any
> better, but what he's doing is:
> 
> slab_alloc()
>   preempt_disable();
>   __slab_alloc()
>     local_irq_save(flags);
>     preempt_enable();
> 

Yes, I understood this is what he was doing, but I wonder about the
impact on the scheduler. If we have:

* Jiffy 1 -- timer interrupt

* preempt disable
* Jiffy 2 -- timer interrupt
  -> here, the scheduler is disabled, so the timer interrupt is skipped.
     The scheduler depends on preempt_check_resched() at preempt_enable()
     to execute in a bounded amount of time.
* local_irq_save
* preempt_enable
  -> interrupts are disabled, scheduler execution is skipped.
* local_irq_restore
  -> the interrupt line is low. The scheduler won't be called. There is
     no preempt_check_resched() call.

* Jiffy 3 -- timer interrupt
  -> Scheduler finally gets executed, missing a whole jiffy.

At the very least, I think an explicit preempt_check_resched should be
added after local_irq_restore().

Also, preempt_enable here should be replaced with
preempt_enable_no_resched().
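
(A sketch of the ordering suggested above -- the placement is assumed, this
is not a patch. No resched check runs while interrupts are off, and an
explicit one runs once they are back on:)

static void slow_path_shape_sketch(void)
{
	unsigned long flags;

	local_irq_save(flags);
	preempt_enable_no_resched();	/* drop the count, skip the check */

	/* ... irq-off slow path work ... */

	local_irq_restore(flags);
	preempt_check_resched();	/* pending reschedule handled here */
}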

Mathieu

-- 
Mathieu Desnoyers
OpenPGP key fingerprint: 8CD5 52C3 8E3C 4140 715F  BA06 3F25 A8FE 3BAE 9A68

^ permalink raw reply	[flat|nested] 57+ messages in thread

* Re: [this_cpu_xx V5 19/19] SLUB: Experimental new fastpath w/o interrupt disable
  2009-10-07 12:46       ` Mathieu Desnoyers
@ 2009-10-07 13:01         ` Peter Zijlstra
  2009-10-07 13:31           ` Mathieu Desnoyers
  2009-10-07 14:21           ` Christoph Lameter
  2009-10-07 14:42         ` Christoph Lameter
  1 sibling, 2 replies; 57+ messages in thread
From: Peter Zijlstra @ 2009-10-07 13:01 UTC (permalink / raw)
  To: Mathieu Desnoyers
  Cc: cl, akpm, linux-kernel, Pekka Enberg, Tejun Heo, Mel Gorman,
	mingo

On Wed, 2009-10-07 at 08:46 -0400, Mathieu Desnoyers wrote:
> * local_irq_restore
>   -> the interrupt line is low. The scheduler won't be called. There is
>      no preempt_check_resched() call.

That would be an issue with all irq disable sections, so I don't think
this is actually true.

I tried to find the actual code, but frigging paravirt crap obfuscated
the code enough that I actually gave up.

^ permalink raw reply	[flat|nested] 57+ messages in thread

* Re: [this_cpu_xx V5 19/19] SLUB: Experimental new fastpath w/o interrupt disable
  2009-10-07 13:01         ` Peter Zijlstra
@ 2009-10-07 13:31           ` Mathieu Desnoyers
  2009-10-07 14:37             ` Peter Zijlstra
  2009-10-07 14:21           ` Christoph Lameter
  1 sibling, 1 reply; 57+ messages in thread
From: Mathieu Desnoyers @ 2009-10-07 13:31 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: cl, akpm, linux-kernel, Pekka Enberg, Tejun Heo, Mel Gorman,
	mingo

* Peter Zijlstra (peterz@infradead.org) wrote:
> On Wed, 2009-10-07 at 08:46 -0400, Mathieu Desnoyers wrote:
> > * local_irq_restore
> >   -> the interrupt line is low. The scheduler won't be called. There is
> >      no preempt_check_resched() call.
> 
> That would be an issue with all irq disable sections, so I don't think
> this is actually true.
> 

AFAIK, irq disable sections rely on the fact that if you get a timer
interrupt during this section, the timer interrupt line stays triggered
for the duration of the irqoff section. Therefore, when interrupts are
re-enabled, the interrupt kicks in, so does the scheduler.

This is not the case with the preempt/irqoff dance proposed by
Christoph.

> I tried to find the actual code, but frigging paravirt crap obfuscated
> the code enough that I actually gave up.

You're probably looking for:

arch/x86/include/asm/irqflags.h:

static inline void native_irq_enable(void)
{
        asm volatile("sti": : :"memory");
}

and

static inline void native_restore_fl(unsigned long flags)
{
        asm volatile("push %0 ; popf"
                     : /* no output */
                     :"g" (flags)
                     :"memory", "cc");
}

static inline void raw_local_irq_restore(unsigned long flags)
{
        native_restore_fl(flags);
}

Which are as simple as it gets.

Thanks,

Mathieu

-- 
Mathieu Desnoyers
OpenPGP key fingerprint: 8CD5 52C3 8E3C 4140 715F  BA06 3F25 A8FE 3BAE 9A68

^ permalink raw reply	[flat|nested] 57+ messages in thread

* Re: [this_cpu_xx V5 19/19] SLUB: Experimental new fastpath w/o interrupt disable
  2009-10-07 13:01         ` Peter Zijlstra
  2009-10-07 13:31           ` Mathieu Desnoyers
@ 2009-10-07 14:21           ` Christoph Lameter
  1 sibling, 0 replies; 57+ messages in thread
From: Christoph Lameter @ 2009-10-07 14:21 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Mathieu Desnoyers, akpm, linux-kernel, Pekka Enberg, Tejun Heo,
	Mel Gorman, mingo

On Wed, 7 Oct 2009, Peter Zijlstra wrote:

> On Wed, 2009-10-07 at 08:46 -0400, Mathieu Desnoyers wrote:
> > * local_irq_restore
> >   -> the interrupt line is low. The scheduler won't be called. There is
> >      no preempt_check_resched() call.
>
> That would be an issue with all irq disable sections, so I don't think
> this is actually true.

This was not true in the past... News to me that an irq enable would not
be a scheduling point.

^ permalink raw reply	[flat|nested] 57+ messages in thread

* Re: [this_cpu_xx V5 01/19] Introduce this_cpu_ptr() and generic this_cpu_* operations
  2009-10-06 23:52   ` Tejun Heo
@ 2009-10-07 14:23     ` Christoph Lameter
  2009-10-07 15:29       ` Tejun Heo
  0 siblings, 1 reply; 57+ messages in thread
From: Christoph Lameter @ 2009-10-07 14:23 UTC (permalink / raw)
  To: Tejun Heo
  Cc: akpm, linux-kernel, David Howells, Ingo Molnar, Rusty Russell,
	Eric Dumazet, Mel Gorman, Pekka Enberg

On Wed, 7 Oct 2009, Tejun Heo wrote:

> cl@linux-foundation.org wrote:
> > This patch introduces two things: First this_cpu_ptr and then per cpu
> > atomic operations.
>
> percpu#for-next is for the most part a stable tree and 1-12 are
> already in the tree.  Can you please send the changes based on the
> current percpu#for-next?

ok should get to it by the afternoon.


^ permalink raw reply	[flat|nested] 57+ messages in thread

* Re: [this_cpu_xx V5 19/19] SLUB: Experimental new fastpath w/o interrupt disable
  2009-10-07 13:31           ` Mathieu Desnoyers
@ 2009-10-07 14:37             ` Peter Zijlstra
  0 siblings, 0 replies; 57+ messages in thread
From: Peter Zijlstra @ 2009-10-07 14:37 UTC (permalink / raw)
  To: Mathieu Desnoyers
  Cc: cl, akpm, linux-kernel, Pekka Enberg, Tejun Heo, Mel Gorman,
	mingo

On Wed, 2009-10-07 at 09:31 -0400, Mathieu Desnoyers wrote:
> * Peter Zijlstra (peterz@infradead.org) wrote:
> > On Wed, 2009-10-07 at 08:46 -0400, Mathieu Desnoyers wrote:
> > > * local_irq_restore
> > >   -> the interrupt line is low. The scheduler won't be called. There is
> > >      no preempt_check_resched() call.
> > 
> > That would be an issue with all irq disable sections, so I don't think
> > this is actually true.
> > 
> 
> AFAIK, irq disable sections rely on the fact that if you get a timer
> interrupt during this section, the timer interrupt line stays triggered
> for the duration of the irqoff section. Therefore, when interrupts are
> re-enabled, the interrupt kicks in, so does the scheduler.
> 
> This is not the case with the preempt/irqoff dance proposed by
> Christoph.

Ah, you're quite right indeed.


^ permalink raw reply	[flat|nested] 57+ messages in thread

* Re: [this_cpu_xx V5 19/19] SLUB: Experimental new fastpath w/o interrupt disable
  2009-10-07 12:46       ` Mathieu Desnoyers
  2009-10-07 13:01         ` Peter Zijlstra
@ 2009-10-07 14:42         ` Christoph Lameter
  2009-10-07 15:02           ` Mathieu Desnoyers
  1 sibling, 1 reply; 57+ messages in thread
From: Christoph Lameter @ 2009-10-07 14:42 UTC (permalink / raw)
  To: Mathieu Desnoyers
  Cc: Peter Zijlstra, akpm, linux-kernel, Pekka Enberg, Tejun Heo,
	Mel Gorman, mingo

On Wed, 7 Oct 2009, Mathieu Desnoyers wrote:

> Yes, I understood this is what he was doing, but I wonder about the
> impact on the scheduler. If we have:
>
> * Jiffy 1 -- timer interrupt
>
> * preempt disable
> * Jiffy 2 -- timer interrupt
>   -> here, the scheduler is disabled, so the timer interrupt is skipped.
>      The scheduler depends on preempt_check_resched() at preempt_enable()
>      to execute in a bounded amount of time.

preempt disable does not disable interrupts. The timer interrupt will
occur. The scheduler may not reschedule another job on this processor
when the timer interrupt calls the scheduler_tick. It
may not do load balancing.

> Also, preempt_enable here should be replaced with
> preempt_enable_no_resched().

Used to have that in earlier incarnations but I saw a lot of these being
removed lately.




^ permalink raw reply	[flat|nested] 57+ messages in thread

* Re: [this_cpu_xx V5 19/19] SLUB: Experimental new fastpath w/o interrupt disable
  2009-10-07 14:42         ` Christoph Lameter
@ 2009-10-07 15:02           ` Mathieu Desnoyers
  2009-10-07 15:05             ` Christoph Lameter
  0 siblings, 1 reply; 57+ messages in thread
From: Mathieu Desnoyers @ 2009-10-07 15:02 UTC (permalink / raw)
  To: Christoph Lameter
  Cc: Peter Zijlstra, akpm, linux-kernel, Pekka Enberg, Tejun Heo,
	Mel Gorman, mingo

* Christoph Lameter (cl@linux-foundation.org) wrote:
> On Wed, 7 Oct 2009, Mathieu Desnoyers wrote:
> 
> > Yes, I understood this is what he was doing, but I wonder about the
> > impact on the scheduler. If we have:
> >
> > * Jiffy 1 -- timer interrupt
> >
> > * preempt disable
> > * Jiffy 2 -- timer interrupt
> >   -> here, the scheduler is disabled, so the timer interrupt is skipped.
> >      The scheduler depends on preempt_check_resched() at preempt_enable()
> >      to execute in a bounded amount of time.
> 
> preempt disable does not disable interrupts. The timer interrupt will
> occur. The scheduler may not reschedule another job on this processor
> when the timer interrupt calls the scheduler_tick. It
> may not do load balancing.

Yes. All you say here is true. I'm concerned about the _impact_ of this
along with the preempt/irqoff dance you propose. Trimming the following
key points from my execution scenario indeed skips the problem altogether.

Usually, when preemption is disabled, the scheduler refrains from
executing. *Now the important point*: the criterion that bounds the
maximum amount of time before the scheduler will re-check for pending
preemption is when preempt_enable() will re-activate preemption.

But because you run preempt_enable with interrupts off, the scheduler
check is not done. And it's not done when interrupts are re-activated
either.

Please go back to my complete execution scenario, you'll probably see
the light. ;)

Thanks,

Mathieu

> 
> > Also, preempt_enable here should be replaced with
> > preempt_enable_no_resched().
> 
> Used to have that in earlier incarnations but I saw a lot of these being
> removed lately.
> 
> 
> 

-- 
Mathieu Desnoyers
OpenPGP key fingerprint: 8CD5 52C3 8E3C 4140 715F  BA06 3F25 A8FE 3BAE 9A68

^ permalink raw reply	[flat|nested] 57+ messages in thread

* Re: [this_cpu_xx V5 19/19] SLUB: Experimental new fastpath w/o interrupt disable
  2009-10-07 15:02           ` Mathieu Desnoyers
@ 2009-10-07 15:05             ` Christoph Lameter
  2009-10-07 15:19               ` Mathieu Desnoyers
  0 siblings, 1 reply; 57+ messages in thread
From: Christoph Lameter @ 2009-10-07 15:05 UTC (permalink / raw)
  To: Mathieu Desnoyers
  Cc: Peter Zijlstra, akpm, linux-kernel, Pekka Enberg, Tejun Heo,
	Mel Gorman, mingo

On Wed, 7 Oct 2009, Mathieu Desnoyers wrote:

> Usually, when preemption is disabled, the scheduler refrains from
> executing. *Now the important point*: the criterion that bounds the
> maximum amount of time before the scheduler will re-check for pending
> preemption is when preempt_enable() will re-activate preemption.

Which creates additional overhead in the allocator.

> But because you run preempt_enable with interrupts off, the scheduler
> check is not done. And it's not done when interrupts are re-activated
> either.

Ok so we should be moving the preempt_enable after the irq enable. Then we
will call into the scheduler at the end of the slow path. This may add
significantly more overhead than we had before when we simply disabled
and enabled interrupts...



^ permalink raw reply	[flat|nested] 57+ messages in thread

* Re: [this_cpu_xx V5 19/19] SLUB: Experimental new fastpath w/o interrupt disable
  2009-10-07 15:05             ` Christoph Lameter
@ 2009-10-07 15:19               ` Mathieu Desnoyers
  2009-10-07 15:21                 ` Christoph Lameter
  0 siblings, 1 reply; 57+ messages in thread
From: Mathieu Desnoyers @ 2009-10-07 15:19 UTC (permalink / raw)
  To: Christoph Lameter
  Cc: Peter Zijlstra, akpm, linux-kernel, Pekka Enberg, Tejun Heo,
	Mel Gorman, mingo

* Christoph Lameter (cl@linux-foundation.org) wrote:
> On Wed, 7 Oct 2009, Mathieu Desnoyers wrote:
> 
> > Usually, when preemption is disabled, the scheduler refrains from
> > executing. *Now the important point*: the criterion that bounds the
> > maximum amount of time before the scheduler will re-check for pending
> > preemption is when preempt_enable() will re-activate preemption.
> 
> Which creates additional overhead in the allocator.
> 

Which we like to keep as low as possible, I agree.

> > But because you run preempt_enable with interrupts off, the scheduler
> > check is not done. And it's not done when interrupts are re-activated
> > either.
> 
> Ok so we should be moving the preempt_enable after the irq enable. Then we
> will call into the scheduler at the end of the slow path. This may add
> significantly more overhead than we had before when we simply disabled
> and enabled interrupts...
> 

You are already calling the scheduler when ending the _fast_ path. I
don't see the problem with calling it when you end the slow path
execution.

Mathieu

-- 
Mathieu Desnoyers
OpenPGP key fingerprint: 8CD5 52C3 8E3C 4140 715F  BA06 3F25 A8FE 3BAE 9A68

^ permalink raw reply	[flat|nested] 57+ messages in thread

* Re: [this_cpu_xx V5 19/19] SLUB: Experimental new fastpath w/o interrupt disable
  2009-10-07 15:19               ` Mathieu Desnoyers
@ 2009-10-07 15:21                 ` Christoph Lameter
  2009-10-07 15:41                   ` Mathieu Desnoyers
  2009-10-08  7:52                   ` Peter Zijlstra
  0 siblings, 2 replies; 57+ messages in thread
From: Christoph Lameter @ 2009-10-07 15:21 UTC (permalink / raw)
  To: Mathieu Desnoyers
  Cc: Peter Zijlstra, akpm, linux-kernel, Pekka Enberg, Tejun Heo,
	Mel Gorman, mingo

On Wed, 7 Oct 2009, Mathieu Desnoyers wrote:

> You are already calling the scheduler when ending the _fast_ path. I
> don't see the problem with calling it when you end the slow path
> execution.

Well yes that gives rise to the thought of using

	preempt_enable_no_resched()

at the end of the fastpath as well. Constant calls into the scheduler
could be a major performance issue. I don't notice it here since I usually
cannot afford the preempt overhead and build kernels without support for
it.



^ permalink raw reply	[flat|nested] 57+ messages in thread

* Re: [this_cpu_xx V5 19/19] SLUB: Experimental new fastpath w/o interrupt disable
  2009-10-07  2:54   ` Mathieu Desnoyers
  2009-10-07  9:11     ` Peter Zijlstra
@ 2009-10-07 15:25     ` Christoph Lameter
  1 sibling, 0 replies; 57+ messages in thread
From: Christoph Lameter @ 2009-10-07 15:25 UTC (permalink / raw)
  To: Mathieu Desnoyers; +Cc: akpm, linux-kernel, Pekka Enberg, Tejun Heo, Mel Gorman

On Tue, 6 Oct 2009, Mathieu Desnoyers wrote:

> Btw, my name is "Math-ieu" ;) I'm not offended by you fat-fingering my
> name, it's just rather funny. :)

Sorry. But I got this right further down below...

^ permalink raw reply	[flat|nested] 57+ messages in thread

* Re: [this_cpu_xx V5 01/19] Introduce this_cpu_ptr() and generic this_cpu_* operations
  2009-10-07 14:23     ` Christoph Lameter
@ 2009-10-07 15:29       ` Tejun Heo
  0 siblings, 0 replies; 57+ messages in thread
From: Tejun Heo @ 2009-10-07 15:29 UTC (permalink / raw)
  To: Christoph Lameter
  Cc: akpm, linux-kernel, David Howells, Ingo Molnar, Rusty Russell,
	Eric Dumazet, Mel Gorman, Pekka Enberg

Christoph Lameter wrote:
> On Wed, 7 Oct 2009, Tejun Heo wrote:
> 
>> cl@linux-foundation.org wrote:
>>> This patch introduces two things: First this_cpu_ptr and then per cpu
>>> atomic operations.
>> percpu#for-next is for the most part a stable tree and 1-12 are
>> already in the tree.  Can you please send the changes based on the
>> current percpu#for-next?
> 
> ok should get to it by the afternoon.

thanks!

-- 
tejun

^ permalink raw reply	[flat|nested] 57+ messages in thread

* Re: [this_cpu_xx V5 19/19] SLUB: Experimental new fastpath w/o interrupt disable
  2009-10-07 15:21                 ` Christoph Lameter
@ 2009-10-07 15:41                   ` Mathieu Desnoyers
  2009-10-07 16:42                     ` Christoph Lameter
  2009-10-08  7:52                   ` Peter Zijlstra
  1 sibling, 1 reply; 57+ messages in thread
From: Mathieu Desnoyers @ 2009-10-07 15:41 UTC (permalink / raw)
  To: Christoph Lameter
  Cc: Peter Zijlstra, akpm, linux-kernel, Pekka Enberg, Tejun Heo,
	Mel Gorman, mingo

* Christoph Lameter (cl@linux-foundation.org) wrote:
> On Wed, 7 Oct 2009, Mathieu Desnoyers wrote:
> 
> > You are already calling the scheduler when ending the _fast_ path. I
> > don't see the problem with calling it when you end the slow path
> > execution.
> 
> Well yes that gives rise to the thought of using
> 
> 	preempt_enable_no_resched()
> 
> at the end of the fastpath as well. Constant calls into the scheduler
> could be a major performance issue

... but the opposite is a major RT behavior killer.

preempt_check_resched is basically:

a test TIF_NEED_RESCHED
  if true, call to preempt_schedule

So, it's not a call to the scheduler at each and every preempt_enable.
It's _only_ if an interrupt came in during the preempt off section and
the scheduler considered that it should re-schedule soon.
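
(For reference, roughly what preempt_check_resched() expands to on
CONFIG_PREEMPT kernels of this era -- paraphrased, not an exact quote:)

#define preempt_check_resched() \
do { \
	if (unlikely(test_thread_flag(TIF_NEED_RESCHED))) \
		preempt_schedule(); \
} while (0)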

I really don't see what's bothering you here. Testing a thread flag is
incredibly cheap. That's what is typically added to your fast path.

So, correct behavior would be:

preempt disable()
fast path attempt
  if (fast path already taken) {
    local_irq_save();
    slow path.
    local_irq_restore();
  }
preempt_enable()
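
(One possible concrete shape of the above -- function and field names are
taken from the patch, but the control flow is assumed for illustration and
__slab_alloc() is assumed to leave ->active alone here. Preemption stays off
across the whole attempt, interrupts are only masked around the slow path,
and the resched check in preempt_enable() runs with interrupts on:)

static void *slab_alloc_sketch(struct kmem_cache *s, gfp_t gfpflags,
			       int node, unsigned long addr)
{
	void **object;
	unsigned long flags;

	preempt_disable();
	irqsafe_cpu_inc(s->cpu_slab->active);
	object = __this_cpu_read(s->cpu_slab->freelist);
	if (unlikely(!object || !node_match(s, node) ||
			__this_cpu_read(s->cpu_slab->active))) {
		/* Fastpath busy or empty: take the irq-off slow path */
		local_irq_save(flags);
		object = __slab_alloc(s, gfpflags, node, addr);
		local_irq_restore(flags);
	} else {
		__this_cpu_write(s->cpu_slab->freelist,
				get_freepointer(s, object));
	}
	irqsafe_cpu_dec(s->cpu_slab->active);
	preempt_enable();		/* pending reschedule handled here */
	return object;
}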


Mathieu

> . I don't notice it here since I usually
> cannot afford the preempt overhead and build kernels without support for
> it.
> 
> 

-- 
Mathieu Desnoyers
OpenPGP key fingerprint: 8CD5 52C3 8E3C 4140 715F  BA06 3F25 A8FE 3BAE 9A68

^ permalink raw reply	[flat|nested] 57+ messages in thread

* Re: [this_cpu_xx V5 19/19] SLUB: Experimental new fastpath w/o interrupt disable
  2009-10-07 15:41                   ` Mathieu Desnoyers
@ 2009-10-07 16:42                     ` Christoph Lameter
  2009-10-07 17:12                       ` Mathieu Desnoyers
  0 siblings, 1 reply; 57+ messages in thread
From: Christoph Lameter @ 2009-10-07 16:42 UTC (permalink / raw)
  To: Mathieu Desnoyers
  Cc: Peter Zijlstra, akpm, linux-kernel, Pekka Enberg, Tejun Heo,
	Mel Gorman, mingo

On Wed, 7 Oct 2009, Mathieu Desnoyers wrote:

> preempt_check_resched is basically:
>
> a test TIF_NEED_RESCHED
>   if true, call to preempt_schedule

You did not mention the effect of incrementing the preempt counter and
the barrier(). Adds an additional cacheline to a very hot OS path.
Possibly register effects.

> I really don't see what's bothering you here. Testing a thread flag is
> incredibly cheap. That's what is typically added to your fast path.

I am trying to get rid of all unnecessary overhead. These "incredibly
cheap" tricks en masse have caused lots of regressions. And the allocator
hotpaths are overloaded with these "incredibly cheap" checks already.

> So, correct behavior would be:
>
> preempt disable()
> fast path attempt
>   if (fast path already taken) {
>     local_irq_save();
>     slow path.
>     local_irq_restore();
>   }
> preempt_enable()

Ok. If you have to use preempt then you have to suffer I guess..


^ permalink raw reply	[flat|nested] 57+ messages in thread

* Re: [this_cpu_xx V5 19/19] SLUB: Experimental new fastpath w/o interrupt disable
  2009-10-07 16:42                     ` Christoph Lameter
@ 2009-10-07 17:12                       ` Mathieu Desnoyers
  0 siblings, 0 replies; 57+ messages in thread
From: Mathieu Desnoyers @ 2009-10-07 17:12 UTC (permalink / raw)
  To: Christoph Lameter
  Cc: Peter Zijlstra, akpm, linux-kernel, Pekka Enberg, Tejun Heo,
	Mel Gorman, mingo

* Christoph Lameter (cl@linux-foundation.org) wrote:
> On Wed, 7 Oct 2009, Mathieu Desnoyers wrote:
> 
> > preempt_check_resched is basically:
> >
> > a test TIF_NEED_RESCHED
> >   if true, call to preempt_schedule
> 
> You did not mention the effect of incrementing the preempt counter and
> the barrier(). Adds an additional cacheline to a very hot OS path.
> Possibly register effects.
> 

What you say applies to preempt_enable(). I was describing
preempt_check_resched above, which involves no compiler barrier nor
increment whatsoever.

By the way, the barrier() you are talking about is in
preempt_enable_no_resched(), the very primitive you are considering
using to save these precious cycles.
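
(Roughly what these expand to on CONFIG_PREEMPT kernels of this era,
paraphrased from include/linux/preempt.h -- note where the barrier()s and
the resched check sit:)

#define preempt_disable() \
do { \
	inc_preempt_count(); \
	barrier(); \
} while (0)

#define preempt_enable_no_resched() \
do { \
	barrier(); \
	dec_preempt_count(); \
} while (0)

#define preempt_enable() \
do { \
	preempt_enable_no_resched(); \
	barrier(); \
	preempt_check_resched(); \
} while (0)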


> > I really don't see what's bothering you here. Testing a thread flag is
> > incredibly cheap. That's what is typically added to your fast path.
> 
> I am trying to get rid of all unnecessary overhead. These "incredibly
> cheap" tricks en masse have caused lots of regressions. And the allocator
> hotpaths are overloaded with these "incredibly cheap" checks already.
> 
> > So, correct behavior would be:
> >
> > preempt disable()
> > fast path attempt
> >   if (fast path already taken) {
> >     local_irq_save();
> >     slow path.
> >     local_irq_restore();
> >   }
> > preempt_enable()
> 
> Ok. If you have to use preempt then you have to suffer I guess..
> 

Yes. A user enabling full preemption should be aware that it has a 
performance footprint.

By the way, from what I remember of the slub allocator, you might find 
the following more suited for your needs. I remember that the slow path
sometimes needs to re-enable interrupts, so:

preempt disable()
fast path attempt
  if (fast path already taken) {
    local_irq_save();
    preempt_enable_no_resched();
    slow path {
      if (!flags & GFP_ATOMIC) {
        local_irq_enable();
        preempt_check_resched();
        ...
        local_irq_disable();
      }
    }
    local_irq_restore();
    preempt_check_resched();
    return;
  }
preempt_enable()

This should work, be efficient and manage to ensure scheduler RT
correctness.

Mathieu

-- 
Mathieu Desnoyers
OpenPGP key fingerprint: 8CD5 52C3 8E3C 4140 715F  BA06 3F25 A8FE 3BAE 9A68

^ permalink raw reply	[flat|nested] 57+ messages in thread

* Re: [this_cpu_xx V5 19/19] SLUB: Experimental new fastpath w/o interrupt disable
  2009-10-07 15:21                 ` Christoph Lameter
  2009-10-07 15:41                   ` Mathieu Desnoyers
@ 2009-10-08  7:52                   ` Peter Zijlstra
  2009-10-08 12:44                     ` Mathieu Desnoyers
  2009-10-08 16:11                     ` Christoph Lameter
  1 sibling, 2 replies; 57+ messages in thread
From: Peter Zijlstra @ 2009-10-08  7:52 UTC (permalink / raw)
  To: Christoph Lameter
  Cc: Mathieu Desnoyers, akpm, linux-kernel, Pekka Enberg, Tejun Heo,
	Mel Gorman, mingo

On Wed, 2009-10-07 at 11:21 -0400, Christoph Lameter wrote:
> On Wed, 7 Oct 2009, Mathieu Desnoyers wrote:
> 
> > You are already calling the scheduler when ending the _fast_ path. I
> > don't see the problem with calling it when you end the slow path
> > execution.
> 
> Well yes that gives rise to the thought of using
> 
> 	preempt_enable_no_resched()

NACK, delaying the reschedule is not an option

^ permalink raw reply	[flat|nested] 57+ messages in thread

* Re: [this_cpu_xx V5 19/19] SLUB: Experimental new fastpath w/o interrupt disable
  2009-10-08  7:52                   ` Peter Zijlstra
@ 2009-10-08 12:44                     ` Mathieu Desnoyers
  2009-10-08 12:53                       ` Peter Zijlstra
  2009-10-08 16:11                     ` Christoph Lameter
  1 sibling, 1 reply; 57+ messages in thread
From: Mathieu Desnoyers @ 2009-10-08 12:44 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Christoph Lameter, akpm, linux-kernel, Pekka Enberg, Tejun Heo,
	Mel Gorman, mingo

* Peter Zijlstra (peterz@infradead.org) wrote:
> On Wed, 2009-10-07 at 11:21 -0400, Christoph Lameter wrote:
> > On Wed, 7 Oct 2009, Mathieu Desnoyers wrote:
> > 
> > > You are already calling the scheduler when ending the _fast_ path. I
> > > don't see the problem with calling it when you end the slow path
> > > execution.
> > 
> > Well yes that gives rise to the thought of using
> > 
> > 	preempt_enable_no_resched()
> 
> NACK, delaying the reschedule is not an option

Even if only done with interrupts off, and the resched check is called after
each irq enable following this critical section ? I'd like to understand
the reason behind your rejection for this specific case.

Thanks,

Mathieu

-- 
Mathieu Desnoyers
OpenPGP key fingerprint: 8CD5 52C3 8E3C 4140 715F  BA06 3F25 A8FE 3BAE 9A68

^ permalink raw reply	[flat|nested] 57+ messages in thread

* Re: [this_cpu_xx V5 19/19] SLUB: Experimental new fastpath w/o interrupt disable
  2009-10-08 12:44                     ` Mathieu Desnoyers
@ 2009-10-08 12:53                       ` Peter Zijlstra
  2009-10-08 16:17                         ` Christoph Lameter
  2009-10-08 17:22                         ` Mathieu Desnoyers
  0 siblings, 2 replies; 57+ messages in thread
From: Peter Zijlstra @ 2009-10-08 12:53 UTC (permalink / raw)
  To: Mathieu Desnoyers
  Cc: Christoph Lameter, akpm, linux-kernel, Pekka Enberg, Tejun Heo,
	Mel Gorman, mingo

On Thu, 2009-10-08 at 08:44 -0400, Mathieu Desnoyers wrote:
> Even if only done with interrupts off, and the resched check is called after
> each irq enable following this critical section ? I'd like to understand
> the reason behind your rejection for this specific case.

No, the thing you proposed:

> preempt disable()
> fast path attempt
>   if (fast path already taken) {
>     local_irq_save();
>     preempt_enable_no_resched();
>     slow path {
>       if (!flags & GFP_ATOMIC) {
>         local_irq_enable();
>         preempt_check_resched();
>         ...
>         local_irq_disable();
>       }
>     }
>     local_irq_restore();
>     preempt_check_resched();
>     return;
>   }
> preempt_enable()

Seems ok.

I just don't get why Christoph is getting all upset about the
need_resched() check in preempt_enable(), it's still cheaper than poking
at the interrupt flags.

^ permalink raw reply	[flat|nested] 57+ messages in thread

* Re: [this_cpu_xx V5 19/19] SLUB: Experimental new fastpath w/o interrupt disable
  2009-10-08  7:52                   ` Peter Zijlstra
  2009-10-08 12:44                     ` Mathieu Desnoyers
@ 2009-10-08 16:11                     ` Christoph Lameter
  2009-10-08 17:17                       ` Mathieu Desnoyers
  1 sibling, 1 reply; 57+ messages in thread
From: Christoph Lameter @ 2009-10-08 16:11 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Mathieu Desnoyers, akpm, linux-kernel, Pekka Enberg, Tejun Heo,
	Mel Gorman, mingo

On Thu, 8 Oct 2009, Peter Zijlstra wrote:

> > 	preempt_enable_no_resched()
>
> NACK, delaying the reschedule is not an option

I ended up just doing a preempt_disable() at the beginning and a
preempt_enable() at the end. That should be easily reviewable.


^ permalink raw reply	[flat|nested] 57+ messages in thread

* Re: [this_cpu_xx V5 19/19] SLUB: Experimental new fastpath w/o interrupt disable
  2009-10-08 12:53                       ` Peter Zijlstra
@ 2009-10-08 16:17                         ` Christoph Lameter
  2009-10-08 17:22                         ` Mathieu Desnoyers
  1 sibling, 0 replies; 57+ messages in thread
From: Christoph Lameter @ 2009-10-08 16:17 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Mathieu Desnoyers, akpm, linux-kernel, Pekka Enberg, Tejun Heo,
	Mel Gorman, mingo

On Thu, 8 Oct 2009, Peter Zijlstra wrote:

> I just don't get why Christoph is getting all upset about the
> need_resched() check in preempt_enable(), it's still cheaper than poking
> at the interrupt flags.

Preempt causes too many performance issues. I keep on thinking that I
could make it easier on people to use it. Guess I better leave it simple
then.




^ permalink raw reply	[flat|nested] 57+ messages in thread

* Re: [this_cpu_xx V5 19/19] SLUB: Experimental new fastpath w/o interrupt disable
  2009-10-08 16:11                     ` Christoph Lameter
@ 2009-10-08 17:17                       ` Mathieu Desnoyers
  2009-10-08 17:44                         ` Christoph Lameter
  0 siblings, 1 reply; 57+ messages in thread
From: Mathieu Desnoyers @ 2009-10-08 17:17 UTC (permalink / raw)
  To: Christoph Lameter
  Cc: Peter Zijlstra, akpm, linux-kernel, Pekka Enberg, Tejun Heo,
	Mel Gorman, mingo

* Christoph Lameter (cl@linux-foundation.org) wrote:
> On Thu, 8 Oct 2009, Peter Zijlstra wrote:
> 
> > > 	preempt_enable_no_resched()
> >
> > NACK, delaying the reschedule is not an option
> 
> I ended up just doing a preempt_disable() at the beginning and a
> preempt_enable() at the end. That should be easily reviewable.
> 

Then how do you re-enable preemption in the slow path when you need to
block ?

Mathieu


-- 
Mathieu Desnoyers
OpenPGP key fingerprint: 8CD5 52C3 8E3C 4140 715F  BA06 3F25 A8FE 3BAE 9A68

^ permalink raw reply	[flat|nested] 57+ messages in thread

* Re: [this_cpu_xx V5 19/19] SLUB: Experimental new fastpath w/o interrupt disable
  2009-10-08 12:53                       ` Peter Zijlstra
  2009-10-08 16:17                         ` Christoph Lameter
@ 2009-10-08 17:22                         ` Mathieu Desnoyers
  1 sibling, 0 replies; 57+ messages in thread
From: Mathieu Desnoyers @ 2009-10-08 17:22 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Christoph Lameter, akpm, linux-kernel, Pekka Enberg, Tejun Heo,
	Mel Gorman, mingo

* Peter Zijlstra (peterz@infradead.org) wrote:
> On Thu, 2009-10-08 at 08:44 -0400, Mathieu Desnoyers wrote:
> > Even if only done with interrupt off, and check resched is called after
> > each irq enable following this critical section ? I'd like to understand
> > the reason behind your rejection for this specific case.
> 
> No, the thing you proposed:
> 
> > preempt disable()
> > fast path attempt
> >   if (fast path already taken) {
> >     local_irq_save();
> >     preempt_enable_no_resched();
> >     slow path {
> >       if (!flags & GFP_ATOMIC) {
> >         local_irq_enable();
> >         preempt_check_resched();
> >         ...
> >         local_irq_disable();
> >       }
> >     }
> >     local_irq_restore();
> >     preempt_check_resched();
> >     return;
> >   }
> > preempt_enable()
> 
> Seems ok.
> 
> I just don't get why Christoph is getting all upset about the
> need_resched() check in preempt_enable(), it's still cheaper than poking
> at the interrupt flags.

I agree with you. The need_resched() check is incredibly cheap. And if
Christoph still complains about the compiler barrier in
preempt_enable_no_resched()/preempt_disable(), then I think he should consider the fact that
the compiler does not perform cross-function optimizations, and consider
putting the preempt disable/enable statements close to function
boundaries. Therefore, the impact in terms of compiler optimization
restrictions should be minimal.

The scheme I proposed above should be OK in terms of scheduler effect
and permit to deal with re-enabling preemption in the slow path
appropriately.

Mathieu


-- 
Mathieu Desnoyers
OpenPGP key fingerprint: 8CD5 52C3 8E3C 4140 715F  BA06 3F25 A8FE 3BAE 9A68

^ permalink raw reply	[flat|nested] 57+ messages in thread

* Re: [this_cpu_xx V5 19/19] SLUB: Experimental new fastpath w/o interrupt disable
  2009-10-08 17:17                       ` Mathieu Desnoyers
@ 2009-10-08 17:44                         ` Christoph Lameter
  2009-10-08 19:17                           ` Mathieu Desnoyers
  0 siblings, 1 reply; 57+ messages in thread
From: Christoph Lameter @ 2009-10-08 17:44 UTC (permalink / raw)
  To: Mathieu Desnoyers
  Cc: Peter Zijlstra, akpm, linux-kernel, Pekka Enberg, Tejun Heo,
	Mel Gorman, mingo

On Thu, 8 Oct 2009, Mathieu Desnoyers wrote:

> Then how do you re-enable preemption in the slow path when you need to
> block ?

preempt_enable();

before the page allocator call and

preempt_disable();

afterwards.


^ permalink raw reply	[flat|nested] 57+ messages in thread

* Re: [this_cpu_xx V5 19/19] SLUB: Experimental new fastpath w/o interrupt disable
  2009-10-08 17:44                         ` Christoph Lameter
@ 2009-10-08 19:17                           ` Mathieu Desnoyers
  2009-10-08 19:21                             ` Christoph Lameter
  0 siblings, 1 reply; 57+ messages in thread
From: Mathieu Desnoyers @ 2009-10-08 19:17 UTC (permalink / raw)
  To: Christoph Lameter
  Cc: Peter Zijlstra, akpm, linux-kernel, Pekka Enberg, Tejun Heo,
	Mel Gorman, mingo

* Christoph Lameter (cl@linux-foundation.org) wrote:
> On Thu, 8 Oct 2009, Mathieu Desnoyers wrote:
> 
> > Then how do you re-enable preemption in the slow path when you need to
> > block ?
> 
> preempt_enable();
> 
> before the page allocator call and
> 
> preempt_disable();
> 
> afterwards.
> 

OK, sounds good,

Mathieu


-- 
Mathieu Desnoyers
OpenPGP key fingerprint: 8CD5 52C3 8E3C 4140 715F  BA06 3F25 A8FE 3BAE 9A68

^ permalink raw reply	[flat|nested] 57+ messages in thread

* Re: [this_cpu_xx V5 19/19] SLUB: Experimental new fastpath w/o interrupt disable
  2009-10-08 19:17                           ` Mathieu Desnoyers
@ 2009-10-08 19:21                             ` Christoph Lameter
  2009-10-08 20:37                               ` Mathieu Desnoyers
  0 siblings, 1 reply; 57+ messages in thread
From: Christoph Lameter @ 2009-10-08 19:21 UTC (permalink / raw)
  To: Mathieu Desnoyers
  Cc: Peter Zijlstra, akpm, linux-kernel, Pekka Enberg, Tejun Heo,
	Mel Gorman, mingo

On Thu, 8 Oct 2009, Mathieu Desnoyers wrote:

> OK, sounds good,

Then here the full patch for review (vs. percpu-next):


From: Christoph Lameter <cl@linux-foundation.org>
Subject: SLUB: Experimental new fastpath w/o interrupt disable

This is a bit of a different tack on things than the last version provided
by Mathieu.

Instead of using a cmpxchg we keep a state variable in the per cpu structure
that is incremented when we enter the hot path. We can then detect that
a thread is in the fastpath and fall back to alternate allocation / free
technique that bypasses fastpath caching.

V1->V2:
- Use safe preempt_enable / disable.
- Enable preempt before calling into the page allocator
- Checkup on hotpath activity changes when returning from page allocator.
- Add barriers.

Todo:
- Verify that this is really safe
- Is this a benefit?

Cc: Mathieu Desnoyers <mathieu.desnoyers@polymtl.ca>
Cc: Pekka Enberg <penberg@cs.helsinki.fi>
Signed-off-by: Christoph Lameter <cl@linux-foundation.org>


---
 include/linux/slub_def.h |    1
 mm/slub.c                |  112 +++++++++++++++++++++++++++++++++++++++--------
 2 files changed, 96 insertions(+), 17 deletions(-)

Index: linux-2.6/include/linux/slub_def.h
===================================================================
--- linux-2.6.orig/include/linux/slub_def.h	2009-10-08 11:35:59.000000000 -0500
+++ linux-2.6/include/linux/slub_def.h	2009-10-08 11:35:59.000000000 -0500
@@ -38,6 +38,7 @@ struct kmem_cache_cpu {
 	void **freelist;	/* Pointer to first free per cpu object */
 	struct page *page;	/* The slab from which we are allocating */
 	int node;		/* The node of the page (or -1 for debug) */
+	int active;		/* Active fastpaths */
 #ifdef CONFIG_SLUB_STATS
 	unsigned stat[NR_SLUB_STAT_ITEMS];
 #endif
Index: linux-2.6/mm/slub.c
===================================================================
--- linux-2.6.orig/mm/slub.c	2009-10-08 11:35:59.000000000 -0500
+++ linux-2.6/mm/slub.c	2009-10-08 14:03:22.000000000 -0500
@@ -1606,7 +1606,14 @@ static void *__slab_alloc(struct kmem_ca
 			  unsigned long addr)
 {
 	void **object;
-	struct page *page = __this_cpu_read(s->cpu_slab->page);
+	struct page *page;
+	unsigned long flags;
+	int hotpath;
+
+	local_irq_save(flags);
+	hotpath = __this_cpu_read(s->cpu_slab->active) != 0;
+	__this_cpu_dec(s->cpu_slab->active);	/* Drop count from hotpath */
+	page = __this_cpu_read(s->cpu_slab->page);

 	/* We handle __GFP_ZERO in the caller */
 	gfpflags &= ~__GFP_ZERO;
@@ -1627,12 +1634,24 @@ load_freelist:
 	if (unlikely(SLABDEBUG && PageSlubDebug(page)))
 		goto debug;

-	__this_cpu_write(s->cpu_slab->node, page_to_nid(page));
-	__this_cpu_write(s->cpu_slab->freelist, get_freepointer(s, object));
-	page->inuse = page->objects;
-	page->freelist = NULL;
+	if (unlikely(hotpath)) {
+		/* Object on page free list available and hotpath busy */
+		page->inuse++;
+		page->freelist = get_freepointer(s, object);
+
+	} else {
+
+		/* Prepare new list of objects for hotpath */
+		__this_cpu_write(s->cpu_slab->freelist, get_freepointer(s, object));
+		page->inuse = page->objects;
+		page->freelist = NULL;
+		__this_cpu_write(s->cpu_slab->node, page_to_nid(page));
+
+	}
+
 unlock_out:
 	slab_unlock(page);
+	local_irq_restore(flags);
 	stat(s, ALLOC_SLOWPATH);
 	return object;

@@ -1642,21 +1661,38 @@ another_slab:
 new_slab:
 	page = get_partial(s, gfpflags, node);
 	if (page) {
-		__this_cpu_write(s->cpu_slab->page, page);
 		stat(s, ALLOC_FROM_PARTIAL);
+
+		if (hotpath)
+			goto hot_lock;
+
+		__this_cpu_write(s->cpu_slab->page, page);
 		goto load_freelist;
 	}

 	if (gfpflags & __GFP_WAIT)
 		local_irq_enable();

+	preempt_enable();
 	page = new_slab(s, gfpflags, node);
+	preempt_disable();
+
+	/*
+	 * We have already decremented our count. Someone else
+	 * could be running right now or we were moved to a
+	 * processor that is in the hotpath. So check against -1.
+	 */
+	hotpath = __this_cpu_read(s->cpu_slab->active) != -1;

 	if (gfpflags & __GFP_WAIT)
 		local_irq_disable();

 	if (page) {
 		stat(s, ALLOC_SLAB);
+
+		if (hotpath)
+			 goto hot_no_lock;
+
 		if (__this_cpu_read(s->cpu_slab->page))
 			flush_slab(s, __this_cpu_ptr(s->cpu_slab));
 		slab_lock(page);
@@ -1664,9 +1700,13 @@ new_slab:
 		__this_cpu_write(s->cpu_slab->page, page);
 		goto load_freelist;
 	}
+
+	local_irq_restore(flags);
+
 	if (!(gfpflags & __GFP_NOWARN) && printk_ratelimit())
 		slab_out_of_memory(s, gfpflags, node);
 	return NULL;
+
 debug:
 	if (!alloc_debug_processing(s, page, object, addr))
 		goto another_slab;
@@ -1675,6 +1715,20 @@ debug:
 	page->freelist = get_freepointer(s, object);
 	__this_cpu_write(s->cpu_slab->node, -1);
 	goto unlock_out;
+
+	/*
+	 * Hotpath is busy and we need to avoid touching
+	 * hotpath variables
+	 */
+hot_no_lock:
+	slab_lock(page);
+
+hot_lock:
+	__ClearPageSlubFrozen(page);
+	if (get_freepointer(s, page->freelist))
+		/* Cannot put page into the hotpath. Instead back to partial */
+		add_partial(get_node(s, page_to_nid(page)), page, 0);
+	goto load_freelist;
 }

 /*
@@ -1691,7 +1745,6 @@ static __always_inline void *slab_alloc(
 		gfp_t gfpflags, int node, unsigned long addr)
 {
 	void **object;
-	unsigned long flags;

 	gfpflags &= gfp_allowed_mask;

@@ -1701,18 +1754,27 @@ static __always_inline void *slab_alloc(
 	if (should_failslab(s->objsize, gfpflags))
 		return NULL;

-	local_irq_save(flags);
+	preempt_disable();
+
+	irqsafe_cpu_inc(s->cpu_slab->active);
+	barrier();
 	object = __this_cpu_read(s->cpu_slab->freelist);
-	if (unlikely(!object || !node_match(s, node)))
+	if (unlikely(!object || !node_match(s, node) ||
+			__this_cpu_read(s->cpu_slab->active)))

 		object = __slab_alloc(s, gfpflags, node, addr);

 	else {
+
 		__this_cpu_write(s->cpu_slab->freelist,
 			get_freepointer(s, object));
+		barrier();
+		irqsafe_cpu_dec(s->cpu_slab->active);
 		stat(s, ALLOC_FASTPATH);
+
 	}
-	local_irq_restore(flags);
+
+	preempt_enable();

 	if (unlikely((gfpflags & __GFP_ZERO) && object))
 		memset(object, 0, s->objsize);
@@ -1777,7 +1839,9 @@ static void __slab_free(struct kmem_cach
 {
 	void *prior;
 	void **object = (void *)x;
+	unsigned long flags;

+	local_irq_save(flags);
 	stat(s, FREE_SLOWPATH);
 	slab_lock(page);

@@ -1809,6 +1873,7 @@ checks_ok:

 out_unlock:
 	slab_unlock(page);
+	local_irq_restore(flags);
 	return;

 slab_empty:
@@ -1820,6 +1885,7 @@ slab_empty:
 		stat(s, FREE_REMOVE_PARTIAL);
 	}
 	slab_unlock(page);
+	local_irq_restore(flags);
 	stat(s, FREE_SLAB);
 	discard_slab(s, page);
 	return;
@@ -1845,24 +1911,31 @@ static __always_inline void slab_free(st
 			struct page *page, void *x, unsigned long addr)
 {
 	void **object = (void *)x;
-	unsigned long flags;

 	kmemleak_free_recursive(x, s->flags);
-	local_irq_save(flags);
+	preempt_disable();
 	kmemcheck_slab_free(s, object, s->objsize);
 	debug_check_no_locks_freed(object, s->objsize);
 	if (!(s->flags & SLAB_DEBUG_OBJECTS))
 		debug_check_no_obj_freed(object, s->objsize);

+	irqsafe_cpu_inc(s->cpu_slab->active);
+	barrier();
 	if (likely(page == __this_cpu_read(s->cpu_slab->page) &&
-			__this_cpu_read(s->cpu_slab->node) >= 0)) {
-		set_freepointer(s, object, __this_cpu_read(s->cpu_slab->freelist));
+			__this_cpu_read(s->cpu_slab->node) >= 0) &&
+			!__this_cpu_read(s->cpu_slab->active)) {
+		set_freepointer(s, object,
+			__this_cpu_read(s->cpu_slab->freelist));
 		__this_cpu_write(s->cpu_slab->freelist, object);
+		barrier();
+		irqsafe_cpu_dec(s->cpu_slab->active);
+		preempt_enable();
 		stat(s, FREE_FASTPATH);
-	} else
+	} else {
+		irqsafe_cpu_dec(s->cpu_slab->active);
+		preempt_enable();
 		__slab_free(s, page, x, addr);
-
-	local_irq_restore(flags);
+	}
 }

 void kmem_cache_free(struct kmem_cache *s, void *x)
@@ -2064,6 +2137,8 @@ static DEFINE_PER_CPU(struct kmem_cache_

 static inline int alloc_kmem_cache_cpus(struct kmem_cache *s, gfp_t flags)
 {
+	int cpu;
+
 	if (s < kmalloc_caches + KMALLOC_CACHES && s >= kmalloc_caches)
 		/*
 		 * Boot time creation of the kmalloc array. Use static per cpu data
@@ -2073,6 +2148,9 @@ static inline int alloc_kmem_cache_cpus(
 	else
 		s->cpu_slab =  alloc_percpu(struct kmem_cache_cpu);

+	for_each_possible_cpu(cpu)
+		per_cpu_ptr(s->cpu_slab, cpu)->active = -1;
+
 	if (!s->cpu_slab)
 		return 0;


^ permalink raw reply	[flat|nested] 57+ messages in thread

* Re: [this_cpu_xx V5 19/19] SLUB: Experimental new fastpath w/o interrupt disable
  2009-10-08 19:21                             ` Christoph Lameter
@ 2009-10-08 20:37                               ` Mathieu Desnoyers
  2009-10-08 21:08                                 ` Christoph Lameter
  0 siblings, 1 reply; 57+ messages in thread
From: Mathieu Desnoyers @ 2009-10-08 20:37 UTC (permalink / raw)
  To: Christoph Lameter
  Cc: Peter Zijlstra, akpm, linux-kernel, Pekka Enberg, Tejun Heo,
	Mel Gorman, mingo

* Christoph Lameter (cl@linux-foundation.org) wrote:
> On Thu, 8 Oct 2009, Mathieu Desnoyers wrote:
> 
> > OK, sounds good,
> 
> Then here is the full patch for review (vs. percpu-next):
> 
> 
> From: Christoph Lameter <cl@linux-foundation.org>
> Subject: SLUB: Experimental new fastpath w/o interrupt disable
> 
> This is a bit of a different tack on things than the last version provided
> by Mathieu.
> 
> Instead of using a cmpxchg we keep a state variable in the per cpu structure
> that is incremented when we enter the hot path. We can then detect that
> a thread is in the fastpath and fall back to an alternate allocation / free
> technique that bypasses fastpath caching.
> 
> V1->V2:
> - Use safe preempt_enable / disable.
> - Enable preempt before calling into the page allocator
> - Check for hotpath activity changes when returning from the page allocator.
> - Add barriers.
> 
> Todo:
> - Verify that this is really safe
> - Is this a benefit?
> 
> Cc: Mathieu Desnoyers <mathieu.desnoyers@polymtl.ca>
> Cc: Pekka Enberg <penberg@cs.helsinki.fi>
> Signed-off-by: Christoph Lameter <cl@linux-foundation.org>
> 
> 
> ---
>  include/linux/slub_def.h |    1
>  mm/slub.c                |  112 +++++++++++++++++++++++++++++++++++++++--------
>  2 files changed, 96 insertions(+), 17 deletions(-)
> 
> Index: linux-2.6/include/linux/slub_def.h
> ===================================================================
> --- linux-2.6.orig/include/linux/slub_def.h	2009-10-08 11:35:59.000000000 -0500
> +++ linux-2.6/include/linux/slub_def.h	2009-10-08 11:35:59.000000000 -0500
> @@ -38,6 +38,7 @@ struct kmem_cache_cpu {
>  	void **freelist;	/* Pointer to first free per cpu object */
>  	struct page *page;	/* The slab from which we are allocating */
>  	int node;		/* The node of the page (or -1 for debug) */
> +	int active;		/* Active fastpaths */
>  #ifdef CONFIG_SLUB_STATS
>  	unsigned stat[NR_SLUB_STAT_ITEMS];
>  #endif
> Index: linux-2.6/mm/slub.c
> ===================================================================
> --- linux-2.6.orig/mm/slub.c	2009-10-08 11:35:59.000000000 -0500
> +++ linux-2.6/mm/slub.c	2009-10-08 14:03:22.000000000 -0500
> @@ -1606,7 +1606,14 @@ static void *__slab_alloc(struct kmem_ca
>  			  unsigned long addr)
>  {
>  	void **object;
> -	struct page *page = __this_cpu_read(s->cpu_slab->page);
> +	struct page *page;
> +	unsigned long flags;
> +	int hotpath;
> +
> +	local_irq_save(flags);

(Recommend adding)

	preempt_enable_no_resched();


The preempt enable right in the middle of a big function is adding an
unnecessary barrier(), which will restrain gcc from doing its
optimizations.  This might hurt performance.

I still recommend the preempt_enable_no_resched() at the beginning of
__slab_alloc(), and simply putting a check_resched() here (which saves
us the odd compiler barrier in the middle of function).
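
Roughly the structure I have in mind (a sketch only, reusing the names
from your patch):

	local_irq_save(flags);
	preempt_enable_no_resched();	/* preempt count drops here; barrier() sits at the top of the slow path */
	...
	slab_unlock(page);
	local_irq_restore(flags);
	preempt_check_resched();	/* catch any reschedule request we skipped above */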


> +	hotpath = __this_cpu_read(s->cpu_slab->active) != 0;
> +	__this_cpu_dec(s->cpu_slab->active);	/* Drop count from hotpath */
> +	page = __this_cpu_read(s->cpu_slab->page);
> 
>  	/* We handle __GFP_ZERO in the caller */
>  	gfpflags &= ~__GFP_ZERO;
> @@ -1627,12 +1634,24 @@ load_freelist:
>  	if (unlikely(SLABDEBUG && PageSlubDebug(page)))
>  		goto debug;
> 
> -	__this_cpu_write(s->cpu_slab->node, page_to_nid(page));
> -	__this_cpu_write(s->cpu_slab->freelist, get_freepointer(s, object));
> -	page->inuse = page->objects;
> -	page->freelist = NULL;
> +	if (unlikely(hotpath)) {
> +		/* Object on page free list available and hotpath busy */
> +		page->inuse++;
> +		page->freelist = get_freepointer(s, object);
> +
> +	} else {
> +
> +		/* Prepare new list of objects for hotpath */
> +		__this_cpu_write(s->cpu_slab->freelist, get_freepointer(s, object));
> +		page->inuse = page->objects;
> +		page->freelist = NULL;
> +		__this_cpu_write(s->cpu_slab->node, page_to_nid(page));
> +
> +	}
> +
>  unlock_out:
>  	slab_unlock(page);
> +	local_irq_restore(flags);
>  	stat(s, ALLOC_SLOWPATH);
>  	return object;
> 
> @@ -1642,21 +1661,38 @@ another_slab:
>  new_slab:
>  	page = get_partial(s, gfpflags, node);
>  	if (page) {
> -		__this_cpu_write(s->cpu_slab->page, page);
>  		stat(s, ALLOC_FROM_PARTIAL);
> +
> +		if (hotpath)
> +			goto hot_lock;
> +
> +		__this_cpu_write(s->cpu_slab->page, page);
>  		goto load_freelist;
>  	}
> 
>  	if (gfpflags & __GFP_WAIT)
>  		local_irq_enable();
> 
> +	preempt_enable();

We could replace the above by:

if (gfpflags & __GFP_WAIT) {
	local_irq_enable();
	preempt_check_resched();
}


>  	page = new_slab(s, gfpflags, node);
> +	preempt_disable();
> +

(remove the above)


> +	/*
> +	 * We have already decremented our count. Someone else
> +	 * could be running right now or we were moved to a
> +	 * processor that is in the hotpath. So check against -1.
> +	 */
> +	hotpath = __this_cpu_read(s->cpu_slab->active) != -1;
> 
>  	if (gfpflags & __GFP_WAIT)
>  		local_irq_disable();
> 
>  	if (page) {
>  		stat(s, ALLOC_SLAB);
> +
> +		if (hotpath)
> +			 goto hot_no_lock;
> +
>  		if (__this_cpu_read(s->cpu_slab->page))
>  			flush_slab(s, __this_cpu_ptr(s->cpu_slab));
>  		slab_lock(page);
> @@ -1664,9 +1700,13 @@ new_slab:
>  		__this_cpu_write(s->cpu_slab->page, page);
>  		goto load_freelist;
>  	}
> +
> +	local_irq_restore(flags);
> +
>  	if (!(gfpflags & __GFP_NOWARN) && printk_ratelimit())
>  		slab_out_of_memory(s, gfpflags, node);
>  	return NULL;
> +
>  debug:
>  	if (!alloc_debug_processing(s, page, object, addr))
>  		goto another_slab;
> @@ -1675,6 +1715,20 @@ debug:
>  	page->freelist = get_freepointer(s, object);
>  	__this_cpu_write(s->cpu_slab->node, -1);
>  	goto unlock_out;
> +
> +	/*
> +	 * Hotpath is busy and we need to avoid touching
> +	 * hotpath variables
> +	 */
> +hot_no_lock:
> +	slab_lock(page);
> +
> +hot_lock:
> +	__ClearPageSlubFrozen(page);
> +	if (get_freepointer(s, page->freelist))
> +		/* Cannot put page into the hotpath. Instead back to partial */
> +		add_partial(get_node(s, page_to_nid(page)), page, 0);
> +	goto load_freelist;
>  }
> 
>  /*
> @@ -1691,7 +1745,6 @@ static __always_inline void *slab_alloc(
>  		gfp_t gfpflags, int node, unsigned long addr)
>  {
>  	void **object;
> -	unsigned long flags;
> 
>  	gfpflags &= gfp_allowed_mask;
> 
> @@ -1701,18 +1754,27 @@ static __always_inline void *slab_alloc(
>  	if (should_failslab(s->objsize, gfpflags))
>  		return NULL;
> 
> -	local_irq_save(flags);
> +	preempt_disable();
> +
> +	irqsafe_cpu_inc(s->cpu_slab->active);
> +	barrier();
>  	object = __this_cpu_read(s->cpu_slab->freelist);
> -	if (unlikely(!object || !node_match(s, node)))
> +	if (unlikely(!object || !node_match(s, node) ||
> +			__this_cpu_read(s->cpu_slab->active)))

Missing a barrier() here ?

The idea is to let gcc know that "active" inc/dec and "freelist" reads
must never be reordered. Even when the decrement is done in the slow
path branch.
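
i.e., something along these lines (sketch only):

	object = __this_cpu_read(s->cpu_slab->freelist);
	if (unlikely(!object || !node_match(s, node) ||
			__this_cpu_read(s->cpu_slab->active))) {
		barrier();	/* keep the reads above from drifting past the active dec */
		object = __slab_alloc(s, gfpflags, node, addr);
	} else {
		...
	}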

> 
>  		object = __slab_alloc(s, gfpflags, node, addr);
> 
>  	else {
> +
>  		__this_cpu_write(s->cpu_slab->freelist,
>  			get_freepointer(s, object));
> +		barrier();
> +		irqsafe_cpu_dec(s->cpu_slab->active);
>  		stat(s, ALLOC_FASTPATH);
> +
>  	}
> -	local_irq_restore(flags);
> +
> +	preempt_enable();

Could move the preempt_enable() above to the else (fast path) branch.

> 
>  	if (unlikely((gfpflags & __GFP_ZERO) && object))
>  		memset(object, 0, s->objsize);
> @@ -1777,7 +1839,9 @@ static void __slab_free(struct kmem_cach
>  {
>  	void *prior;
>  	void **object = (void *)x;
> +	unsigned long flags;
> 
> +	local_irq_save(flags);
>  	stat(s, FREE_SLOWPATH);
>  	slab_lock(page);
> 
> @@ -1809,6 +1873,7 @@ checks_ok:
> 
>  out_unlock:
>  	slab_unlock(page);
> +	local_irq_restore(flags);
>  	return;
> 
>  slab_empty:
> @@ -1820,6 +1885,7 @@ slab_empty:
>  		stat(s, FREE_REMOVE_PARTIAL);
>  	}
>  	slab_unlock(page);
> +	local_irq_restore(flags);
>  	stat(s, FREE_SLAB);
>  	discard_slab(s, page);
>  	return;
> @@ -1845,24 +1911,31 @@ static __always_inline void slab_free(st
>  			struct page *page, void *x, unsigned long addr)
>  {
>  	void **object = (void *)x;
> -	unsigned long flags;
> 
>  	kmemleak_free_recursive(x, s->flags);
> -	local_irq_save(flags);
> +	preempt_disable();
>  	kmemcheck_slab_free(s, object, s->objsize);
>  	debug_check_no_locks_freed(object, s->objsize);
>  	if (!(s->flags & SLAB_DEBUG_OBJECTS))
>  		debug_check_no_obj_freed(object, s->objsize);
> 
> +	irqsafe_cpu_inc(s->cpu_slab->active);
> +	barrier();
>  	if (likely(page == __this_cpu_read(s->cpu_slab->page) &&
> -			__this_cpu_read(s->cpu_slab->node) >= 0)) {
> -		set_freepointer(s, object, __this_cpu_read(s->cpu_slab->freelist));
> +			__this_cpu_read(s->cpu_slab->node) >= 0) &&
> +			!__this_cpu_read(s->cpu_slab->active)) {
> +		set_freepointer(s, object,
> +			__this_cpu_read(s->cpu_slab->freelist));
>  		__this_cpu_write(s->cpu_slab->freelist, object);
> +		barrier();
> +		irqsafe_cpu_dec(s->cpu_slab->active);
> +		preempt_enable();
>  		stat(s, FREE_FASTPATH);
> -	} else
> +	} else {

Perhaps missing a barrier() in the else ?
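
i.e. (sketch):

	} else {
		barrier();	/* order the reads in the test above before the dec */
		irqsafe_cpu_dec(s->cpu_slab->active);
		preempt_enable();
		__slab_free(s, page, x, addr);
	}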

Thanks,

Mathieu

> +		irqsafe_cpu_dec(s->cpu_slab->active);
> +		preempt_enable();
>  		__slab_free(s, page, x, addr);
> -
> -	local_irq_restore(flags);
> +	}
>  }
> 
>  void kmem_cache_free(struct kmem_cache *s, void *x)
> @@ -2064,6 +2137,8 @@ static DEFINE_PER_CPU(struct kmem_cache_
> 
>  static inline int alloc_kmem_cache_cpus(struct kmem_cache *s, gfp_t flags)
>  {
> +	int cpu;
> +
>  	if (s < kmalloc_caches + KMALLOC_CACHES && s >= kmalloc_caches)
>  		/*
>  		 * Boot time creation of the kmalloc array. Use static per cpu data
> @@ -2073,6 +2148,9 @@ static inline int alloc_kmem_cache_cpus(
>  	else
>  		s->cpu_slab =  alloc_percpu(struct kmem_cache_cpu);
> 
> +	for_each_possible_cpu(cpu)
> +		per_cpu_ptr(s->cpu_slab, cpu)->active = -1;
> +
>  	if (!s->cpu_slab)
>  		return 0;
> 

-- 
Mathieu Desnoyers
OpenPGP key fingerprint: 8CD5 52C3 8E3C 4140 715F  BA06 3F25 A8FE 3BAE 9A68

^ permalink raw reply	[flat|nested] 57+ messages in thread

* Re: [this_cpu_xx V5 19/19] SLUB: Experimental new fastpath w/o interrupt disable
  2009-10-08 20:37                               ` Mathieu Desnoyers
@ 2009-10-08 21:08                                 ` Christoph Lameter
  2009-10-12 13:56                                   ` Mathieu Desnoyers
  0 siblings, 1 reply; 57+ messages in thread
From: Christoph Lameter @ 2009-10-08 21:08 UTC (permalink / raw)
  To: Mathieu Desnoyers
  Cc: Peter Zijlstra, akpm, linux-kernel, Pekka Enberg, Tejun Heo,
	Mel Gorman, mingo

On Thu, 8 Oct 2009, Mathieu Desnoyers wrote:

> > Index: linux-2.6/mm/slub.c
> > ===================================================================
> > --- linux-2.6.orig/mm/slub.c	2009-10-08 11:35:59.000000000 -0500
> > +++ linux-2.6/mm/slub.c	2009-10-08 14:03:22.000000000 -0500
> > @@ -1606,7 +1606,14 @@ static void *__slab_alloc(struct kmem_ca
> >  			  unsigned long addr)
> >  {
> >  	void **object;
> > -	struct page *page = __this_cpu_read(s->cpu_slab->page);
> > +	struct page *page;
> > +	unsigned long flags;
> > +	int hotpath;
> > +
> > +	local_irq_save(flags);
>
> (Recommend adding)
>
> 	preempt_enable_no_resched();
>
>
> The preempt enable right in the middle of a big function is adding an
> unnecessary barrier(), which will restrain gcc from doing its
> optimizations.  This might hurt performance.

In the middle of the function we have determined that we have to go to the
page allocator to get more memory. There is not much the compiler can do
to speed that up.

> I still recommend the preempt_enable_no_resched() at the beginning of
> __slab_alloc(), and simply putting a check_resched() here (which saves
> us the odd compiler barrier in the middle of function).

Then preemption would be unnecessarily disabled for the page allocator
call?

> >  	if (gfpflags & __GFP_WAIT)
> >  		local_irq_enable();
> >
> > +	preempt_enable();
>
> We could replace the above by:
>
> if (gfpflags & __GFP_WAIT) {
> 	local_irq_enable();
> 	preempt_check_resched();
> }

Which would leave preempt off for the page allocator.

> > +	irqsafe_cpu_inc(s->cpu_slab->active);
> > +	barrier();
> >  	object = __this_cpu_read(s->cpu_slab->freelist);
> > -	if (unlikely(!object || !node_match(s, node)))
> > +	if (unlikely(!object || !node_match(s, node) ||
> > +			__this_cpu_read(s->cpu_slab->active)))
>
> Missing a barrier() here ?

The modifications of the s->cpu_slab->freelist in __slab_alloc() are only
done after interrupts have been disabled and after the slab has been locked.

> The idea is to let gcc know that "active" inc/dec and "freelist" reads
> must never be reordered. Even when the decrement is done in the slow
> path branch.

Right. How could that occur with this code?

> > +		preempt_enable();
> >  		stat(s, FREE_FASTPATH);
> > -	} else
> > +	} else {
>
> Perhaps missing a barrier() in the else ?

Not sure why that would be necessary. __slab_free() does not even touch
s->cpu_slab->freelist if you have the same reasons as in the alloc path.



^ permalink raw reply	[flat|nested] 57+ messages in thread

* Re: [this_cpu_xx V5 19/19] SLUB: Experimental new fastpath w/o interrupt disable
  2009-10-08 21:08                                 ` Christoph Lameter
@ 2009-10-12 13:56                                   ` Mathieu Desnoyers
  2009-10-12 14:52                                     ` Christoph Lameter
  0 siblings, 1 reply; 57+ messages in thread
From: Mathieu Desnoyers @ 2009-10-12 13:56 UTC (permalink / raw)
  To: Christoph Lameter
  Cc: Peter Zijlstra, akpm, linux-kernel, Pekka Enberg, Tejun Heo,
	Mel Gorman, mingo

* Christoph Lameter (cl@linux-foundation.org) wrote:
> On Thu, 8 Oct 2009, Mathieu Desnoyers wrote:
> 
> > > Index: linux-2.6/mm/slub.c
> > > ===================================================================
> > > --- linux-2.6.orig/mm/slub.c	2009-10-08 11:35:59.000000000 -0500
> > > +++ linux-2.6/mm/slub.c	2009-10-08 14:03:22.000000000 -0500
> > > @@ -1606,7 +1606,14 @@ static void *__slab_alloc(struct kmem_ca
> > >  			  unsigned long addr)
> > >  {
> > >  	void **object;
> > > -	struct page *page = __this_cpu_read(s->cpu_slab->page);
> > > +	struct page *page;
> > > +	unsigned long flags;
> > > +	int hotpath;
> > > +
> > > +	local_irq_save(flags);
> >
> > (Recommend adding)
> >
> > 	preempt_enable_no_resched();
> >
> >
> > The preempt enable right in the middle of a big function is adding an
> > unnecessary barrier(), which will restrain gcc from doing its
> > optimizations.  This might hurt performance.
> 
> In the middle of the function we have determined that we have to go to the
> page allocator to get more memory. There is not much the compiler can do
> to speed that up.

Indeed, the compiler cannot do much about it. However, the programmer
(you) can move the preempt_enable_no_resched() part of the
preempt_enable() to the beginning of the function.

> 
> > I still recommend the preempt_enable_no_resched() at the beginning of
> > __slab_alloc(), and simply putting a check_resched() here (which saves
> > us the odd compiler barrier in the middle of function).
> 
> Then preemption would be unnecessarily disabled for the page allocator
> call?

No ?
preempt_enable_no_resched() enables preemption.

> 
> > >  	if (gfpflags & __GFP_WAIT)
> > >  		local_irq_enable();
> > >
> > > +	preempt_enable();
> >
> > We could replace the above by:
> >
> > if (gfpflags & __GFP_WAIT) {
> > 	local_irq_enable();
> > 	preempt_check_resched();
> > }
> 
> Which would leave preempt off for the page allocator.

Not if you do preempt_enable_no_resched() at the beginning of the
function, after disabling interrupts.

> 
> > > +	irqsafe_cpu_inc(s->cpu_slab->active);
> > > +	barrier();
> > >  	object = __this_cpu_read(s->cpu_slab->freelist);
> > > -	if (unlikely(!object || !node_match(s, node)))
> > > +	if (unlikely(!object || !node_match(s, node) ||
> > > +			__this_cpu_read(s->cpu_slab->active)))
> >
> > Missing a barrier() here ?
> 
> The modifications of the s->cpu_slab->freelist in __slab_alloc() are only
> done after interrupts have been disabled and after the slab has been locked.

I was concerned about a potential race between
cpu_slab->active/cpu_slab->freelist if an interrupt came in. I
understand that as soon as you get a hint that you must hit the slow
path, you don't care about the order in which these operations have been
done.

> 
> > The idea is to let gcc know that "active" inc/dec and "freelist" reads
> > must never be reordered. Even when the decrement is done in the slow
> > path branch.
> 
> Right. How could that occur with this code?
> 

__slab_alloc calls __this_cpu_dec(s->cpu_slab->active); without any
compiler barrier. But I get that when __slab_alloc is executed, we don't
care about the "active" dec being reordered, because we're not altering fast
path data anymore.

> > > +		preempt_enable();
> > >  		stat(s, FREE_FASTPATH);
> > > -	} else
> > > +	} else {
> >
> > Perhaps missing a barrier() in the else ?
> 
> Not sure why that would be necessary. __slab_free() does not even touch
> s->cpu_slab->freelist if you have the same reasons as in the alloc path.

My intent was to order __this_cpu_read(s->cpu_slab->page) and
irqsafe_cpu_dec(s->cpu_slab->active), but I get that if you run the slow
path, you don't care about some spilling of the slow path over the slab
active critical section.

Thanks,

Mathieu

> 
> 

-- 
Mathieu Desnoyers
OpenPGP key fingerprint: 8CD5 52C3 8E3C 4140 715F  BA06 3F25 A8FE 3BAE 9A68

^ permalink raw reply	[flat|nested] 57+ messages in thread

* Re: [this_cpu_xx V5 19/19] SLUB: Experimental new fastpath w/o interrupt disable
  2009-10-12 13:56                                   ` Mathieu Desnoyers
@ 2009-10-12 14:52                                     ` Christoph Lameter
  2009-10-12 15:26                                       ` Mathieu Desnoyers
  0 siblings, 1 reply; 57+ messages in thread
From: Christoph Lameter @ 2009-10-12 14:52 UTC (permalink / raw)
  To: Mathieu Desnoyers
  Cc: Peter Zijlstra, akpm, linux-kernel, Pekka Enberg, Tejun Heo,
	Mel Gorman, mingo

On Mon, 12 Oct 2009, Mathieu Desnoyers wrote:

> > In the middle of the function we have determined that we have to go to the
> > page allocator to get more memory. There is not much the compiler can do
> > to speed that up.
>
> Indeed, the compiler cannot do much about it. However, the programmer
> (you) can move the preempt_enable_no_resched() part of the
> preempt_enable() to the beginning of the function.

Ok but then we have the issue that the later irq enable in the
slowpath will not check for preemption.

> > > I still recommend the preempt_enable_no_resched() at the beginning of
> > > __slab_alloc(), and simply putting a check_resched() here (which saves
> > > us the odd compiler barrier in the middle of function).
> >
> > Then preemption would be unnecessarily disabled for the page allocator
> > call?
>
> No ?
> preempt_enable_no_resched() enables preemption.

If I just enable interrupts there then the preempt check will not be
done and we may miss a scheduling point.


^ permalink raw reply	[flat|nested] 57+ messages in thread

* Re: [this_cpu_xx V5 19/19] SLUB: Experimental new fastpath w/o interrupt disable
  2009-10-12 15:26                                       ` Mathieu Desnoyers
@ 2009-10-12 15:23                                         ` Christoph Lameter
  2009-10-12 15:38                                           ` Mathieu Desnoyers
  0 siblings, 1 reply; 57+ messages in thread
From: Christoph Lameter @ 2009-10-12 15:23 UTC (permalink / raw)
  To: Mathieu Desnoyers
  Cc: Peter Zijlstra, akpm, linux-kernel, Pekka Enberg, Tejun Heo,
	Mel Gorman, mingo

On Mon, 12 Oct 2009, Mathieu Desnoyers wrote:

> > If I just enable interrupts there then the preempt check will not be
> > done and we may miss a scheduling point.
> >
>
> That's why you should do:
>
> local_irq_save()
> preempt_enable_no_resched()
> ...
> local_irq_restore()
> preempt_check_resched()

What is the difference then to

local_irq_save()

...

local_irq_enable();
preempt_enable();

?

^ permalink raw reply	[flat|nested] 57+ messages in thread

* Re: [this_cpu_xx V5 19/19] SLUB: Experimental new fastpath w/o interrupt disable
  2009-10-12 14:52                                     ` Christoph Lameter
@ 2009-10-12 15:26                                       ` Mathieu Desnoyers
  2009-10-12 15:23                                         ` Christoph Lameter
  0 siblings, 1 reply; 57+ messages in thread
From: Mathieu Desnoyers @ 2009-10-12 15:26 UTC (permalink / raw)
  To: Christoph Lameter
  Cc: Peter Zijlstra, akpm, linux-kernel, Pekka Enberg, Tejun Heo,
	Mel Gorman, mingo

* Christoph Lameter (cl@linux-foundation.org) wrote:
> On Mon, 12 Oct 2009, Mathieu Desnoyers wrote:
> 
> > > In the middle of the function we have determined that we have to go to the
> > > page allocator to get more memory. There is not much the compiler can do
> > > to speed that up.
> >
> > Indeed, the compiler cannot do much about it. However, the programmer
> > (you) can move the preempt_enable_no_resched() part of the
> > preempt_enable() to the beginning of the function.
> 
> Ok but then we have the issue that the later irq enable in the
> slowpath will not check for preemption.
> 
> > > > I still recommend the preempt_enable_no_resched() at the beginning of
> > > > __slab_alloc(), and simply putting a check_resched() here (which saves
> > > > us the odd compiler barrier in the middle of function).
> > >
> > > Then preemption would be unnecessarily disabled for the page allocator
> > > call?
> >
> > No ?
> > preempt_enable_no_resched() enables preemption.
> 
> If I just enable interrupts there then the preempt check will not be
> done and we may miss a scheduling point.
> 

That's why you should do:

local_irq_save()
preempt_enable_no_resched()
...
local_irq_restore()
preempt_check_resched()

Mathieu



-- 
Mathieu Desnoyers
OpenPGP key fingerprint: 8CD5 52C3 8E3C 4140 715F  BA06 3F25 A8FE 3BAE 9A68

^ permalink raw reply	[flat|nested] 57+ messages in thread

* Re: [this_cpu_xx V5 19/19] SLUB: Experimental new fastpath w/o interrupt disable
  2009-10-12 15:38                                           ` Mathieu Desnoyers
@ 2009-10-12 15:38                                             ` Christoph Lameter
  2009-10-12 16:05                                               ` Mathieu Desnoyers
  0 siblings, 1 reply; 57+ messages in thread
From: Christoph Lameter @ 2009-10-12 15:38 UTC (permalink / raw)
  To: Mathieu Desnoyers
  Cc: Peter Zijlstra, akpm, linux-kernel, Pekka Enberg, Tejun Heo,
	Mel Gorman, mingo

On Mon, 12 Oct 2009, Mathieu Desnoyers wrote:

> local_irq_enable();
> preempt_enable(); <- barrier()
>
> In the first scenario, the compiler barrier is at the beginning of the
> slow path function, which should impose fewer restrictions on the compiler
> optimizations.

Again: The next statement is a call to a function that calls the page
allocator. It's one call vs. two for me. Plus the flow of things is much
cleaner and less obfuscated.




^ permalink raw reply	[flat|nested] 57+ messages in thread

* Re: [this_cpu_xx V5 19/19] SLUB: Experimental new fastpath w/o interrupt disable
  2009-10-12 15:23                                         ` Christoph Lameter
@ 2009-10-12 15:38                                           ` Mathieu Desnoyers
  2009-10-12 15:38                                             ` Christoph Lameter
  0 siblings, 1 reply; 57+ messages in thread
From: Mathieu Desnoyers @ 2009-10-12 15:38 UTC (permalink / raw)
  To: Christoph Lameter
  Cc: Peter Zijlstra, akpm, linux-kernel, Pekka Enberg, Tejun Heo,
	Mel Gorman, mingo

* Christoph Lameter (cl@linux-foundation.org) wrote:
> On Mon, 12 Oct 2009, Mathieu Desnoyers wrote:
> 
> > > If I just enable interrupts there then the preempt check will not be
> > > done and we may miss a scheduling point.
> > >
> >
> > That's why you should do:
> >
> > local_irq_save()
> > preempt_enable_no_resched()
> > ...
> > local_irq_restore()
> > preempt_check_resched()
> 
> What is the difference then to
> 
> local_irq_save()
> 
> ...
> 
> local_irq_enable();
> preempt_enable();
> 
> ?

local_irq_save()
preempt_enable_no_resched()  <- barrier()
...
local_irq_enable()
preempt_check_resched()

vs

local_irq_save()
...
local_irq_enable();
preempt_enable(); <- barrier()

In the first scenario, the compiler barrier is at the beginning of the
slow path function, which should impose fewer restrictions on the compiler
optimizations.

Mathieu


-- 
Mathieu Desnoyers
OpenPGP key fingerprint: 8CD5 52C3 8E3C 4140 715F  BA06 3F25 A8FE 3BAE 9A68

^ permalink raw reply	[flat|nested] 57+ messages in thread

* Re: [this_cpu_xx V5 19/19] SLUB: Experimental new fastpath w/o interrupt disable
  2009-10-12 15:38                                             ` Christoph Lameter
@ 2009-10-12 16:05                                               ` Mathieu Desnoyers
  0 siblings, 0 replies; 57+ messages in thread
From: Mathieu Desnoyers @ 2009-10-12 16:05 UTC (permalink / raw)
  To: Christoph Lameter
  Cc: Peter Zijlstra, akpm, linux-kernel, Pekka Enberg, Tejun Heo,
	Mel Gorman, mingo

* Christoph Lameter (cl@linux-foundation.org) wrote:
> On Mon, 12 Oct 2009, Mathieu Desnoyers wrote:
> 
> > local_irq_enable();
> > preempt_enable(); <- barrier()
> >
> > In the first scenario, the compiler barrier is at the beginning of the
> > slow path function, which should impose less restrictions on the compiler
> > optimizations.
> 
> Again: The next statement is a call to a function that calls the page
> allocator. It's one call vs. two for me. Plus the flow of things is much
> cleaner and less obfuscated.
> 

That's a convincing argument, I agree.

Thanks,

Mathieu


-- 
Mathieu Desnoyers
OpenPGP key fingerprint: 8CD5 52C3 8E3C 4140 715F  BA06 3F25 A8FE 3BAE 9A68

^ permalink raw reply	[flat|nested] 57+ messages in thread

end of thread, other threads:[~2009-10-12 16:05 UTC | newest]

Thread overview: 57+ messages
2009-10-06 23:36 [this_cpu_xx V5 00/19] Introduce per cpu atomic operations and avoid per cpu address arithmetic cl
2009-10-06 23:36 ` [this_cpu_xx V5 01/19] Introduce this_cpu_ptr() and generic this_cpu_* operations cl
2009-10-06 23:52   ` Tejun Heo
2009-10-07 14:23     ` Christoph Lameter
2009-10-07 15:29       ` Tejun Heo
2009-10-06 23:36 ` [this_cpu_xx V5 02/19] this_cpu: X86 optimized this_cpu operations cl
2009-10-06 23:36 ` [this_cpu_xx V5 03/19] Use this_cpu operations for SNMP statistics cl
2009-10-06 23:36 ` [this_cpu_xx V5 04/19] Use this_cpu operations for NFS statistics cl
2009-10-06 23:36 ` [this_cpu_xx V5 05/19] use this_cpu ops for network statistics cl
2009-10-06 23:37 ` [this_cpu_xx V5 06/19] this_cpu_ptr: Straight transformations cl
2009-10-06 23:37 ` [this_cpu_xx V5 07/19] this_cpu_ptr: Eliminate get/put_cpu cl
2009-10-06 23:37 ` [this_cpu_xx V5 09/19] Use this_cpu_ptr in crypto subsystem cl
2009-10-06 23:37 ` [this_cpu_xx V5 10/19] Use this_cpu ops for VM statistics cl
2009-10-06 23:37 ` [this_cpu_xx V5 11/19] RCU: Use this_cpu operations cl
2009-10-06 23:37 ` [this_cpu_xx V5 12/19] this_cpu_ops: page allocator conversion cl
2009-10-06 23:37 ` [this_cpu_xx V5 13/19] this_cpu ops: Remove pageset_notifier cl
2009-10-06 23:37 ` [this_cpu_xx V5 14/19] Use this_cpu operations in slub cl
2009-10-06 23:37 ` [this_cpu_xx V5 15/19] SLUB: Get rid of dynamic DMA kmalloc cache allocation cl
2009-10-06 23:37 ` [this_cpu_xx V5 16/19] this_cpu: Remove slub kmem_cache fields cl
2009-10-06 23:37 ` [this_cpu_xx V5 17/19] Make slub statistics use this_cpu_inc cl
2009-10-06 23:37 ` [this_cpu_xx V5 18/19] this_cpu: slub aggressive use of this_cpu operations in the hotpaths cl
2009-10-06 23:37 ` [this_cpu_xx V5 19/19] SLUB: Experimental new fastpath w/o interrupt disable cl
2009-10-07  2:54   ` Mathieu Desnoyers
2009-10-07  9:11     ` Peter Zijlstra
2009-10-07 12:46       ` Mathieu Desnoyers
2009-10-07 13:01         ` Peter Zijlstra
2009-10-07 13:31           ` Mathieu Desnoyers
2009-10-07 14:37             ` Peter Zijlstra
2009-10-07 14:21           ` Christoph Lameter
2009-10-07 14:42         ` Christoph Lameter
2009-10-07 15:02           ` Mathieu Desnoyers
2009-10-07 15:05             ` Christoph Lameter
2009-10-07 15:19               ` Mathieu Desnoyers
2009-10-07 15:21                 ` Christoph Lameter
2009-10-07 15:41                   ` Mathieu Desnoyers
2009-10-07 16:42                     ` Christoph Lameter
2009-10-07 17:12                       ` Mathieu Desnoyers
2009-10-08  7:52                   ` Peter Zijlstra
2009-10-08 12:44                     ` Mathieu Desnoyers
2009-10-08 12:53                       ` Peter Zijlstra
2009-10-08 16:17                         ` Christoph Lameter
2009-10-08 17:22                         ` Mathieu Desnoyers
2009-10-08 16:11                     ` Christoph Lameter
2009-10-08 17:17                       ` Mathieu Desnoyers
2009-10-08 17:44                         ` Christoph Lameter
2009-10-08 19:17                           ` Mathieu Desnoyers
2009-10-08 19:21                             ` Christoph Lameter
2009-10-08 20:37                               ` Mathieu Desnoyers
2009-10-08 21:08                                 ` Christoph Lameter
2009-10-12 13:56                                   ` Mathieu Desnoyers
2009-10-12 14:52                                     ` Christoph Lameter
2009-10-12 15:26                                       ` Mathieu Desnoyers
2009-10-12 15:23                                         ` Christoph Lameter
2009-10-12 15:38                                           ` Mathieu Desnoyers
2009-10-12 15:38                                             ` Christoph Lameter
2009-10-12 16:05                                               ` Mathieu Desnoyers
2009-10-07 15:25     ` Christoph Lameter
