[this_cpu_xx V4 00/20] Introduce per cpu atomic operations and avoid per cpu address arithmetic

linux-kernel.vger.kernel.org archive mirror
 help / color / mirror / Atom feed

* [this_cpu_xx V4 00/20] Introduce per cpu atomic operations and avoid per cpu address arithmetic
@ 2009-10-01 21:25 cl
  2009-10-01 21:25 ` [this_cpu_xx V4 01/20] Introduce this_cpu_ptr() and generic this_cpu_* operations cl
                   ` (20 more replies)
  0 siblings, 21 replies; 65+ messages in thread
From: cl @ 2009-10-01 21:25 UTC (permalink / raw)
  To: akpm; +Cc: linux-kernel, Tejun Heo, mingo, rusty, Pekka Enberg

V3->V4:
- Fix various macro definitions.
- Provider experimental percpu based fastpath that does not disable
  interrupts for SLUB.

V2->V3:
- Available via git tree against latest upstream from
	 git://git.kernel.org/pub/scm/linux/kernel/git/christoph/percpu.git linus
- Rework SLUB per cpu operations. Get rid of dynamic DMA slab creation
  for CONFIG_ZONE_DMA
- Create fallback framework so that 64 bit ops on 32 bit platforms
  can fallback to the use of preempt or interrupt disable. 64 bit
  platforms can use 64 bit atomic per cpu ops.

V1->V2:
- Various minor fixes
- Add SLUB conversion
- Add Page allocator conversion
- Patch against the git tree of today

The patchset introduces various operations to allow efficient access
to per cpu variables for the current processor. Currently there is
no way in the core to calculate the address of the instance
of a per cpu variable without a table lookup. So we see a lot of

	per_cpu_ptr(x, smp_processor_id())

The patchset introduces a way to calculate the address using the offset
that is available in arch specific ways (register or special memory
locations) using

	this_cpu_ptr(x)

In addition macros are provided that can operate on per cpu
variables in a per cpu atomic way. With that scalars in structures
allocated with the new percpu allocator can be modified without disabling
preempt or interrupts. This works by generating a single instruction that
does both the relocation of the address to the proper percpu area and
the RMW action.

F.e.

	this_cpu_add(x->var, 20)

can be used to generate an instruction that uses a segment register for the
relocation of the per cpu address into the per cpu area of the current processor
and then increments the variable by 20. The instruction cannot be interrupted
and therefore the modification is atomic vs the cpu (it either happens or not).
Rescheduling or interrupt can only happen before or after the instruction.

Per cpu atomicness does not provide protection from concurrent modifications from
other processors. In general per cpu data is modified only from the processor
that the per cpu area is associated with. So per cpu atomicness provides a fast
and effective means of dealing with concurrency. It may allow development of
better fastpaths for allocators and other important subsystems.

The per cpu atomic RMW operations can be used to avoid having to dimension pointer
arrays in the allocators (patches for page allocator and slub are provided) and
avoid pointer lookups in the hot paths of the allocators thereby decreasing
latency of critical OS paths. The macros could be used to revise the critical
paths in the allocators to no longer need to disable interrupts (not included).

Per cpu atomic RMW operations are useful to decrease the overhead of counter
maintenance in the kernel. A this_cpu_inc() f.e. can generate a single
instruction that has no needs for registers on x86. preempt on / off can
be avoided in many places.

Patchset will reduce the code size and increase speed of operations for
dynamically allocated per cpu based statistics. A set of patches modifies
the fastpaths of the SLUB allocator reducing code size and cache footprint
through the per cpu atomic operations.

This patch depends on all arches supporting the new per cpu allocator.
IA64 still uses the old percpu allocator. Tejon has patches to fixup IA64
and the patch was approved by Tony Luck but the IA64 patches have not been
merged yet.

^ permalink raw reply	[flat|nested] 65+ messages in thread

* [this_cpu_xx V4 01/20] Introduce this_cpu_ptr() and generic this_cpu_* operations
  2009-10-01 21:25 [this_cpu_xx V4 00/20] Introduce per cpu atomic operations and avoid per cpu address arithmetic cl
@ 2009-10-01 21:25 ` cl
  2009-10-02  9:16   ` Tejun Heo
  2009-10-02  9:34   ` Ingo Molnar
  2009-10-01 21:25 ` [this_cpu_xx V4 02/20] this_cpu: X86 optimized this_cpu operations cl
                   ` (19 subsequent siblings)
  20 siblings, 2 replies; 65+ messages in thread
From: cl @ 2009-10-01 21:25 UTC (permalink / raw)
  To: akpm
  Cc: linux-kernel, David Howells, Tejun Heo, Ingo Molnar,
	Rusty Russell, Eric Dumazet, Pekka Enberg

[-- Attachment #1: this_cpu_ptr_intro --]
[-- Type: text/plain, Size: 19810 bytes --]

This patch introduces two things: First this_cpu_ptr and then per cpu
atomic operations.

this_cpu_ptr
------------

A common operation when dealing with cpu data is to get the instance of the
cpu data associated with the currently executing processor. This can be
optimized by

this_cpu_ptr(xx) = per_cpu_ptr(xx, smp_processor_id).

The problem with per_cpu_ptr(x, smp_processor_id) is that it requires
an array lookup to find the offset for the cpu. Processors typically
have the offset for the current cpu area in some kind of (arch dependent)
efficiently accessible register or memory location.

We can use that instead of doing the array lookup to speed up the
determination of the address of the percpu variable. This is particularly
significant because these lookups occur in performance critical paths
of the core kernel. this_cpu_ptr() can avoid memory accesses and

this_cpu_ptr comes in two flavors. The preemption context matters since we
are referring the the currently executing processor. In many cases we must
insure that the processor does not change while a code segment is executed.

__this_cpu_ptr 	-> Do not check for preemption context
this_cpu_ptr	-> Check preemption context

The parameter to these operations is a per cpu pointer. This can be the
address of a statically defined per cpu variable (&per_cpu_var(xxx)) or
the address of a per cpu variable allocated with the per cpu allocator.

per cpu atomic operations: this_cpu_*(var, val)
-----------------------------------------------
this_cpu_* operations (like this_cpu_add(struct->y, value) operate on
abitrary scalars that are members of structures allocated with the new
per cpu allocator. They can also operate on static per_cpu variables
if they are passed to per_cpu_var() (See patch to use this_cpu_*
operations for vm statistics).

These operations are guaranteed to be atomic vs preemption when modifying
the scalar. The calculation of the per cpu offset is also guaranteed to
be atomic at the same time. This means that a this_cpu_* operation can be
safely used to modify a per cpu variable in a context where interrupts are
enabled and preemption is allowed. Many architectures can perform such
a per cpu atomic operation with a single instruction.

Note that the atomicity here is different from regular atomic operations.
Atomicity is only guaranteed for data accessed from the currently executing
processor. Modifications from other processors are still possible. There
must be other guarantees that the per cpu data is not modified from another
processor when using these instruction. The per cpu atomicity is created
by the fact that the processor either executes and instruction or not.
Embedded in the instruction is the relocation of the per cpu address to
the are reserved for the current processor and the RMW action. Therefore
interrupts or preemption cannot occur in the mids of this processing.

Generic fallback functions are used if an arch does not define optimized
this_cpu operations. The functions come also come in the two flavors used
for this_cpu_ptr().

The firstparameter is a scalar that is a member of a structure allocated
through allocpercpu or a per cpu variable (use per_cpu_var(xxx)). The
operations are similar to what percpu_add() and friends do.

this_cpu_read(scalar)
this_cpu_write(scalar, value)
this_cpu_add(scale, value)
this_cpu_sub(scalar, value)
this_cpu_inc(scalar)
this_cpu_dec(scalar)
this_cpu_and(scalar, value)
this_cpu_or(scalar, value)
this_cpu_xor(scalar, value)

Arch code can override the generic functions and provide optimized atomic
per cpu operations. These atomic operations must provide both the relocation
(x86 does it through a segment override) and the operation on the data in a
single instruction. Otherwise preempt needs to be disabled and there is no
gain from providing arch implementations.

A third variant is provided prefixed by irqsafe_. These variants are safe
against hardware interrupts on the *same* processor (all per cpu atomic
primitives are *always* *only* providing safety for code running on the
*same* processor!). The increment needs to be implemented by the hardware
in such a way that it is a single RMW instruction that is either processed
before or after an interrupt.

cc: David Howells <dhowells@redhat.com>
cc: Tejun Heo <tj@kernel.org>
cc: Ingo Molnar <mingo@elte.hu>
cc: Rusty Russell <rusty@rustcorp.com.au>
cc: Eric Dumazet <dada1@cosmosbay.com>
Signed-off-by: Christoph Lameter <cl@linux-foundation.org>

---
 include/asm-generic/percpu.h |    5 
 include/linux/percpu.h       |  400 +++++++++++++++++++++++++++++++++++++++++++
 2 files changed, 405 insertions(+)

Index: linux-2.6/include/linux/percpu.h
===================================================================
--- linux-2.6.orig/include/linux/percpu.h	2009-10-01 14:14:00.000000000 -0500
+++ linux-2.6/include/linux/percpu.h	2009-10-01 15:18:18.000000000 -0500
@@ -243,4 +243,404 @@ do {									\
 # define percpu_xor(var, val)		__percpu_generic_to_op(var, (val), ^=)
 #endif
 
+/*
+ * Branching function to split up a function into a set of functions that
+ * are called for different scalar sizes of the objects handled.
+ */
+
+extern void __bad_size_call_parameter(void);
+
+#define __size_call_return(stem, variable)				\
+({	typeof(variable) ret__;						\
+	switch(sizeof(variable)) {					\
+	case 1: ret__ = stem##1(variable);break;			\
+	case 2: ret__ = stem##2(variable);break;			\
+	case 4: ret__ = stem##4(variable);break;			\
+	case 8: ret__ = stem##8(variable);break;			\
+	default:							\
+		__bad_size_call_parameter();break;			\
+	}								\
+	ret__;								\
+})
+
+#define __size_call(stem, variable, ...)				\
+do {									\
+	switch(sizeof(variable)) {					\
+		case 1: stem##1(variable, __VA_ARGS__);break;		\
+		case 2: stem##2(variable, __VA_ARGS__);break;		\
+		case 4: stem##4(variable, __VA_ARGS__);break;		\
+		case 8: stem##8(variable, __VA_ARGS__);break;		\
+		default: 						\
+			__bad_size_call_parameter();break;		\
+	}								\
+} while (0)
+
+/*
+ * Optimized manipulation for memory allocated through the per cpu
+ * allocator or for addresses of per cpu variables (can be determined
+ * using per_cpu_var(xx).
+ *
+ * These operation guarantee exclusivity of access for other operations
+ * on the *same* processor. The assumption is that per cpu data is only
+ * accessed by a single processor instance (the current one).
+ *
+ * The first group is used for accesses that must be done in a
+ * preemption safe way since we know that the context is not preempt
+ * safe. Interrupts may occur. If the interrupt modifies the variable
+ * too then RMW actions will not be reliable.
+ *
+ * The arch code can provide optimized functions in two ways:
+ *
+ * 1. Override the function completely. F.e. define this_cpu_add().
+ *    The arch must then ensure that the various scalar format passed
+ *    are handled correctly.
+ *
+ * 2. Provide functions for certain scalar sizes. F.e. provide
+ *    this_cpu_add_2() to provide per cpu atomic operations for 2 byte
+ *    sized RMW actions. If arch code does not provide operations for
+ *    a scalar size then the fallback in the generic code will be
+ *    used.
+ */
+
+#define _this_cpu_generic_read(pcp)					\
+({	typeof(pcp) ret__;						\
+	preempt_disable();						\
+	ret__ = *this_cpu_ptr(&(pcp));					\
+	preempt_enable();						\
+	ret__;								\
+})
+
+#ifndef this_cpu_read
+# ifndef this_cpu_read_1
+#  define this_cpu_read_1(pcp)	_this_cpu_generic_read(pcp)
+# endif
+# ifndef this_cpu_read_2
+#  define this_cpu_read_2(pcp)	_this_cpu_generic_read(pcp)
+# endif
+# ifndef this_cpu_read_4
+#  define this_cpu_read_4(pcp)	_this_cpu_generic_read(pcp)
+# endif
+# ifndef this_cpu_read_8
+#  define this_cpu_read_8(pcp)	_this_cpu_generic_read(pcp)
+# endif
+# define this_cpu_read(pcp)	__size_call_return(this_cpu_read_, (pcp))
+#endif
+
+#define _this_cpu_generic_to_op(pcp, val, op)				\
+do {									\
+	preempt_disable();						\
+	*__this_cpu_ptr(&pcp) op val;					\
+	preempt_enable();						\
+} while (0)
+
+#ifndef this_cpu_write
+# ifndef this_cpu_write_1
+#  define this_cpu_write_1(pcp, val)	_this_cpu_generic_to_op((pcp), (val), =)
+# endif
+# ifndef this_cpu_write_2
+#  define this_cpu_write_2(pcp, val)	_this_cpu_generic_to_op((pcp), (val), =)
+# endif
+# ifndef this_cpu_write_4
+#  define this_cpu_write_4(pcp, val)	_this_cpu_generic_to_op((pcp), (val), =)
+# endif
+# ifndef this_cpu_write_8
+#  define this_cpu_write_8(pcp, val)	_this_cpu_generic_to_op((pcp), (val), =)
+# endif
+# define this_cpu_write(pcp, val)	__size_call(this_cpu_write_, (pcp), (val))
+#endif
+
+#ifndef this_cpu_add
+# ifndef this_cpu_add_1
+#  define this_cpu_add_1(pcp, val)	_this_cpu_generic_to_op((pcp), (val), +=)
+# endif
+# ifndef this_cpu_add_2
+#  define this_cpu_add_2(pcp, val)	_this_cpu_generic_to_op((pcp), (val), +=)
+# endif
+# ifndef this_cpu_add_4
+#  define this_cpu_add_4(pcp, val)	_this_cpu_generic_to_op((pcp), (val), +=)
+# endif
+# ifndef this_cpu_add_8
+#  define this_cpu_add_8(pcp, val)	_this_cpu_generic_to_op((pcp), (val), +=)
+# endif
+# define this_cpu_add(pcp, val)		__size_call(this_cpu_add_, (pcp), (val))
+#endif
+
+#ifndef this_cpu_sub
+# define this_cpu_sub(pcp, val)		this_cpu_add((pcp), -(val))
+#endif
+
+#ifndef this_cpu_inc
+# define this_cpu_inc(pcp)		this_cpu_add((pcp), 1)
+#endif
+
+#ifndef this_cpu_dec
+# define this_cpu_dec(pcp)		this_cpu_sub((pcp), 1)
+#endif
+
+#ifndef this_cpu_and
+# ifndef this_cpu_and_1
+#  define this_cpu_and_1(pcp, val)	_this_cpu_generic_to_op((pcp), (val), &=)
+# endif
+# ifndef this_cpu_and_2
+#  define this_cpu_and_2(pcp, val)	_this_cpu_generic_to_op((pcp), (val), &=)
+# endif
+# ifndef this_cpu_and_4
+#  define this_cpu_and_4(pcp, val)	_this_cpu_generic_to_op((pcp), (val), &=)
+# endif
+# ifndef this_cpu_and_8
+#  define this_cpu_and_8(pcp, val)	_this_cpu_generic_to_op((pcp), (val), &=)
+# endif
+# define this_cpu_and(pcp, val)		__size_call(this_cpu_and_, (pcp), (val))
+#endif
+
+#ifndef this_cpu_or
+# ifndef this_cpu_or_1
+#  define this_cpu_or_1(pcp, val)	_this_cpu_generic_to_op((pcp), (val), |=)
+# endif
+# ifndef this_cpu_or_2
+#  define this_cpu_or_2(pcp, val)	_this_cpu_generic_to_op((pcp), (val), |=)
+# endif
+# ifndef this_cpu_or_4
+#  define this_cpu_or_4(pcp, val)	_this_cpu_generic_to_op((pcp), (val), |=)
+# endif
+# ifndef this_cpu_or_8
+#  define this_cpu_or_8(pcp, val)	_this_cpu_generic_to_op((pcp), (val), |=)
+# endif
+# define this_cpu_or(pcp, val)		__size_call(this_cpu_or_, (pcp), (val))
+#endif
+
+#ifndef this_cpu_xor
+# ifndef this_cpu_xor_1
+#  define this_cpu_xor_1(pcp, val)	_this_cpu_generic_to_op((pcp), (val), ^=)
+# endif
+# ifndef this_cpu_xor_2
+#  define this_cpu_xor_2(pcp, val)	_this_cpu_generic_to_op((pcp), (val), ^=)
+# endif
+# ifndef this_cpu_xor_4
+#  define this_cpu_xor_4(pcp, val)	_this_cpu_generic_to_op((pcp), (val), ^=)
+# endif
+# ifndef this_cpu_xor_8
+#  define this_cpu_xor_8(pcp, val)	_this_cpu_generic_to_op((pcp), (val), ^=)
+# endif
+# define this_cpu_xor(pcp, val)		__size_call(this_cpu_or_, (pcp), (val))
+#endif
+
+/*
+ * Generic percpu operations that do not require preemption handling.
+ * Either we do not care about races or the caller has the
+ * responsibility of handling preemptions issues. Arch code can still
+ * override these instructions since the arch per cpu code may be more
+ * efficient and may actually get race freeness for free (that is the
+ * case for x86 for example).
+ *
+ * If there is no other protection through preempt disable and/or
+ * disabling interupts then one of these RMW operations can show unexpected
+ * behavior because the execution thread was rescheduled on another processor
+ * or an interrupt occurred and the same percpu variable was modified from
+ * the interrupt context.
+ */
+#ifndef __this_cpu_read
+# ifndef __this_cpu_read_1
+#  define __this_cpu_read_1(pcp)	(*__this_cpu_ptr(&(pcp)))
+# endif
+# ifndef __this_cpu_read_2
+#  define __this_cpu_read_2(pcp)	(*__this_cpu_ptr(&(pcp)))
+# endif
+# ifndef __this_cpu_read_4
+#  define __this_cpu_read_4(pcp)	(*__this_cpu_ptr(&(pcp)))
+# endif
+# ifndef __this_cpu_read_8
+#  define __this_cpu_read_8(pcp)	(*__this_cpu_ptr(&(pcp)))
+# endif
+# define __this_cpu_read(pcp)	__size_call_return(__this_cpu_read_, (pcp))
+#endif
+
+#define __this_cpu_generic_to_op(pcp, val, op)				\
+do {									\
+	*__this_cpu_ptr(&(pcp)) op val;					\
+} while (0)
+
+#ifndef __this_cpu_write
+# ifndef __this_cpu_write_1
+#  define __this_cpu_write_1(pcp, val)	__this_cpu_generic_to_op((pcp), (val), =)
+# endif
+# ifndef __this_cpu_write_2
+#  define __this_cpu_write_2(pcp, val)	__this_cpu_generic_to_op((pcp), (val), =)
+# endif
+# ifndef __this_cpu_write_4
+#  define __this_cpu_write_4(pcp, val)	__this_cpu_generic_to_op((pcp), (val), =)
+# endif
+# ifndef __this_cpu_write_8
+#  define __this_cpu_write_8(pcp, val)	__this_cpu_generic_to_op((pcp), (val), =)
+# endif
+# define __this_cpu_write(pcp, val)	__size_call(__this_cpu_write_, (pcp), (val))
+#endif
+
+#ifndef __this_cpu_add
+# ifndef __this_cpu_add_1
+#  define __this_cpu_add_1(pcp, val)	__this_cpu_generic_to_op((pcp), (val), +=)
+# endif
+# ifndef __this_cpu_add_2
+#  define __this_cpu_add_2(pcp, val)	__this_cpu_generic_to_op((pcp), (val), +=)
+# endif
+# ifndef __this_cpu_add_4
+#  define __this_cpu_add_4(pcp, val)	__this_cpu_generic_to_op((pcp), (val), +=)
+# endif
+# ifndef __this_cpu_add_8
+#  define __this_cpu_add_8(pcp, val)	__this_cpu_generic_to_op((pcp), (val), +=)
+# endif
+# define __this_cpu_add(pcp, val)	__size_call(__this_cpu_add_, (pcp), (val))
+#endif
+
+#ifndef __this_cpu_sub
+# define __this_cpu_sub(pcp, val)	__this_cpu_add((pcp), -(val))
+#endif
+
+#ifndef __this_cpu_inc
+# define __this_cpu_inc(pcp)		__this_cpu_add((pcp), 1)
+#endif
+
+#ifndef __this_cpu_dec
+# define __this_cpu_dec(pcp)		__this_cpu_sub((pcp), 1)
+#endif
+
+#ifndef __this_cpu_and
+# ifndef __this_cpu_and_1
+#  define __this_cpu_and_1(pcp, val)	__this_cpu_generic_to_op((pcp), (val), &=)
+# endif
+# ifndef __this_cpu_and_2
+#  define __this_cpu_and_2(pcp, val)	__this_cpu_generic_to_op((pcp), (val), &=)
+# endif
+# ifndef __this_cpu_and_4
+#  define __this_cpu_and_4(pcp, val)	__this_cpu_generic_to_op((pcp), (val), &=)
+# endif
+# ifndef __this_cpu_and_8
+#  define __this_cpu_and_8(pcp, val)	__this_cpu_generic_to_op((pcp), (val), &=)
+# endif
+# define __this_cpu_and(pcp, val)	__size_call(__this_cpu_and_, (pcp), (val))
+#endif
+
+#ifndef __this_cpu_or
+# ifndef __this_cpu_or_1
+#  define __this_cpu_or_1(pcp, val)	__this_cpu_generic_to_op((pcp), (val), |=)
+# endif
+# ifndef __this_cpu_or_2
+#  define __this_cpu_or_2(pcp, val)	__this_cpu_generic_to_op((pcp), (val), |=)
+# endif
+# ifndef __this_cpu_or_4
+#  define __this_cpu_or_4(pcp, val)	__this_cpu_generic_to_op((pcp), (val), |=)
+# endif
+# ifndef __this_cpu_or_8
+#  define __this_cpu_or_8(pcp, val)	__this_cpu_generic_to_op((pcp), (val), |=)
+# endif
+# define __this_cpu_or(pcp, val)	__size_call(__this_cpu_or_, (pcp), (val))
+#endif
+
+#ifndef __this_cpu_xor
+# ifndef __this_cpu_xor_1
+#  define __this_cpu_xor_1(pcp, val)	__this_cpu_generic_to_op((pcp), (val), ^=)
+# endif
+# ifndef __this_cpu_xor_2
+#  define __this_cpu_xor_2(pcp, val)	__this_cpu_generic_to_op((pcp), (val), ^=)
+# endif
+# ifndef __this_cpu_xor_4
+#  define __this_cpu_xor_4(pcp, val)	__this_cpu_generic_to_op((pcp), (val), ^=)
+# endif
+# ifndef __this_cpu_xor_8
+#  define __this_cpu_xor_8(pcp, val)	__this_cpu_generic_to_op((pcp), (val), ^=)
+# endif
+# define __this_cpu_xor(pcp, val)	__size_call(__this_cpu_xor_, (pcp), (val))
+#endif
+
+/*
+ * IRQ safe versions of the per cpu RMW operations. Note that these operations
+ * are *not* safe against modification of the same variable from another
+ * processors (which one gets when using regular atomic operations)
+ . They are guaranteed to be atomic vs. local interrupts and
+ * preemption only.
+ */
+#define irqsafe_cpu_generic_to_op(pcp, val, op)				\
+do {									\
+	unsigned long flags;						\
+	local_irq_save(flags);						\
+	*__this_cpu_ptr(&(pcp)) op val;					\
+	local_irq_restore(flags);					\
+} while (0)
+
+#ifndef irqsafe_cpu_add
+# ifndef irqsafe_cpu_add_1
+#  define irqsafe_cpu_add_1(pcp, val) irqsafe_cpu_generic_to_op((pcp), (val), +=)
+# endif
+# ifndef irqsafe_cpu_add_2
+#  define irqsafe_cpu_add_2(pcp, val) irqsafe_cpu_generic_to_op((pcp), (val), +=)
+# endif
+# ifndef irqsafe_cpu_add_4
+#  define irqsafe_cpu_add_4(pcp, val) irqsafe_cpu_generic_to_op((pcp), (val), +=)
+# endif
+# ifndef irqsafe_cpu_add_8
+#  define irqsafe_cpu_add_8(pcp, val) irqsafe_cpu_generic_to_op((pcp), (val), +=)
+# endif
+# define irqsafe_cpu_add(pcp, val) __size_call(irqsafe_cpu_add_, (pcp), (val))
+#endif
+
+#ifndef irqsafe_cpu_sub
+# define irqsafe_cpu_sub(pcp, val)	irqsafe_cpu_add((pcp), -(val))
+#endif
+
+#ifndef irqsafe_cpu_inc
+# define irqsafe_cpu_inc(pcp)	irqsafe_cpu_add((pcp), 1)
+#endif
+
+#ifndef irqsafe_cpu_dec
+# define irqsafe_cpu_dec(pcp)	irqsafe_cpu_sub((pcp), 1)
+#endif
+
+#ifndef irqsafe_cpu_and
+# ifndef irqsafe_cpu_and_1
+#  define irqsafe_cpu_and_1(pcp, val) irqsafe_cpu_generic_to_op((pcp), (val), &=)
+# endif
+# ifndef irqsafe_cpu_and_2
+#  define irqsafe_cpu_and_2(pcp, val) irqsafe_cpu_generic_to_op((pcp), (val), &=)
+# endif
+# ifndef irqsafe_cpu_and_4
+#  define irqsafe_cpu_and_4(pcp, val) irqsafe_cpu_generic_to_op((pcp), (val), &=)
+# endif
+# ifndef irqsafe_cpu_and_8
+#  define irqsafe_cpu_and_8(pcp, val) irqsafe_cpu_generic_to_op((pcp), (val), &=)
+# endif
+# define irqsafe_cpu_and(pcp, val) __size_call(irqsafe_cpu_and_, (val))
+#endif
+
+#ifndef irqsafe_cpu_or
+# ifndef irqsafe_cpu_or_1
+#  define irqsafe_cpu_or_1(pcp, val) irqsafe_cpu_generic_to_op((pcp), (val), |=)
+# endif
+# ifndef irqsafe_cpu_or_2
+#  define irqsafe_cpu_or_2(pcp, val) irqsafe_cpu_generic_to_op((pcp), (val), |=)
+# endif
+# ifndef irqsafe_cpu_or_4
+#  define irqsafe_cpu_or_4(pcp, val) irqsafe_cpu_generic_to_op((pcp), (val), |=)
+# endif
+# ifndef irqsafe_cpu_or_8
+#  define irqsafe_cpu_or_8(pcp, val) irqsafe_cpu_generic_to_op((pcp), (val), |=)
+# endif
+# define irqsafe_cpu_or(pcp, val) __size_call(irqsafe_cpu_or_, (val))
+#endif
+
+#ifndef irqsafe_cpu_xor
+# ifndef irqsafe_cpu_xor_1
+#  define irqsafe_cpu_xor_1(pcp, val) irqsafe_cpu_generic_to_op((pcp), (val), ^=)
+# endif
+# ifndef irqsafe_cpu_xor_2
+#  define irqsafe_cpu_xor_2(pcp, val) irqsafe_cpu_generic_to_op((pcp), (val), ^=)
+# endif
+# ifndef irqsafe_cpu_xor_4
+#  define irqsafe_cpu_xor_4(pcp, val) irqsafe_cpu_generic_to_op((pcp), (val), ^=)
+# endif
+# ifndef irqsafe_cpu_xor_8
+#  define irqsafe_cpu_xor_8(pcp, val) irqsafe_cpu_generic_to_op((pcp), (val), ^=)
+# endif
+# define irqsafe_cpu_xor(pcp, val) __size_call(irqsafe_cpu_xor_, (val))
+#endif
+
 #endif /* __LINUX_PERCPU_H */
Index: linux-2.6/include/asm-generic/percpu.h
===================================================================
--- linux-2.6.orig/include/asm-generic/percpu.h	2009-10-01 14:14:00.000000000 -0500
+++ linux-2.6/include/asm-generic/percpu.h	2009-10-01 14:14:02.000000000 -0500
@@ -56,6 +56,9 @@ extern unsigned long __per_cpu_offset[NR
 #define __raw_get_cpu_var(var) \
 	(*SHIFT_PERCPU_PTR(&per_cpu_var(var), __my_cpu_offset))
 
+#define this_cpu_ptr(ptr) SHIFT_PERCPU_PTR(ptr, my_cpu_offset)
+#define __this_cpu_ptr(ptr) SHIFT_PERCPU_PTR(ptr, __my_cpu_offset)
+
 
 #ifdef CONFIG_HAVE_SETUP_PER_CPU_AREA
 extern void setup_per_cpu_areas(void);
@@ -66,6 +69,8 @@ extern void setup_per_cpu_areas(void);
 #define per_cpu(var, cpu)			(*((void)(cpu), &per_cpu_var(var)))
 #define __get_cpu_var(var)			per_cpu_var(var)
 #define __raw_get_cpu_var(var)			per_cpu_var(var)
+#define this_cpu_ptr(ptr) per_cpu_ptr(ptr, 0)
+#define __this_cpu_ptr(ptr) this_cpu_ptr(ptr)
 
 #endif	/* SMP */
 

-- 

^ permalink raw reply	[flat|nested] 65+ messages in thread

* Re: [this_cpu_xx V4 01/20] Introduce this_cpu_ptr() and generic this_cpu_* operations
  2009-10-01 21:25 ` [this_cpu_xx V4 01/20] Introduce this_cpu_ptr() and generic this_cpu_* operations cl
@ 2009-10-02  9:16   ` Tejun Heo
  2009-10-02  9:34   ` Ingo Molnar
  1 sibling, 0 replies; 65+ messages in thread
From: Tejun Heo @ 2009-10-02  9:16 UTC (permalink / raw)
  To: cl
  Cc: akpm, linux-kernel, David Howells, Ingo Molnar, Rusty Russell,
	Eric Dumazet, Pekka Enberg

Hello,

cl@linux-foundation.org wrote:
> This patch introduces two things: First this_cpu_ptr and then per cpu
> atomic operations.

I'm still not quite sure about the lvalue parameter but given that
get/put_user() is already using it, I don't think my indecisiveness
warrants NACK, so...

Acked-by: Tejun Heo <tj@kernel.org>

-- 
tejun

^ permalink raw reply	[flat|nested] 65+ messages in thread

* Re: [this_cpu_xx V4 01/20] Introduce this_cpu_ptr() and generic this_cpu_* operations
  2009-10-01 21:25 ` [this_cpu_xx V4 01/20] Introduce this_cpu_ptr() and generic this_cpu_* operations cl
  2009-10-02  9:16   ` Tejun Heo
@ 2009-10-02  9:34   ` Ingo Molnar
  2009-10-02 17:11     ` Christoph Lameter
  1 sibling, 1 reply; 65+ messages in thread
From: Ingo Molnar @ 2009-10-02  9:34 UTC (permalink / raw)
  To: cl
  Cc: akpm, linux-kernel, David Howells, Tejun Heo, Rusty Russell,
	Eric Dumazet, Pekka Enberg


* cl@linux-foundation.org <cl@linux-foundation.org> wrote:

> --- linux-2.6.orig/include/asm-generic/percpu.h	2009-10-01 14:14:00.000000000 -0500
> +++ linux-2.6/include/asm-generic/percpu.h	2009-10-01 14:14:02.000000000 -0500

> @@ -66,6 +69,8 @@ extern void setup_per_cpu_areas(void);
>  #define per_cpu(var, cpu)			(*((void)(cpu), &per_cpu_var(var)))
>  #define __get_cpu_var(var)			per_cpu_var(var)
>  #define __raw_get_cpu_var(var)			per_cpu_var(var)
> +#define this_cpu_ptr(ptr) per_cpu_ptr(ptr, 0)
> +#define __this_cpu_ptr(ptr) this_cpu_ptr(ptr)

Small detail: please have a look at the existing vertical alignment 
style of the code there and follow it with new entries.

Thanks,

	Ingo

^ permalink raw reply	[flat|nested] 65+ messages in thread

* Re: [this_cpu_xx V4 01/20] Introduce this_cpu_ptr() and generic this_cpu_* operations
  2009-10-02  9:34   ` Ingo Molnar
@ 2009-10-02 17:11     ` Christoph Lameter
  2009-10-06 10:04       ` Rusty Russell
  0 siblings, 1 reply; 65+ messages in thread
From: Christoph Lameter @ 2009-10-02 17:11 UTC (permalink / raw)
  To: Ingo Molnar
  Cc: akpm, linux-kernel, David Howells, Tejun Heo, Rusty Russell,
	Eric Dumazet, Pekka Enberg

On Fri, 2 Oct 2009, Ingo Molnar wrote:

> > @@ -66,6 +69,8 @@ extern void setup_per_cpu_areas(void);
> >  #define per_cpu(var, cpu)			(*((void)(cpu), &per_cpu_var(var)))
> >  #define __get_cpu_var(var)			per_cpu_var(var)
> >  #define __raw_get_cpu_var(var)			per_cpu_var(var)
> > +#define this_cpu_ptr(ptr) per_cpu_ptr(ptr, 0)
> > +#define __this_cpu_ptr(ptr) this_cpu_ptr(ptr)
>
> Small detail: please have a look at the existing vertical alignment
> style of the code there and follow it with new entries.

Ok. Fixed.


^ permalink raw reply	[flat|nested] 65+ messages in thread

* Re: [this_cpu_xx V4 01/20] Introduce this_cpu_ptr() and generic this_cpu_* operations
  2009-10-02 17:11     ` Christoph Lameter
@ 2009-10-06 10:04       ` Rusty Russell
  2009-10-06 23:39         ` Christoph Lameter
                           ` (3 more replies)
  0 siblings, 4 replies; 65+ messages in thread
From: Rusty Russell @ 2009-10-06 10:04 UTC (permalink / raw)
  To: Christoph Lameter
  Cc: Ingo Molnar, akpm, linux-kernel, David Howells, Tejun Heo,
	Eric Dumazet, Pekka Enberg

On Sat, 3 Oct 2009 02:41:54 am Christoph Lameter wrote:
> On Fri, 2 Oct 2009, Ingo Molnar wrote:
> 
> > > @@ -66,6 +69,8 @@ extern void setup_per_cpu_areas(void);
> > >  #define per_cpu(var, cpu)			(*((void)(cpu), &per_cpu_var(var)))
> > >  #define __get_cpu_var(var)			per_cpu_var(var)
> > >  #define __raw_get_cpu_var(var)			per_cpu_var(var)
> > > +#define this_cpu_ptr(ptr) per_cpu_ptr(ptr, 0)
> > > +#define __this_cpu_ptr(ptr) this_cpu_ptr(ptr)

I still think that it's not symmetrical: get_cpu_var implies get_cpu_ptr; there's no
"this" in any Linux API until now.

OTOH, this_cpu_<op> makes much more sense than cpu_<op> or something, so I'm
not really going to complain.

Does this mean we can kill local.h soon?

Thanks for all this!
Rusty.

^ permalink raw reply	[flat|nested] 65+ messages in thread

* Re: [this_cpu_xx V4 01/20] Introduce this_cpu_ptr() and generic this_cpu_* operations
  2009-10-06 10:04       ` Rusty Russell
@ 2009-10-06 23:39         ` Christoph Lameter
  2009-10-06 23:55         ` Tejun Heo
                           ` (2 subsequent siblings)
  3 siblings, 0 replies; 65+ messages in thread
From: Christoph Lameter @ 2009-10-06 23:39 UTC (permalink / raw)
  To: Rusty Russell
  Cc: Ingo Molnar, akpm, linux-kernel, David Howells, Tejun Heo,
	Eric Dumazet, Pekka Enberg

On Tue, 6 Oct 2009, Rusty Russell wrote:

> Does this mean we can kill local.h soon?

Yes if you let me ... Last time Andrew had some concerns so I hid my
patches that remove local.h ;-).




^ permalink raw reply	[flat|nested] 65+ messages in thread

* Re: [this_cpu_xx V4 01/20] Introduce this_cpu_ptr() and generic this_cpu_* operations
  2009-10-06 10:04       ` Rusty Russell
  2009-10-06 23:39         ` Christoph Lameter
@ 2009-10-06 23:55         ` Tejun Heo
  2009-10-08 17:57         ` [Patchs vs. percpu-next] Use this_cpu_xx to dynamically allocate counters Christoph Lameter
  2009-10-08 18:06         ` Christoph Lameter
  3 siblings, 0 replies; 65+ messages in thread
From: Tejun Heo @ 2009-10-06 23:55 UTC (permalink / raw)
  To: Rusty Russell
  Cc: Christoph Lameter, Ingo Molnar, akpm, linux-kernel, David Howells,
	Eric Dumazet, Pekka Enberg

Hello,

Rusty Russell wrote:
> I still think that it's not symmetrical: get_cpu_var implies
> get_cpu_ptr; there's no "this" in any Linux API until now.

Yeah, the naming of percpu related stuff is a big mess. :-( Given that
percpu variables aren't being used too widely at this time, cleaning
up the API is an option.  What do you think?  Any good proposal on
mind?

Thanks.

-- 
tejun

^ permalink raw reply	[flat|nested] 65+ messages in thread

* [Patchs vs. percpu-next] Use this_cpu_xx to dynamically allocate counters
  2009-10-06 10:04       ` Rusty Russell
  2009-10-06 23:39         ` Christoph Lameter
  2009-10-06 23:55         ` Tejun Heo
@ 2009-10-08 17:57         ` Christoph Lameter
  2009-10-13 11:51           ` Rusty Russell
  2009-10-08 18:06         ` Christoph Lameter
  3 siblings, 1 reply; 65+ messages in thread
From: Christoph Lameter @ 2009-10-08 17:57 UTC (permalink / raw)
  To: Rusty Russell
  Cc: Ingo Molnar, akpm, linux-kernel, David Howells, Tejun Heo,
	Eric Dumazet, Pekka Enberg

On Tue, 6 Oct 2009, Rusty Russell wrote:

> Does this mean we can kill local.h soon?

Can we remove local.h from modules?



Subject: Module handling: Use this_cpu_xx to dynamically allocate counters

Use cpu ops to deal with the per cpu data instead of a local_t. Reduces memory
requirements, cache footprint and decreases cycle counts.

The this_cpu_xx operations are also used for !SMP mode. Otherwise we could
not drop the use of __module_ref_addr() which would make per cpu data handling
complicated. this_cpu_xx operations have their own fallback for !SMP.

Signed-off-by: Christoph Lameter <cl@linux-foundation.org>

---
 include/linux/module.h     |   36 ++++++++++++------------------------
 kernel/module.c            |   30 ++++++++++++++++--------------
 kernel/trace/ring_buffer.c |    1 +
 3 files changed, 29 insertions(+), 38 deletions(-)

Index: linux-2.6/include/linux/module.h
===================================================================
--- linux-2.6.orig/include/linux/module.h	2009-10-08 11:36:05.000000000 -0500
+++ linux-2.6/include/linux/module.h	2009-10-08 11:40:57.000000000 -0500
@@ -16,8 +16,7 @@
 #include <linux/kobject.h>
 #include <linux/moduleparam.h>
 #include <linux/tracepoint.h>
-
-#include <asm/local.h>
+#include <linux/percpu.h>
 #include <asm/module.h>

 #include <trace/events/module.h>
@@ -361,11 +360,9 @@ struct module
 	/* Destruction function. */
 	void (*exit)(void);

-#ifdef CONFIG_SMP
-	char *refptr;
-#else
-	local_t ref;
-#endif
+	struct module_ref {
+		int count;
+	} *refptr;
 #endif

 #ifdef CONFIG_CONSTRUCTORS
@@ -452,25 +449,16 @@ void __symbol_put(const char *symbol);
 #define symbol_put(x) __symbol_put(MODULE_SYMBOL_PREFIX #x)
 void symbol_put_addr(void *addr);

-static inline local_t *__module_ref_addr(struct module *mod, int cpu)
-{
-#ifdef CONFIG_SMP
-	return (local_t *) (mod->refptr + per_cpu_offset(cpu));
-#else
-	return &mod->ref;
-#endif
-}
-
 /* Sometimes we know we already have a refcount, and it's easier not
    to handle the error case (which only happens with rmmod --wait). */
 static inline void __module_get(struct module *module)
 {
 	if (module) {
-		unsigned int cpu = get_cpu();
-		local_inc(__module_ref_addr(module, cpu));
+		preempt_disable();
+		__this_cpu_inc(module->refptr->count);
 		trace_module_get(module, _THIS_IP_,
-				 local_read(__module_ref_addr(module, cpu)));
-		put_cpu();
+				 __this_cpu_read(module->refptr->count));
+		preempt_enable();
 	}
 }

@@ -479,15 +467,15 @@ static inline int try_module_get(struct
 	int ret = 1;

 	if (module) {
-		unsigned int cpu = get_cpu();
 		if (likely(module_is_live(module))) {
-			local_inc(__module_ref_addr(module, cpu));
+			preempt_disable();
+			__this_cpu_inc(module->refptr->count);
 			trace_module_get(module, _THIS_IP_,
-				local_read(__module_ref_addr(module, cpu)));
+				__this_cpu_read(module->refptr->count));
+			preempt_enable();
 		}
 		else
 			ret = 0;
-		put_cpu();
 	}
 	return ret;
 }
Index: linux-2.6/kernel/module.c
===================================================================
--- linux-2.6.orig/kernel/module.c	2009-10-08 11:36:05.000000000 -0500
+++ linux-2.6/kernel/module.c	2009-10-08 11:42:05.000000000 -0500
@@ -474,9 +474,10 @@ static void module_unload_init(struct mo

 	INIT_LIST_HEAD(&mod->modules_which_use_me);
 	for_each_possible_cpu(cpu)
-		local_set(__module_ref_addr(mod, cpu), 0);
+		per_cpu_ptr(mod->refptr, cpu)->count = 0;
+
 	/* Hold reference count during initialization. */
-	local_set(__module_ref_addr(mod, raw_smp_processor_id()), 1);
+	__this_cpu_write(mod->refptr->count, 1);
 	/* Backwards compatibility macros put refcount during init. */
 	mod->waiter = current;
 }
@@ -555,6 +556,7 @@ static void module_unload_free(struct mo
 				kfree(use);
 				sysfs_remove_link(i->holders_dir, mod->name);
 				/* There can be at most one match. */
+				free_percpu(i->refptr);
 				break;
 			}
 		}
@@ -619,7 +621,7 @@ unsigned int module_refcount(struct modu
 	int cpu;

 	for_each_possible_cpu(cpu)
-		total += local_read(__module_ref_addr(mod, cpu));
+		total += per_cpu_ptr(mod->refptr, cpu)->count;
 	return total;
 }
 EXPORT_SYMBOL(module_refcount);
@@ -796,14 +798,15 @@ static struct module_attribute refcnt =
 void module_put(struct module *module)
 {
 	if (module) {
-		unsigned int cpu = get_cpu();
-		local_dec(__module_ref_addr(module, cpu));
+		preempt_disable();
+		__this_cpu_dec(module->refptr->count);
+
 		trace_module_put(module, _RET_IP_,
-				 local_read(__module_ref_addr(module, cpu)));
+				 __this_cpu_read(module->refptr->count));
 		/* Maybe they're waiting for us to drop reference? */
 		if (unlikely(!module_is_live(module)))
 			wake_up_process(module->waiter);
-		put_cpu();
+		preempt_enable();
 	}
 }
 EXPORT_SYMBOL(module_put);
@@ -1377,9 +1380,9 @@ static void free_module(struct module *m
 	kfree(mod->args);
 	if (mod->percpu)
 		percpu_modfree(mod->percpu);
-#if defined(CONFIG_MODULE_UNLOAD) && defined(CONFIG_SMP)
+#if defined(CONFIG_MODULE_UNLOAD)
 	if (mod->refptr)
-		percpu_modfree(mod->refptr);
+		free_percpu(mod->refptr);
 #endif
 	/* Free lock-classes: */
 	lockdep_free_key_range(mod->module_core, mod->core_size);
@@ -2145,9 +2148,8 @@ static noinline struct module *load_modu
 	mod = (void *)sechdrs[modindex].sh_addr;
 	kmemleak_load_module(mod, hdr, sechdrs, secstrings);

-#if defined(CONFIG_MODULE_UNLOAD) && defined(CONFIG_SMP)
-	mod->refptr = percpu_modalloc(sizeof(local_t), __alignof__(local_t),
-				      mod->name);
+#if defined(CONFIG_MODULE_UNLOAD)
+	mod->refptr = alloc_percpu(struct module_ref);
 	if (!mod->refptr) {
 		err = -ENOMEM;
 		goto free_init;
@@ -2373,8 +2375,8 @@ static noinline struct module *load_modu
 	kobject_put(&mod->mkobj.kobj);
  free_unload:
 	module_unload_free(mod);
-#if defined(CONFIG_MODULE_UNLOAD) && defined(CONFIG_SMP)
-	percpu_modfree(mod->refptr);
+#if defined(CONFIG_MODULE_UNLOAD)
+	free_percpu(mod->refptr);
  free_init:
 #endif
 	module_free(mod, mod->module_init);
Index: linux-2.6/kernel/trace/ring_buffer.c
===================================================================
--- linux-2.6.orig/kernel/trace/ring_buffer.c	2009-10-08 12:46:29.000000000 -0500
+++ linux-2.6/kernel/trace/ring_buffer.c	2009-10-08 12:46:46.000000000 -0500
@@ -12,6 +12,7 @@
 #include <linux/hardirq.h>
 #include <linux/kmemcheck.h>
 #include <linux/module.h>
+#include <asm/local.h>
 #include <linux/percpu.h>
 #include <linux/mutex.h>
 #include <linux/init.h>

^ permalink raw reply	[flat|nested] 65+ messages in thread

* Re: [Patchs vs. percpu-next] Use this_cpu_xx to dynamically allocate counters
  2009-10-08 17:57         ` [Patchs vs. percpu-next] Use this_cpu_xx to dynamically allocate counters Christoph Lameter
@ 2009-10-13 11:51           ` Rusty Russell
  0 siblings, 0 replies; 65+ messages in thread
From: Rusty Russell @ 2009-10-13 11:51 UTC (permalink / raw)
  To: Christoph Lameter
  Cc: Ingo Molnar, akpm, linux-kernel, David Howells, Tejun Heo,
	Eric Dumazet, Pekka Enberg

On Fri, 9 Oct 2009 04:27:53 am Christoph Lameter wrote:
> On Tue, 6 Oct 2009, Rusty Russell wrote:
> 
> > Does this mean we can kill local.h soon?
> 
> Can we remove local.h from modules?

This looks sweet to me!

Only comment: not sure struct module_ref is required once we have __percpu
markers.

Acked-by: Rusty Russell <rusty@rustcorp.com.au>

Thanks!
Rusty.

^ permalink raw reply	[flat|nested] 65+ messages in thread

* (no subject)
  2009-10-06 10:04       ` Rusty Russell
                           ` (2 preceding siblings ...)
  2009-10-08 17:57         ` [Patchs vs. percpu-next] Use this_cpu_xx to dynamically allocate counters Christoph Lameter
@ 2009-10-08 18:06         ` Christoph Lameter
  3 siblings, 0 replies; 65+ messages in thread
From: Christoph Lameter @ 2009-10-08 18:06 UTC (permalink / raw)
  To: Rusty Russell
  Cc: Ingo Molnar, akpm, linux-kernel, David Howells, Tejun Heo,
	Eric Dumazet, Pekka Enberg

On Tue, 6 Oct 2009, Rusty Russell wrote:

> Does this mean we can kill local.h soon?

We can kill cpu_local_xx right now. this_cpu_xx is a superset of that
functionality and cpu_local_xx is not used at all. We have to wait for the
merging of other patches to do more.


Subject: [percpu next] Remove cpu_local_xx macros

These macros have not been used for awhile now.

Signed-off-by: Christoph Lameter <cl@linux-foundation.org>


---
 arch/alpha/include/asm/local.h   |   17 -----------------
 arch/m32r/include/asm/local.h    |   25 -------------------------
 arch/mips/include/asm/local.h    |   25 -------------------------
 arch/powerpc/include/asm/local.h |   25 -------------------------
 arch/x86/include/asm/local.h     |   37 -------------------------------------
 include/asm-generic/local.h      |   19 -------------------
 6 files changed, 148 deletions(-)

Index: linux-2.6/arch/alpha/include/asm/local.h
===================================================================
--- linux-2.6.orig/arch/alpha/include/asm/local.h	2009-10-08 12:57:38.000000000 -0500
+++ linux-2.6/arch/alpha/include/asm/local.h	2009-10-08 12:57:40.000000000 -0500
@@ -98,21 +98,4 @@ static __inline__ long local_sub_return(
 #define __local_add(i,l)	((l)->a.counter+=(i))
 #define __local_sub(i,l)	((l)->a.counter-=(i))

-/* Use these for per-cpu local_t variables: on some archs they are
- * much more efficient than these naive implementations.  Note they take
- * a variable, not an address.
- */
-#define cpu_local_read(l)	local_read(&__get_cpu_var(l))
-#define cpu_local_set(l, i)	local_set(&__get_cpu_var(l), (i))
-
-#define cpu_local_inc(l)	local_inc(&__get_cpu_var(l))
-#define cpu_local_dec(l)	local_dec(&__get_cpu_var(l))
-#define cpu_local_add(i, l)	local_add((i), &__get_cpu_var(l))
-#define cpu_local_sub(i, l)	local_sub((i), &__get_cpu_var(l))
-
-#define __cpu_local_inc(l)	__local_inc(&__get_cpu_var(l))
-#define __cpu_local_dec(l)	__local_dec(&__get_cpu_var(l))
-#define __cpu_local_add(i, l)	__local_add((i), &__get_cpu_var(l))
-#define __cpu_local_sub(i, l)	__local_sub((i), &__get_cpu_var(l))
-
 #endif /* _ALPHA_LOCAL_H */
Index: linux-2.6/arch/m32r/include/asm/local.h
===================================================================
--- linux-2.6.orig/arch/m32r/include/asm/local.h	2009-10-08 12:57:38.000000000 -0500
+++ linux-2.6/arch/m32r/include/asm/local.h	2009-10-08 12:57:40.000000000 -0500
@@ -338,29 +338,4 @@ static inline void local_set_mask(unsign
  * a variable, not an address.
  */

-/* Need to disable preemption for the cpu local counters otherwise we could
-   still access a variable of a previous CPU in a non local way. */
-#define cpu_local_wrap_v(l)	 	\
-	({ local_t res__;		\
-	   preempt_disable(); 		\
-	   res__ = (l);			\
-	   preempt_enable();		\
-	   res__; })
-#define cpu_local_wrap(l)		\
-	({ preempt_disable();		\
-	   l;				\
-	   preempt_enable(); })		\
-
-#define cpu_local_read(l)    cpu_local_wrap_v(local_read(&__get_cpu_var(l)))
-#define cpu_local_set(l, i)  cpu_local_wrap(local_set(&__get_cpu_var(l), (i)))
-#define cpu_local_inc(l)     cpu_local_wrap(local_inc(&__get_cpu_var(l)))
-#define cpu_local_dec(l)     cpu_local_wrap(local_dec(&__get_cpu_var(l)))
-#define cpu_local_add(i, l)  cpu_local_wrap(local_add((i), &__get_cpu_var(l)))
-#define cpu_local_sub(i, l)  cpu_local_wrap(local_sub((i), &__get_cpu_var(l)))
-
-#define __cpu_local_inc(l)	cpu_local_inc(l)
-#define __cpu_local_dec(l)	cpu_local_dec(l)
-#define __cpu_local_add(i, l)	cpu_local_add((i), (l))
-#define __cpu_local_sub(i, l)	cpu_local_sub((i), (l))
-
 #endif /* __M32R_LOCAL_H */
Index: linux-2.6/arch/mips/include/asm/local.h
===================================================================
--- linux-2.6.orig/arch/mips/include/asm/local.h	2009-10-08 12:57:38.000000000 -0500
+++ linux-2.6/arch/mips/include/asm/local.h	2009-10-08 12:57:40.000000000 -0500
@@ -193,29 +193,4 @@ static __inline__ long local_sub_return(
 #define __local_add(i, l)	((l)->a.counter+=(i))
 #define __local_sub(i, l)	((l)->a.counter-=(i))

-/* Need to disable preemption for the cpu local counters otherwise we could
-   still access a variable of a previous CPU in a non atomic way. */
-#define cpu_local_wrap_v(l)	 	\
-	({ local_t res__;		\
-	   preempt_disable(); 		\
-	   res__ = (l);			\
-	   preempt_enable();		\
-	   res__; })
-#define cpu_local_wrap(l)		\
-	({ preempt_disable();		\
-	   l;				\
-	   preempt_enable(); })		\
-
-#define cpu_local_read(l)    cpu_local_wrap_v(local_read(&__get_cpu_var(l)))
-#define cpu_local_set(l, i)  cpu_local_wrap(local_set(&__get_cpu_var(l), (i)))
-#define cpu_local_inc(l)     cpu_local_wrap(local_inc(&__get_cpu_var(l)))
-#define cpu_local_dec(l)     cpu_local_wrap(local_dec(&__get_cpu_var(l)))
-#define cpu_local_add(i, l)  cpu_local_wrap(local_add((i), &__get_cpu_var(l)))
-#define cpu_local_sub(i, l)  cpu_local_wrap(local_sub((i), &__get_cpu_var(l)))
-
-#define __cpu_local_inc(l)	cpu_local_inc(l)
-#define __cpu_local_dec(l)	cpu_local_dec(l)
-#define __cpu_local_add(i, l)	cpu_local_add((i), (l))
-#define __cpu_local_sub(i, l)	cpu_local_sub((i), (l))
-
 #endif /* _ARCH_MIPS_LOCAL_H */
Index: linux-2.6/arch/powerpc/include/asm/local.h
===================================================================
--- linux-2.6.orig/arch/powerpc/include/asm/local.h	2009-10-08 12:57:38.000000000 -0500
+++ linux-2.6/arch/powerpc/include/asm/local.h	2009-10-08 12:57:40.000000000 -0500
@@ -172,29 +172,4 @@ static __inline__ long local_dec_if_posi
 #define __local_add(i,l)	((l)->a.counter+=(i))
 #define __local_sub(i,l)	((l)->a.counter-=(i))

-/* Need to disable preemption for the cpu local counters otherwise we could
-   still access a variable of a previous CPU in a non atomic way. */
-#define cpu_local_wrap_v(l)	 	\
-	({ local_t res__;		\
-	   preempt_disable(); 		\
-	   res__ = (l);			\
-	   preempt_enable();		\
-	   res__; })
-#define cpu_local_wrap(l)		\
-	({ preempt_disable();		\
-	   l;				\
-	   preempt_enable(); })		\
-
-#define cpu_local_read(l)    cpu_local_wrap_v(local_read(&__get_cpu_var(l)))
-#define cpu_local_set(l, i)  cpu_local_wrap(local_set(&__get_cpu_var(l), (i)))
-#define cpu_local_inc(l)     cpu_local_wrap(local_inc(&__get_cpu_var(l)))
-#define cpu_local_dec(l)     cpu_local_wrap(local_dec(&__get_cpu_var(l)))
-#define cpu_local_add(i, l)  cpu_local_wrap(local_add((i), &__get_cpu_var(l)))
-#define cpu_local_sub(i, l)  cpu_local_wrap(local_sub((i), &__get_cpu_var(l)))
-
-#define __cpu_local_inc(l)	cpu_local_inc(l)
-#define __cpu_local_dec(l)	cpu_local_dec(l)
-#define __cpu_local_add(i, l)	cpu_local_add((i), (l))
-#define __cpu_local_sub(i, l)	cpu_local_sub((i), (l))
-
 #endif /* _ARCH_POWERPC_LOCAL_H */
Index: linux-2.6/arch/x86/include/asm/local.h
===================================================================
--- linux-2.6.orig/arch/x86/include/asm/local.h	2009-10-08 12:57:38.000000000 -0500
+++ linux-2.6/arch/x86/include/asm/local.h	2009-10-08 12:57:40.000000000 -0500
@@ -195,41 +195,4 @@ static inline long local_sub_return(long
 #define __local_add(i, l)	local_add((i), (l))
 #define __local_sub(i, l)	local_sub((i), (l))

-/* Use these for per-cpu local_t variables: on some archs they are
- * much more efficient than these naive implementations.  Note they take
- * a variable, not an address.
- *
- * X86_64: This could be done better if we moved the per cpu data directly
- * after GS.
- */
-
-/* Need to disable preemption for the cpu local counters otherwise we could
-   still access a variable of a previous CPU in a non atomic way. */
-#define cpu_local_wrap_v(l)		\
-({					\
-	local_t res__;			\
-	preempt_disable(); 		\
-	res__ = (l);			\
-	preempt_enable();		\
-	res__;				\
-})
-#define cpu_local_wrap(l)		\
-({					\
-	preempt_disable();		\
-	(l);				\
-	preempt_enable();		\
-})					\
-
-#define cpu_local_read(l)    cpu_local_wrap_v(local_read(&__get_cpu_var((l))))
-#define cpu_local_set(l, i)  cpu_local_wrap(local_set(&__get_cpu_var((l)), (i)))
-#define cpu_local_inc(l)     cpu_local_wrap(local_inc(&__get_cpu_var((l))))
-#define cpu_local_dec(l)     cpu_local_wrap(local_dec(&__get_cpu_var((l))))
-#define cpu_local_add(i, l)  cpu_local_wrap(local_add((i), &__get_cpu_var((l))))
-#define cpu_local_sub(i, l)  cpu_local_wrap(local_sub((i), &__get_cpu_var((l))))
-
-#define __cpu_local_inc(l)	cpu_local_inc((l))
-#define __cpu_local_dec(l)	cpu_local_dec((l))
-#define __cpu_local_add(i, l)	cpu_local_add((i), (l))
-#define __cpu_local_sub(i, l)	cpu_local_sub((i), (l))
-
 #endif /* _ASM_X86_LOCAL_H */
Index: linux-2.6/include/asm-generic/local.h
===================================================================
--- linux-2.6.orig/include/asm-generic/local.h	2009-10-08 12:57:38.000000000 -0500
+++ linux-2.6/include/asm-generic/local.h	2009-10-08 12:57:40.000000000 -0500
@@ -52,23 +52,4 @@ typedef struct
 #define __local_add(i,l)	local_set((l), local_read(l) + (i))
 #define __local_sub(i,l)	local_set((l), local_read(l) - (i))

-/* Use these for per-cpu local_t variables: on some archs they are
- * much more efficient than these naive implementations.  Note they take
- * a variable (eg. mystruct.foo), not an address.
- */
-#define cpu_local_read(l)	local_read(&__get_cpu_var(l))
-#define cpu_local_set(l, i)	local_set(&__get_cpu_var(l), (i))
-#define cpu_local_inc(l)	local_inc(&__get_cpu_var(l))
-#define cpu_local_dec(l)	local_dec(&__get_cpu_var(l))
-#define cpu_local_add(i, l)	local_add((i), &__get_cpu_var(l))
-#define cpu_local_sub(i, l)	local_sub((i), &__get_cpu_var(l))
-
-/* Non-atomic increments, ie. preemption disabled and won't be touched
- * in interrupt, etc.  Some archs can optimize this case well.
- */
-#define __cpu_local_inc(l)	__local_inc(&__get_cpu_var(l))
-#define __cpu_local_dec(l)	__local_dec(&__get_cpu_var(l))
-#define __cpu_local_add(i, l)	__local_add((i), &__get_cpu_var(l))
-#define __cpu_local_sub(i, l)	__local_sub((i), &__get_cpu_var(l))
-
 #endif /* _ASM_GENERIC_LOCAL_H */





^ permalink raw reply	[flat|nested] 65+ messages in thread

* [this_cpu_xx V4 02/20] this_cpu: X86 optimized this_cpu operations
  2009-10-01 21:25 [this_cpu_xx V4 00/20] Introduce per cpu atomic operations and avoid per cpu address arithmetic cl
  2009-10-01 21:25 ` [this_cpu_xx V4 01/20] Introduce this_cpu_ptr() and generic this_cpu_* operations cl
@ 2009-10-01 21:25 ` cl
  2009-10-02  9:18   ` Tejun Heo
  2009-10-02  9:59   ` Ingo Molnar
  2009-10-01 21:25 ` [this_cpu_xx V4 03/20] Use this_cpu operations for SNMP statistics cl
                   ` (18 subsequent siblings)
  20 siblings, 2 replies; 65+ messages in thread
From: cl @ 2009-10-01 21:25 UTC (permalink / raw)
  To: akpm; +Cc: linux-kernel, Tejun Heo, mingo, rusty, Pekka Enberg

[-- Attachment #1: this_cpu_x86_ops --]
[-- Type: text/plain, Size: 5839 bytes --]

Basically the existing percpu ops can be used for this_cpu variants that allow
operations also on dynamically allocated percpu data. However, we do not pass a
reference to a percpu variable in. Instead a dynamically or statically
allocated percpu variable is provided.

Preempt, the non preempt and the irqsafe operations generate the same code.
It will always be possible to have the requires per cpu atomicness in a single
RMW instruction with segment override on x86.

64 bit this_cpu operations are not supported on 32 bit.

Signed-off-by: Christoph Lameter <cl@linux-foundation.org>

---
 arch/x86/include/asm/percpu.h |   78 ++++++++++++++++++++++++++++++++++++++++++
 1 file changed, 78 insertions(+)

Index: linux-2.6/arch/x86/include/asm/percpu.h
===================================================================
--- linux-2.6.orig/arch/x86/include/asm/percpu.h	2009-10-01 09:08:37.000000000 -0500
+++ linux-2.6/arch/x86/include/asm/percpu.h	2009-10-01 09:29:43.000000000 -0500
@@ -153,6 +153,84 @@ do {							\
 #define percpu_or(var, val)	percpu_to_op("or", per_cpu__##var, val)
 #define percpu_xor(var, val)	percpu_to_op("xor", per_cpu__##var, val)
 
+#define __this_cpu_read_1(pcp)		percpu_from_op("mov", (pcp), "m"(pcp))
+#define __this_cpu_read_2(pcp)		percpu_from_op("mov", (pcp), "m"(pcp))
+#define __this_cpu_read_4(pcp)		percpu_from_op("mov", (pcp), "m"(pcp))
+
+#define __this_cpu_write_1(pcp, val)	percpu_to_op("mov", (pcp), val)
+#define __this_cpu_write_2(pcp, val)	percpu_to_op("mov", (pcp), val)
+#define __this_cpu_write_4(pcp, val)	percpu_to_op("mov", (pcp), val)
+#define __this_cpu_add_1(pcp, val)	percpu_to_op("add", (pcp), val)
+#define __this_cpu_add_2(pcp, val)	percpu_to_op("add", (pcp), val)
+#define __this_cpu_add_4(pcp, val)	percpu_to_op("add", (pcp), val)
+#define __this_cpu_and_1(pcp, val)	percpu_to_op("and", (pcp), val)
+#define __this_cpu_and_2(pcp, val)	percpu_to_op("and", (pcp), val)
+#define __this_cpu_and_4(pcp, val)	percpu_to_op("and", (pcp), val)
+#define __this_cpu_or_1(pcp, val)	percpu_to_op("or", (pcp), val)
+#define __this_cpu_or_2(pcp, val)	percpu_to_op("or", (pcp), val)
+#define __this_cpu_or_4(pcp, val)	percpu_to_op("or", (pcp), val)
+#define __this_cpu_xor_1(pcp, val)	percpu_to_op("xor", (pcp), val)
+#define __this_cpu_xor_2(pcp, val)	percpu_to_op("xor", (pcp), val)
+#define __this_cpu_xor_4(pcp, val)	percpu_to_op("xor", (pcp), val)
+
+#define this_cpu_read_1(pcp)		percpu_from_op("mov", (pcp), "m"(pcp))
+#define this_cpu_read_2(pcp)		percpu_from_op("mov", (pcp), "m"(pcp))
+#define this_cpu_read_4(pcp)		percpu_from_op("mov", (pcp), "m"(pcp))
+#define this_cpu_write_1(pcp, val)	percpu_to_op("mov", (pcp), val)
+#define this_cpu_write_2(pcp, val)	percpu_to_op("mov", (pcp), val)
+#define this_cpu_write_4(pcp, val)	percpu_to_op("mov", (pcp), val)
+#define this_cpu_add_1(pcp, val)	percpu_to_op("add", (pcp), val)
+#define this_cpu_add_2(pcp, val)	percpu_to_op("add", (pcp), val)
+#define this_cpu_add_4(pcp, val)	percpu_to_op("add", (pcp), val)
+#define this_cpu_and_1(pcp, val)	percpu_to_op("and", (pcp), val)
+#define this_cpu_and_2(pcp, val)	percpu_to_op("and", (pcp), val)
+#define this_cpu_and_4(pcp, val)	percpu_to_op("and", (pcp), val)
+#define this_cpu_or_1(pcp, val)		percpu_to_op("or", (pcp), val)
+#define this_cpu_or_2(pcp, val)		percpu_to_op("or", (pcp), val)
+#define this_cpu_or_4(pcp, val)		percpu_to_op("or", (pcp), val)
+#define this_cpu_xor_1(pcp, val)	percpu_to_op("xor", (pcp), val)
+#define this_cpu_xor_2(pcp, val)	percpu_to_op("xor", (pcp), val)
+#define this_cpu_xor_4(pcp, val)	percpu_to_op("xor", (pcp), val)
+
+#define irqsafe_cpu_add_1(pcp, val)	percpu_to_op("add", (pcp), val)
+#define irqsafe_cpu_add_2(pcp, val)	percpu_to_op("add", (pcp), val)
+#define irqsafe_cpu_add_4(pcp, val)	percpu_to_op("add", (pcp), val)
+#define irqsafe_cpu_and_1(pcp, val)	percpu_to_op("and", (pcp), val)
+#define irqsafe_cpu_and_2(pcp, val)	percpu_to_op("and", (pcp), val)
+#define irqsafe_cpu_and_4(pcp, val)	percpu_to_op("and", (pcp), val)
+#define irqsafe_cpu_or_1(pcp, val)	percpu_to_op("or", (pcp), val)
+#define irqsafe_cpu_or_2(pcp, val)	percpu_to_op("or", (pcp), val)
+#define irqsafe_cpu_or_4(pcp, val)	percpu_to_op("or", (pcp), val)
+#define irqsafe_cpu_xor_1(pcp, val)	percpu_to_op("xor", (pcp), val)
+#define irqsafe_cpu_xor_2(pcp, val)	percpu_to_op("xor", (pcp), val)
+#define irqsafe_cpu_xor_4(pcp, val)	percpu_to_op("xor", (pcp), val)
+
+/*
+ * Per cpu atomic 64 bit operations are only available under 64 bit.
+ * 32 bit must fall back to generic operations.
+ */
+#ifdef CONFIG_X86_64
+#define __this_cpu_read_8(pcp)		percpu_from_op("mov", (pcp), "m"(pcp))
+#define __this_cpu_write_8(pcp, val)	percpu_to_op("mov", (pcp), val)
+#define __this_cpu_add_8(pcp, val)	percpu_to_op("add", (pcp), val)
+#define __this_cpu_and_8(pcp, val)	percpu_to_op("and", (pcp), val)
+#define __this_cpu_or_8(pcp, val)	percpu_to_op("or", (pcp), val)
+#define __this_cpu_xor_8(pcp, val)	percpu_to_op("xor", (pcp), val)
+
+#define this_cpu_read_8(pcp)		percpu_from_op("mov", (pcp), "m"(pcp))
+#define this_cpu_write_8(pcp, val)	percpu_to_op("mov", (pcp), val)
+#define this_cpu_add_8(pcp, val)	percpu_to_op("add", (pcp), val)
+#define this_cpu_and_8(pcp, val)	percpu_to_op("and", (pcp), val)
+#define this_cpu_or_8(pcp, val)		percpu_to_op("or", (pcp), val)
+#define this_cpu_xor_8(pcp, val)	percpu_to_op("xor", (pcp), val)
+
+#define irqsafe_cpu_add_8(pcp, val)	percpu_to_op("add", (pcp), val)
+#define irqsafe_cpu_and_8(pcp, val)	percpu_to_op("and", (pcp), val)
+#define irqsafe_cpu_or_8(pcp, val)	percpu_to_op("or", (pcp), val)
+#define irqsafe_cpu_xor_8(pcp, val)	percpu_to_op("xor", (pcp), val)
+
+#endif
+
 /* This is not atomic against other CPUs -- CPU preemption needs to be off */
 #define x86_test_and_clear_bit_percpu(bit, var)				\
 ({									\

-- 

^ permalink raw reply	[flat|nested] 65+ messages in thread

* Re: [this_cpu_xx V4 02/20] this_cpu: X86 optimized this_cpu operations
  2009-10-01 21:25 ` [this_cpu_xx V4 02/20] this_cpu: X86 optimized this_cpu operations cl
@ 2009-10-02  9:18   ` Tejun Heo
  2009-10-02  9:59   ` Ingo Molnar
  1 sibling, 0 replies; 65+ messages in thread
From: Tejun Heo @ 2009-10-02  9:18 UTC (permalink / raw)
  To: cl; +Cc: akpm, linux-kernel, mingo, rusty, Pekka Enberg

cl@linux-foundation.org wrote:
> Basically the existing percpu ops can be used for this_cpu variants that allow
> operations also on dynamically allocated percpu data. However, we do not pass a
> reference to a percpu variable in. Instead a dynamically or statically
> allocated percpu variable is provided.
> 
> Preempt, the non preempt and the irqsafe operations generate the same code.
> It will always be possible to have the requires per cpu atomicness in a single
> RMW instruction with segment override on x86.
> 
> 64 bit this_cpu operations are not supported on 32 bit.
> 
> Signed-off-by: Christoph Lameter <cl@linux-foundation.org>

Acked-by: Tejun Heo <tj@kernel.org>

-- 
tejun

^ permalink raw reply	[flat|nested] 65+ messages in thread

* Re: [this_cpu_xx V4 02/20] this_cpu: X86 optimized this_cpu operations
  2009-10-01 21:25 ` [this_cpu_xx V4 02/20] this_cpu: X86 optimized this_cpu operations cl
  2009-10-02  9:18   ` Tejun Heo
@ 2009-10-02  9:59   ` Ingo Molnar
  2009-10-03 19:33     ` Pekka Enberg
  1 sibling, 1 reply; 65+ messages in thread
From: Ingo Molnar @ 2009-10-02  9:59 UTC (permalink / raw)
  To: cl; +Cc: akpm, linux-kernel, Tejun Heo, rusty, Pekka Enberg


* cl@linux-foundation.org <cl@linux-foundation.org> wrote:

> Basically the existing percpu ops can be used for this_cpu variants 
> that allow operations also on dynamically allocated percpu data. 
> However, we do not pass a reference to a percpu variable in. Instead a 
> dynamically or statically allocated percpu variable is provided.
> 
> Preempt, the non preempt and the irqsafe operations generate the same 
> code. It will always be possible to have the requires per cpu 
> atomicness in a single RMW instruction with segment override on x86.
> 
> 64 bit this_cpu operations are not supported on 32 bit.
> 
> Signed-off-by: Christoph Lameter <cl@linux-foundation.org>

Acked-by: Ingo Molnar <mingo@elte.hu>

	Ingo

^ permalink raw reply	[flat|nested] 65+ messages in thread

* Re: [this_cpu_xx V4 02/20] this_cpu: X86 optimized this_cpu operations
  2009-10-02  9:59   ` Ingo Molnar
@ 2009-10-03 19:33     ` Pekka Enberg
  2009-10-04 16:47       ` Ingo Molnar
  0 siblings, 1 reply; 65+ messages in thread
From: Pekka Enberg @ 2009-10-03 19:33 UTC (permalink / raw)
  To: Ingo Molnar; +Cc: cl, akpm, linux-kernel, Tejun Heo, rusty

Hi,

Ingo Molnar wrote:
> * cl@linux-foundation.org <cl@linux-foundation.org> wrote:
> 
>> Basically the existing percpu ops can be used for this_cpu variants 
>> that allow operations also on dynamically allocated percpu data. 
>> However, we do not pass a reference to a percpu variable in. Instead a 
>> dynamically or statically allocated percpu variable is provided.
>>
>> Preempt, the non preempt and the irqsafe operations generate the same 
>> code. It will always be possible to have the requires per cpu 
>> atomicness in a single RMW instruction with segment override on x86.
>>
>> 64 bit this_cpu operations are not supported on 32 bit.
>>
>> Signed-off-by: Christoph Lameter <cl@linux-foundation.org>
> 
> Acked-by: Ingo Molnar <mingo@elte.hu>

I haven't looked at the series in detail but AFAICT the SLUB patches 
depend on the x86 ones. Any suggestions how to get all this into 
linux-next? Should I make a topic branch in slab.git on top of -tip or 
something?

			Pekka

^ permalink raw reply	[flat|nested] 65+ messages in thread

* Re: [this_cpu_xx V4 02/20] this_cpu: X86 optimized this_cpu operations
  2009-10-03 19:33     ` Pekka Enberg
@ 2009-10-04 16:47       ` Ingo Molnar
  2009-10-04 16:51         ` Pekka Enberg
  0 siblings, 1 reply; 65+ messages in thread
From: Ingo Molnar @ 2009-10-04 16:47 UTC (permalink / raw)
  To: Pekka Enberg; +Cc: cl, akpm, linux-kernel, Tejun Heo, rusty


* Pekka Enberg <penberg@cs.helsinki.fi> wrote:

> Hi,
>
> Ingo Molnar wrote:
>> * cl@linux-foundation.org <cl@linux-foundation.org> wrote:
>>
>>> Basically the existing percpu ops can be used for this_cpu variants  
>>> that allow operations also on dynamically allocated percpu data.  
>>> However, we do not pass a reference to a percpu variable in. Instead 
>>> a dynamically or statically allocated percpu variable is provided.
>>>
>>> Preempt, the non preempt and the irqsafe operations generate the same 
>>> code. It will always be possible to have the requires per cpu  
>>> atomicness in a single RMW instruction with segment override on x86.
>>>
>>> 64 bit this_cpu operations are not supported on 32 bit.
>>>
>>> Signed-off-by: Christoph Lameter <cl@linux-foundation.org>
>>
>> Acked-by: Ingo Molnar <mingo@elte.hu>
>
> I haven't looked at the series in detail but AFAICT the SLUB patches 
> depend on the x86 ones. Any suggestions how to get all this into 
> linux-next? Should I make a topic branch in slab.git on top of -tip or 
> something?

I'd suggest to keep these patches together in the right topical tree: 
Tejun's percpu tree. Any problem with that approach?

	Ingo

^ permalink raw reply	[flat|nested] 65+ messages in thread

* Re: [this_cpu_xx V4 02/20] this_cpu: X86 optimized this_cpu operations
  2009-10-04 16:47       ` Ingo Molnar
@ 2009-10-04 16:51         ` Pekka Enberg
  0 siblings, 0 replies; 65+ messages in thread
From: Pekka Enberg @ 2009-10-04 16:51 UTC (permalink / raw)
  To: Ingo Molnar; +Cc: cl, akpm, linux-kernel, Tejun Heo, rusty

Hi Ingo,

Ingo Molnar wrote:
> * Pekka Enberg <penberg@cs.helsinki.fi> wrote:
> 
>> Hi,
>>
>> Ingo Molnar wrote:
>>> * cl@linux-foundation.org <cl@linux-foundation.org> wrote:
>>>
>>>> Basically the existing percpu ops can be used for this_cpu variants  
>>>> that allow operations also on dynamically allocated percpu data.  
>>>> However, we do not pass a reference to a percpu variable in. Instead 
>>>> a dynamically or statically allocated percpu variable is provided.
>>>>
>>>> Preempt, the non preempt and the irqsafe operations generate the same 
>>>> code. It will always be possible to have the requires per cpu  
>>>> atomicness in a single RMW instruction with segment override on x86.
>>>>
>>>> 64 bit this_cpu operations are not supported on 32 bit.
>>>>
>>>> Signed-off-by: Christoph Lameter <cl@linux-foundation.org>
>>> Acked-by: Ingo Molnar <mingo@elte.hu>
>> I haven't looked at the series in detail but AFAICT the SLUB patches 
>> depend on the x86 ones. Any suggestions how to get all this into 
>> linux-next? Should I make a topic branch in slab.git on top of -tip or 
>> something?
> 
> I'd suggest to keep these patches together in the right topical tree: 
> Tejun's percpu tree. Any problem with that approach?

I'm fine with that. Just wanted to make sure who is taking the patches 
and if I should pick any of them up. We can get some conflicts between 
the per-cpu tree and slab.git if new SLUB patches get merged but that's 
probably not a huge problem.

			Pekka

^ permalink raw reply	[flat|nested] 65+ messages in thread

* [this_cpu_xx V4 03/20] Use this_cpu operations for SNMP statistics
  2009-10-01 21:25 [this_cpu_xx V4 00/20] Introduce per cpu atomic operations and avoid per cpu address arithmetic cl
  2009-10-01 21:25 ` [this_cpu_xx V4 01/20] Introduce this_cpu_ptr() and generic this_cpu_* operations cl
  2009-10-01 21:25 ` [this_cpu_xx V4 02/20] this_cpu: X86 optimized this_cpu operations cl
@ 2009-10-01 21:25 ` cl
  2009-10-01 21:25 ` [this_cpu_xx V4 04/20] Use this_cpu operations for NFS statistics cl
                   ` (17 subsequent siblings)
  20 siblings, 0 replies; 65+ messages in thread
From: cl @ 2009-10-01 21:25 UTC (permalink / raw)
  To: akpm; +Cc: linux-kernel, Tejun Heo, mingo, rusty, Pekka Enberg

[-- Attachment #1: this_cpu_snmp --]
[-- Type: text/plain, Size: 2921 bytes --]

SNMP statistic macros can be signficantly simplified.
This will also reduce code size if the arch supports these operations
in hardware.

Acked-by: Tejun Heo <tj@kernel.org>
Signed-off-by: Christoph Lameter <cl@linux-foundation.org>

---
 include/net/snmp.h |   50 ++++++++++++++++++--------------------------------
 1 file changed, 18 insertions(+), 32 deletions(-)

Index: linux-2.6/include/net/snmp.h
===================================================================
--- linux-2.6.orig/include/net/snmp.h	2009-09-30 11:37:26.000000000 -0500
+++ linux-2.6/include/net/snmp.h	2009-09-30 12:57:48.000000000 -0500
@@ -136,45 +136,31 @@ struct linux_xfrm_mib {
 #define SNMP_STAT_BHPTR(name)	(name[0])
 #define SNMP_STAT_USRPTR(name)	(name[1])
 
-#define SNMP_INC_STATS_BH(mib, field) 	\
-	(per_cpu_ptr(mib[0], raw_smp_processor_id())->mibs[field]++)
-#define SNMP_INC_STATS_USER(mib, field) \
-	do { \
-		per_cpu_ptr(mib[1], get_cpu())->mibs[field]++; \
-		put_cpu(); \
-	} while (0)
-#define SNMP_INC_STATS(mib, field) 	\
-	do { \
-		per_cpu_ptr(mib[!in_softirq()], get_cpu())->mibs[field]++; \
-		put_cpu(); \
-	} while (0)
-#define SNMP_DEC_STATS(mib, field) 	\
-	do { \
-		per_cpu_ptr(mib[!in_softirq()], get_cpu())->mibs[field]--; \
-		put_cpu(); \
-	} while (0)
-#define SNMP_ADD_STATS(mib, field, addend) 	\
-	do { \
-		per_cpu_ptr(mib[!in_softirq()], get_cpu())->mibs[field] += addend; \
-		put_cpu(); \
-	} while (0)
-#define SNMP_ADD_STATS_BH(mib, field, addend) 	\
-	(per_cpu_ptr(mib[0], raw_smp_processor_id())->mibs[field] += addend)
-#define SNMP_ADD_STATS_USER(mib, field, addend) 	\
-	do { \
-		per_cpu_ptr(mib[1], get_cpu())->mibs[field] += addend; \
-		put_cpu(); \
-	} while (0)
+#define SNMP_INC_STATS_BH(mib, field)	\
+			__this_cpu_inc(mib[0]->mibs[field])
+#define SNMP_INC_STATS_USER(mib, field)	\
+			this_cpu_inc(mib[1]->mibs[field])
+#define SNMP_INC_STATS(mib, field)	\
+			this_cpu_inc(mib[!in_softirq()]->mibs[field])
+#define SNMP_DEC_STATS(mib, field)	\
+			this_cpu_dec(mib[!in_softirq()]->mibs[field])
+#define SNMP_ADD_STATS_BH(mib, field, addend)	\
+			__this_cpu_add(mib[0]->mibs[field], addend)
+#define SNMP_ADD_STATS_USER(mib, field, addend)	\
+			this_cpu_add(mib[1]->mibs[field], addend)
 #define SNMP_UPD_PO_STATS(mib, basefield, addend)	\
 	do { \
-		__typeof__(mib[0]) ptr = per_cpu_ptr(mib[!in_softirq()], get_cpu());\
+		__typeof__(mib[0]) ptr; \
+		preempt_disable(); \
+		ptr = this_cpu_ptr((mib)[!in_softirq()]); \
 		ptr->mibs[basefield##PKTS]++; \
 		ptr->mibs[basefield##OCTETS] += addend;\
-		put_cpu(); \
+		preempt_enable(); \
 	} while (0)
 #define SNMP_UPD_PO_STATS_BH(mib, basefield, addend)	\
 	do { \
-		__typeof__(mib[0]) ptr = per_cpu_ptr(mib[!in_softirq()], raw_smp_processor_id());\
+		__typeof__(mib[0]) ptr = \
+			__this_cpu_ptr((mib)[!in_softirq()]); \
 		ptr->mibs[basefield##PKTS]++; \
 		ptr->mibs[basefield##OCTETS] += addend;\
 	} while (0)

-- 

^ permalink raw reply	[flat|nested] 65+ messages in thread

* [this_cpu_xx V4 04/20] Use this_cpu operations for NFS statistics
  2009-10-01 21:25 [this_cpu_xx V4 00/20] Introduce per cpu atomic operations and avoid per cpu address arithmetic cl
                   ` (2 preceding siblings ...)
  2009-10-01 21:25 ` [this_cpu_xx V4 03/20] Use this_cpu operations for SNMP statistics cl
@ 2009-10-01 21:25 ` cl
  2009-10-01 21:25 ` [this_cpu_xx V4 05/20] use this_cpu ops for network statistics cl
                   ` (16 subsequent siblings)
  20 siblings, 0 replies; 65+ messages in thread
From: cl @ 2009-10-01 21:25 UTC (permalink / raw)
  To: akpm; +Cc: linux-kernel, Tejun Heo, Trond Myklebust, mingo, rusty,
	Pekka Enberg

[-- Attachment #1: this_cpu_nfs --]
[-- Type: text/plain, Size: 1785 bytes --]

Simplify NFS statistics and allow the use of optimized
arch instructions.

Acked-by: Tejun Heo <tj@kernel.org>
CC: Trond Myklebust <trond.myklebust@fys.uio.no>
Signed-off-by: Christoph Lameter <cl@linux-foundation.org>

---
 fs/nfs/iostat.h |   24 +++---------------------
 1 file changed, 3 insertions(+), 21 deletions(-)

Index: linux-2.6/fs/nfs/iostat.h
===================================================================
--- linux-2.6.orig/fs/nfs/iostat.h	2009-09-29 11:57:01.000000000 -0500
+++ linux-2.6/fs/nfs/iostat.h	2009-09-29 12:26:42.000000000 -0500
@@ -25,13 +25,7 @@ struct nfs_iostats {
 static inline void nfs_inc_server_stats(const struct nfs_server *server,
 					enum nfs_stat_eventcounters stat)
 {
-	struct nfs_iostats *iostats;
-	int cpu;
-
-	cpu = get_cpu();
-	iostats = per_cpu_ptr(server->io_stats, cpu);
-	iostats->events[stat]++;
-	put_cpu();
+	this_cpu_inc(server->io_stats->events[stat]);
 }
 
 static inline void nfs_inc_stats(const struct inode *inode,
@@ -44,13 +38,7 @@ static inline void nfs_add_server_stats(
 					enum nfs_stat_bytecounters stat,
 					unsigned long addend)
 {
-	struct nfs_iostats *iostats;
-	int cpu;
-
-	cpu = get_cpu();
-	iostats = per_cpu_ptr(server->io_stats, cpu);
-	iostats->bytes[stat] += addend;
-	put_cpu();
+	this_cpu_add(server->io_stats->bytes[stat], addend);
 }
 
 static inline void nfs_add_stats(const struct inode *inode,
@@ -65,13 +53,7 @@ static inline void nfs_add_fscache_stats
 					 enum nfs_stat_fscachecounters stat,
 					 unsigned long addend)
 {
-	struct nfs_iostats *iostats;
-	int cpu;
-
-	cpu = get_cpu();
-	iostats = per_cpu_ptr(NFS_SERVER(inode)->io_stats, cpu);
-	iostats->fscache[stat] += addend;
-	put_cpu();
+	this_cpu_add(NFS_SERVER(inode)->io_stats->fscache[stat], addend);
 }
 #endif
 

-- 

^ permalink raw reply	[flat|nested] 65+ messages in thread

* [this_cpu_xx V4 05/20] use this_cpu ops for network statistics
  2009-10-01 21:25 [this_cpu_xx V4 00/20] Introduce per cpu atomic operations and avoid per cpu address arithmetic cl
                   ` (3 preceding siblings ...)
  2009-10-01 21:25 ` [this_cpu_xx V4 04/20] Use this_cpu operations for NFS statistics cl
@ 2009-10-01 21:25 ` cl
  2009-10-01 21:25 ` [this_cpu_xx V4 06/20] this_cpu_ptr: Straight transformations cl
                   ` (15 subsequent siblings)
  20 siblings, 0 replies; 65+ messages in thread
From: cl @ 2009-10-01 21:25 UTC (permalink / raw)
  To: akpm; +Cc: linux-kernel, Tejun Heo, David Miller, mingo, rusty, Pekka Enberg

[-- Attachment #1: this_cpu_net --]
[-- Type: text/plain, Size: 1760 bytes --]

Acked-by: Tejun Heo <tj@kernel.org>
Acked-by: David Miller <davem@davemloft.net>
Signed-off-by: Christoph Lameter <cl@linux-foundation.org>

---
 include/net/neighbour.h              |    7 +------
 include/net/netfilter/nf_conntrack.h |    4 ++--
 2 files changed, 3 insertions(+), 8 deletions(-)

Index: linux-2.6/include/net/neighbour.h
===================================================================
--- linux-2.6.orig/include/net/neighbour.h	2009-09-30 18:32:31.000000000 -0500
+++ linux-2.6/include/net/neighbour.h	2009-09-30 18:32:55.000000000 -0500
@@ -90,12 +90,7 @@ struct neigh_statistics
 	unsigned long unres_discards;	/* number of unresolved drops */
 };
 
-#define NEIGH_CACHE_STAT_INC(tbl, field)				\
-	do {								\
-		preempt_disable();					\
-		(per_cpu_ptr((tbl)->stats, smp_processor_id())->field)++; \
-		preempt_enable();					\
-	} while (0)
+#define NEIGH_CACHE_STAT_INC(tbl, field) this_cpu_inc((tbl)->stats->field)
 
 struct neighbour
 {
Index: linux-2.6/include/net/netfilter/nf_conntrack.h
===================================================================
--- linux-2.6.orig/include/net/netfilter/nf_conntrack.h	2009-09-30 18:32:57.000000000 -0500
+++ linux-2.6/include/net/netfilter/nf_conntrack.h	2009-09-30 18:34:13.000000000 -0500
@@ -295,11 +295,11 @@ extern unsigned int nf_conntrack_htable_
 extern unsigned int nf_conntrack_max;
 
 #define NF_CT_STAT_INC(net, count)	\
-	(per_cpu_ptr((net)->ct.stat, raw_smp_processor_id())->count++)
+	__this_cpu_inc((net)->ct.stat->count)
 #define NF_CT_STAT_INC_ATOMIC(net, count)		\
 do {							\
 	local_bh_disable();				\
-	per_cpu_ptr((net)->ct.stat, raw_smp_processor_id())->count++;	\
+	__this_cpu_inc((net)->ct.stat->count);		\
 	local_bh_enable();				\
 } while (0)
 

-- 

^ permalink raw reply	[flat|nested] 65+ messages in thread

* [this_cpu_xx V4 06/20] this_cpu_ptr: Straight transformations
  2009-10-01 21:25 [this_cpu_xx V4 00/20] Introduce per cpu atomic operations and avoid per cpu address arithmetic cl
                   ` (4 preceding siblings ...)
  2009-10-01 21:25 ` [this_cpu_xx V4 05/20] use this_cpu ops for network statistics cl
@ 2009-10-01 21:25 ` cl
  2009-10-01 21:25 ` [this_cpu_xx V4 07/20] this_cpu_ptr: Eliminate get/put_cpu cl
                   ` (14 subsequent siblings)
  20 siblings, 0 replies; 65+ messages in thread
From: cl @ 2009-10-01 21:25 UTC (permalink / raw)
  To: akpm
  Cc: linux-kernel, David Howells, Tejun Heo, Ingo Molnar,
	Rusty Russell, Eric Dumazet, Pekka Enberg

[-- Attachment #1: this_cpu_ptr_straight_transforms --]
[-- Type: text/plain, Size: 3566 bytes --]

Use this_cpu_ptr and __this_cpu_ptr in locations where straight
transformations are possible because per_cpu_ptr is used with
either smp_processor_id() or raw_smp_processor_id().

cc: David Howells <dhowells@redhat.com>
Acked-by: Tejun Heo <tj@kernel.org>
cc: Ingo Molnar <mingo@elte.hu>
cc: Rusty Russell <rusty@rustcorp.com.au>
cc: Eric Dumazet <dada1@cosmosbay.com>
Signed-off-by: Christoph Lameter <cl@linux-foundation.org>

---
 drivers/infiniband/hw/ehca/ehca_irq.c |    3 +--
 drivers/net/chelsio/sge.c             |    5 ++---
 drivers/net/loopback.c                |    2 +-
 fs/ext4/mballoc.c                     |    2 +-
 4 files changed, 5 insertions(+), 7 deletions(-)

Index: linux-2.6/drivers/net/chelsio/sge.c
===================================================================
--- linux-2.6.orig/drivers/net/chelsio/sge.c	2009-09-29 09:31:40.000000000 -0500
+++ linux-2.6/drivers/net/chelsio/sge.c	2009-09-29 11:39:20.000000000 -0500
@@ -1378,7 +1378,7 @@ static void sge_rx(struct sge *sge, stru
 	}
 	__skb_pull(skb, sizeof(*p));
 
-	st = per_cpu_ptr(sge->port_stats[p->iff], smp_processor_id());
+	st = this_cpu_ptr(sge->port_stats[p->iff]);
 
 	skb->protocol = eth_type_trans(skb, adapter->port[p->iff].dev);
 	if ((adapter->flags & RX_CSUM_ENABLED) && p->csum == 0xffff &&
@@ -1780,8 +1780,7 @@ netdev_tx_t t1_start_xmit(struct sk_buff
 {
 	struct adapter *adapter = dev->ml_priv;
 	struct sge *sge = adapter->sge;
-	struct sge_port_stats *st = per_cpu_ptr(sge->port_stats[dev->if_port],
-						smp_processor_id());
+	struct sge_port_stats *st = this_cpu_ptr(sge->port_stats[dev->if_port]);
 	struct cpl_tx_pkt *cpl;
 	struct sk_buff *orig_skb = skb;
 	int ret;
Index: linux-2.6/drivers/net/loopback.c
===================================================================
--- linux-2.6.orig/drivers/net/loopback.c	2009-09-29 09:31:40.000000000 -0500
+++ linux-2.6/drivers/net/loopback.c	2009-09-29 11:39:20.000000000 -0500
@@ -81,7 +81,7 @@ static netdev_tx_t loopback_xmit(struct 
 
 	/* it's OK to use per_cpu_ptr() because BHs are off */
 	pcpu_lstats = dev->ml_priv;
-	lb_stats = per_cpu_ptr(pcpu_lstats, smp_processor_id());
+	lb_stats = this_cpu_ptr(pcpu_lstats);
 
 	len = skb->len;
 	if (likely(netif_rx(skb) == NET_RX_SUCCESS)) {
Index: linux-2.6/fs/ext4/mballoc.c
===================================================================
--- linux-2.6.orig/fs/ext4/mballoc.c	2009-09-29 09:31:40.000000000 -0500
+++ linux-2.6/fs/ext4/mballoc.c	2009-09-29 11:39:20.000000000 -0500
@@ -4210,7 +4210,7 @@ static void ext4_mb_group_or_file(struct
 	 * per cpu locality group is to reduce the contention between block
 	 * request from multiple CPUs.
 	 */
-	ac->ac_lg = per_cpu_ptr(sbi->s_locality_groups, raw_smp_processor_id());
+	ac->ac_lg = __this_cpu_ptr(sbi->s_locality_groups);
 
 	/* we're going to use group allocation */
 	ac->ac_flags |= EXT4_MB_HINT_GROUP_ALLOC;
Index: linux-2.6/drivers/infiniband/hw/ehca/ehca_irq.c
===================================================================
--- linux-2.6.orig/drivers/infiniband/hw/ehca/ehca_irq.c	2009-09-29 09:31:40.000000000 -0500
+++ linux-2.6/drivers/infiniband/hw/ehca/ehca_irq.c	2009-09-29 11:39:20.000000000 -0500
@@ -826,8 +826,7 @@ static void __cpuinit take_over_work(str
 		cq = list_entry(cct->cq_list.next, struct ehca_cq, entry);
 
 		list_del(&cq->entry);
-		__queue_comp_task(cq, per_cpu_ptr(pool->cpu_comp_tasks,
-						  smp_processor_id()));
+		__queue_comp_task(cq, this_cpu_ptr(pool->cpu_comp_tasks));
 	}
 
 	spin_unlock_irqrestore(&cct->task_lock, flags_cct);

-- 

^ permalink raw reply	[flat|nested] 65+ messages in thread

* [this_cpu_xx V4 07/20] this_cpu_ptr: Eliminate get/put_cpu
  2009-10-01 21:25 [this_cpu_xx V4 00/20] Introduce per cpu atomic operations and avoid per cpu address arithmetic cl
                   ` (5 preceding siblings ...)
  2009-10-01 21:25 ` [this_cpu_xx V4 06/20] this_cpu_ptr: Straight transformations cl
@ 2009-10-01 21:25 ` cl
  2009-10-01 21:25 ` [this_cpu_xx V4 08/20] this_cpu_ptr: xfs_icsb_modify_counters does not need "cpu" variable cl
                   ` (13 subsequent siblings)
  20 siblings, 0 replies; 65+ messages in thread
From: cl @ 2009-10-01 21:25 UTC (permalink / raw)
  To: akpm
  Cc: linux-kernel, Maciej Sosnowski, Dan Williams, Tejun Heo,
	Eric Biederman, Stephen Hemminger, David L Stevens, mingo, rusty,
	Pekka Enberg

[-- Attachment #1: this_cpu_ptr_eliminate_get_put_cpu --]
[-- Type: text/plain, Size: 4488 bytes --]

There are cases where we can use this_cpu_ptr and as the result
of using this_cpu_ptr() we no longer need to determine the
currently executing cpu.

In those places no get/put_cpu combination is needed anymore.
The local cpu variable can be eliminated.

Preemption still needs to be disabled and enabled since the
modifications of the per cpu variables is not atomic. There may
be multiple per cpu variables modified and those must all
be from the same processor.

Acked-by: Maciej Sosnowski <maciej.sosnowski@intel.com>
Acked-by: Dan Williams <dan.j.williams@intel.com>
Acked-by: Tejun Heo <tj@kernel.org>
cc: Eric Biederman <ebiederm@aristanetworks.com>
cc: Stephen Hemminger <shemminger@vyatta.com>
cc: David L Stevens <dlstevens@us.ibm.com>
Signed-off-by: Christoph Lameter <cl@linux-foundation.org>

---
 drivers/dma/dmaengine.c |   36 +++++++++++++-----------------------
 drivers/net/veth.c      |    7 +++----
 2 files changed, 16 insertions(+), 27 deletions(-)

Index: linux-2.6/drivers/dma/dmaengine.c
===================================================================
--- linux-2.6.orig/drivers/dma/dmaengine.c	2009-09-28 10:08:09.000000000 -0500
+++ linux-2.6/drivers/dma/dmaengine.c	2009-09-29 09:01:54.000000000 -0500
@@ -326,14 +326,7 @@ arch_initcall(dma_channel_table_init);
  */
 struct dma_chan *dma_find_channel(enum dma_transaction_type tx_type)
 {
-	struct dma_chan *chan;
-	int cpu;
-
-	cpu = get_cpu();
-	chan = per_cpu_ptr(channel_table[tx_type], cpu)->chan;
-	put_cpu();
-
-	return chan;
+	return this_cpu_read(channel_table[tx_type]->chan);
 }
 EXPORT_SYMBOL(dma_find_channel);
 
@@ -847,7 +840,6 @@ dma_async_memcpy_buf_to_buf(struct dma_c
 	struct dma_async_tx_descriptor *tx;
 	dma_addr_t dma_dest, dma_src;
 	dma_cookie_t cookie;
-	int cpu;
 	unsigned long flags;
 
 	dma_src = dma_map_single(dev->dev, src, len, DMA_TO_DEVICE);
@@ -866,10 +858,10 @@ dma_async_memcpy_buf_to_buf(struct dma_c
 	tx->callback = NULL;
 	cookie = tx->tx_submit(tx);
 
-	cpu = get_cpu();
-	per_cpu_ptr(chan->local, cpu)->bytes_transferred += len;
-	per_cpu_ptr(chan->local, cpu)->memcpy_count++;
-	put_cpu();
+	preempt_disable();
+	__this_cpu_add(chan->local->bytes_transferred, len);
+	__this_cpu_inc(chan->local->memcpy_count);
+	preempt_enable();
 
 	return cookie;
 }
@@ -896,7 +888,6 @@ dma_async_memcpy_buf_to_pg(struct dma_ch
 	struct dma_async_tx_descriptor *tx;
 	dma_addr_t dma_dest, dma_src;
 	dma_cookie_t cookie;
-	int cpu;
 	unsigned long flags;
 
 	dma_src = dma_map_single(dev->dev, kdata, len, DMA_TO_DEVICE);
@@ -913,10 +904,10 @@ dma_async_memcpy_buf_to_pg(struct dma_ch
 	tx->callback = NULL;
 	cookie = tx->tx_submit(tx);
 
-	cpu = get_cpu();
-	per_cpu_ptr(chan->local, cpu)->bytes_transferred += len;
-	per_cpu_ptr(chan->local, cpu)->memcpy_count++;
-	put_cpu();
+	preempt_disable();
+	__this_cpu_add(chan->local->bytes_transferred, len);
+	__this_cpu_inc(chan->local->memcpy_count);
+	preempt_enable();
 
 	return cookie;
 }
@@ -945,7 +936,6 @@ dma_async_memcpy_pg_to_pg(struct dma_cha
 	struct dma_async_tx_descriptor *tx;
 	dma_addr_t dma_dest, dma_src;
 	dma_cookie_t cookie;
-	int cpu;
 	unsigned long flags;
 
 	dma_src = dma_map_page(dev->dev, src_pg, src_off, len, DMA_TO_DEVICE);
@@ -963,10 +953,10 @@ dma_async_memcpy_pg_to_pg(struct dma_cha
 	tx->callback = NULL;
 	cookie = tx->tx_submit(tx);
 
-	cpu = get_cpu();
-	per_cpu_ptr(chan->local, cpu)->bytes_transferred += len;
-	per_cpu_ptr(chan->local, cpu)->memcpy_count++;
-	put_cpu();
+	preempt_disable();
+	__this_cpu_add(chan->local->bytes_transferred, len);
+	__this_cpu_inc(chan->local->memcpy_count);
+	preempt_enable();
 
 	return cookie;
 }
Index: linux-2.6/drivers/net/veth.c
===================================================================
--- linux-2.6.orig/drivers/net/veth.c	2009-09-17 17:54:16.000000000 -0500
+++ linux-2.6/drivers/net/veth.c	2009-09-29 09:01:54.000000000 -0500
@@ -153,7 +153,7 @@ static netdev_tx_t veth_xmit(struct sk_b
 	struct net_device *rcv = NULL;
 	struct veth_priv *priv, *rcv_priv;
 	struct veth_net_stats *stats, *rcv_stats;
-	int length, cpu;
+	int length;
 
 	skb_orphan(skb);
 
@@ -161,9 +161,8 @@ static netdev_tx_t veth_xmit(struct sk_b
 	rcv = priv->peer;
 	rcv_priv = netdev_priv(rcv);
 
-	cpu = smp_processor_id();
-	stats = per_cpu_ptr(priv->stats, cpu);
-	rcv_stats = per_cpu_ptr(rcv_priv->stats, cpu);
+	stats = this_cpu_ptr(priv->stats);
+	rcv_stats = this_cpu_ptr(rcv_priv->stats);
 
 	if (!(rcv->flags & IFF_UP))
 		goto tx_drop;

-- 

^ permalink raw reply	[flat|nested] 65+ messages in thread

* [this_cpu_xx V4 08/20] this_cpu_ptr: xfs_icsb_modify_counters does not need "cpu" variable
  2009-10-01 21:25 [this_cpu_xx V4 00/20] Introduce per cpu atomic operations and avoid per cpu address arithmetic cl
                   ` (6 preceding siblings ...)
  2009-10-01 21:25 ` [this_cpu_xx V4 07/20] this_cpu_ptr: Eliminate get/put_cpu cl
@ 2009-10-01 21:25 ` cl
  2009-10-01 21:25 ` [this_cpu_xx V4 09/20] Use this_cpu_ptr in crypto subsystem cl
                   ` (12 subsequent siblings)
  20 siblings, 0 replies; 65+ messages in thread
From: cl @ 2009-10-01 21:25 UTC (permalink / raw)
  To: akpm; +Cc: linux-kernel, Tejun Heo, Olaf Weber, mingo, rusty, Pekka Enberg

[-- Attachment #1: this_cpu_ptr_xfs --]
[-- Type: text/plain, Size: 1475 bytes --]

The xfs_icsb_modify_counters() function no longer needs the cpu variable
if we use this_cpu_ptr() and we can get rid of get/put_cpu().

Acked-by: Tejun Heo <tj@kernel.org>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Acked-by: Olaf Weber <olaf@sgi.com>
Signed-off-by: Christoph Lameter <cl@linux-foundation.org>

---
 fs/xfs/xfs_mount.c |   12 ++++++------
 1 file changed, 6 insertions(+), 6 deletions(-)

Index: linux-2.6/fs/xfs/xfs_mount.c
===================================================================
--- linux-2.6.orig/fs/xfs/xfs_mount.c	2009-06-15 09:08:01.000000000 -0500
+++ linux-2.6/fs/xfs/xfs_mount.c	2009-06-15 14:20:11.000000000 -0500
@@ -2389,12 +2389,12 @@ xfs_icsb_modify_counters(
 {
 	xfs_icsb_cnts_t	*icsbp;
 	long long	lcounter;	/* long counter for 64 bit fields */
-	int		cpu, ret = 0;
+	int		ret = 0;
 
 	might_sleep();
 again:
-	cpu = get_cpu();
-	icsbp = (xfs_icsb_cnts_t *)per_cpu_ptr(mp->m_sb_cnts, cpu);
+	preempt_disable();
+	icsbp = this_cpu_ptr(mp->m_sb_cnts);
 
 	/*
 	 * if the counter is disabled, go to slow path
@@ -2438,11 +2438,11 @@ again:
 		break;
 	}
 	xfs_icsb_unlock_cntr(icsbp);
-	put_cpu();
+	preempt_enable();
 	return 0;
 
 slow_path:
-	put_cpu();
+	preempt_enable();
 
 	/*
 	 * serialise with a mutex so we don't burn lots of cpu on
@@ -2490,7 +2490,7 @@ slow_path:
 
 balance_counter:
 	xfs_icsb_unlock_cntr(icsbp);
-	put_cpu();
+	preempt_enable();
 
 	/*
 	 * We may have multiple threads here if multiple per-cpu

-- 

^ permalink raw reply	[flat|nested] 65+ messages in thread

* [this_cpu_xx V4 09/20] Use this_cpu_ptr in crypto subsystem
  2009-10-01 21:25 [this_cpu_xx V4 00/20] Introduce per cpu atomic operations and avoid per cpu address arithmetic cl
                   ` (7 preceding siblings ...)
  2009-10-01 21:25 ` [this_cpu_xx V4 08/20] this_cpu_ptr: xfs_icsb_modify_counters does not need "cpu" variable cl
@ 2009-10-01 21:25 ` cl
  2009-10-01 21:25 ` [this_cpu_xx V4 10/20] Use this_cpu ops for VM statistics cl
                   ` (11 subsequent siblings)
  20 siblings, 0 replies; 65+ messages in thread
From: cl @ 2009-10-01 21:25 UTC (permalink / raw)
  To: akpm; +Cc: linux-kernel, Tejun Heo, Huang Ying, mingo, rusty, Pekka Enberg

[-- Attachment #1: this_cpu_ptr_crypto --]
[-- Type: text/plain, Size: 948 bytes --]

Just a slight optimization that removes one array lookup.
The processor number is needed for other things as well so the
get/put_cpu cannot be removed.

Acked-by: Tejun Heo <tj@kernel.org>
Cc: Huang Ying <ying.huang@intel.com>
Signed-off-by: Christoph Lameter <cl@linux-foundation.org>

---
 crypto/cryptd.c |    2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

Index: linux-2.6/crypto/cryptd.c
===================================================================
--- linux-2.6.orig/crypto/cryptd.c	2009-09-14 08:47:15.000000000 -0500
+++ linux-2.6/crypto/cryptd.c	2009-09-15 13:47:11.000000000 -0500
@@ -99,7 +99,7 @@ static int cryptd_enqueue_request(struct
 	struct cryptd_cpu_queue *cpu_queue;
 
 	cpu = get_cpu();
-	cpu_queue = per_cpu_ptr(queue->cpu_queue, cpu);
+	cpu_queue = this_cpu_ptr(queue->cpu_queue);
 	err = crypto_enqueue_request(&cpu_queue->queue, request);
 	queue_work_on(cpu, kcrypto_wq, &cpu_queue->work);
 	put_cpu();

-- 

^ permalink raw reply	[flat|nested] 65+ messages in thread

* [this_cpu_xx V4 10/20] Use this_cpu ops for VM statistics.
  2009-10-01 21:25 [this_cpu_xx V4 00/20] Introduce per cpu atomic operations and avoid per cpu address arithmetic cl
                   ` (8 preceding siblings ...)
  2009-10-01 21:25 ` [this_cpu_xx V4 09/20] Use this_cpu_ptr in crypto subsystem cl
@ 2009-10-01 21:25 ` cl
  2009-10-01 21:25 ` [this_cpu_xx V4 11/20] RCU: Use this_cpu operations cl
                   ` (10 subsequent siblings)
  20 siblings, 0 replies; 65+ messages in thread
From: cl @ 2009-10-01 21:25 UTC (permalink / raw)
  To: akpm; +Cc: linux-kernel, Tejun Heo, mingo, rusty, Pekka Enberg

[-- Attachment #1: this_cpu_vmstats --]
[-- Type: text/plain, Size: 1525 bytes --]

Using per cpu atomics for the vm statistics reduces their overhead.
And in the case of x86 we are guaranteed that they will never race even
in the lax form used for vm statistics.

Acked-by: Tejun Heo <tj@kernel.org>
Signed-off-by: Christoph Lameter <cl@linux-foundation.org>

---
 include/linux/vmstat.h |   10 ++++------
 1 file changed, 4 insertions(+), 6 deletions(-)

Index: linux-2.6/include/linux/vmstat.h
===================================================================
--- linux-2.6.orig/include/linux/vmstat.h	2009-08-14 10:07:18.000000000 -0500
+++ linux-2.6/include/linux/vmstat.h	2009-09-01 14:59:22.000000000 -0500
@@ -76,24 +76,22 @@ DECLARE_PER_CPU(struct vm_event_state, v
 
 static inline void __count_vm_event(enum vm_event_item item)
 {
-	__get_cpu_var(vm_event_states).event[item]++;
+	__this_cpu_inc(per_cpu_var(vm_event_states).event[item]);
 }
 
 static inline void count_vm_event(enum vm_event_item item)
 {
-	get_cpu_var(vm_event_states).event[item]++;
-	put_cpu();
+	this_cpu_inc(per_cpu_var(vm_event_states).event[item]);
 }
 
 static inline void __count_vm_events(enum vm_event_item item, long delta)
 {
-	__get_cpu_var(vm_event_states).event[item] += delta;
+	__this_cpu_add(per_cpu_var(vm_event_states).event[item], delta);
 }
 
 static inline void count_vm_events(enum vm_event_item item, long delta)
 {
-	get_cpu_var(vm_event_states).event[item] += delta;
-	put_cpu();
+	this_cpu_add(per_cpu_var(vm_event_states).event[item], delta);
 }
 
 extern void all_vm_events(unsigned long *);

-- 

^ permalink raw reply	[flat|nested] 65+ messages in thread

* [this_cpu_xx V4 11/20] RCU: Use this_cpu operations
  2009-10-01 21:25 [this_cpu_xx V4 00/20] Introduce per cpu atomic operations and avoid per cpu address arithmetic cl
                   ` (9 preceding siblings ...)
  2009-10-01 21:25 ` [this_cpu_xx V4 10/20] Use this_cpu ops for VM statistics cl
@ 2009-10-01 21:25 ` cl
  2009-10-03 10:52   ` Tejun Heo
  2009-10-01 21:25 ` [this_cpu_xx V4 12/20] Move early initialization of pagesets out of zone_wait_table_init() cl
                   ` (9 subsequent siblings)
  20 siblings, 1 reply; 65+ messages in thread
From: cl @ 2009-10-01 21:25 UTC (permalink / raw)
  To: akpm; +Cc: linux-kernel, Tejun Heo, Paul E. McKenney, mingo, rusty,
	Pekka Enberg

[-- Attachment #1: this_cpu_rcu --]
[-- Type: text/plain, Size: 1916 bytes --]

RCU does not do dynamic allocations but it increments per cpu variables
a lot. These instructions results in a move to a register and then back
to memory. This patch will make it use the inc/dec instructions on x86
that do not need a register.

Acked-by: Tejun Heo <tj@kernel.org>
Acked-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
Signed-off-by: Christoph Lameter <cl@linux-foundation.org>

---
 kernel/rcutorture.c |    8 ++++----
 1 file changed, 4 insertions(+), 4 deletions(-)

Index: linux-2.6/kernel/rcutorture.c
===================================================================
--- linux-2.6.orig/kernel/rcutorture.c	2009-09-28 10:08:10.000000000 -0500
+++ linux-2.6/kernel/rcutorture.c	2009-09-29 09:02:00.000000000 -0500
@@ -731,13 +731,13 @@ static void rcu_torture_timer(unsigned l
 		/* Should not happen, but... */
 		pipe_count = RCU_TORTURE_PIPE_LEN;
 	}
-	++__get_cpu_var(rcu_torture_count)[pipe_count];
+	__this_cpu_inc(per_cpu_var(rcu_torture_count)[pipe_count]);
 	completed = cur_ops->completed() - completed;
 	if (completed > RCU_TORTURE_PIPE_LEN) {
 		/* Should not happen, but... */
 		completed = RCU_TORTURE_PIPE_LEN;
 	}
-	++__get_cpu_var(rcu_torture_batch)[completed];
+	__this_cpu_inc(per_cpu_var(rcu_torture_batch)[completed]);
 	preempt_enable();
 	cur_ops->readunlock(idx);
 }
@@ -786,13 +786,13 @@ rcu_torture_reader(void *arg)
 			/* Should not happen, but... */
 			pipe_count = RCU_TORTURE_PIPE_LEN;
 		}
-		++__get_cpu_var(rcu_torture_count)[pipe_count];
+		__this_cpu_inc(per_cpu_var(rcu_torture_count)[pipe_count]);
 		completed = cur_ops->completed() - completed;
 		if (completed > RCU_TORTURE_PIPE_LEN) {
 			/* Should not happen, but... */
 			completed = RCU_TORTURE_PIPE_LEN;
 		}
-		++__get_cpu_var(rcu_torture_batch)[completed];
+		__this_cpu_inc(per_cpu_var(rcu_torture_batch)[completed]);
 		preempt_enable();
 		cur_ops->readunlock(idx);
 		schedule();

-- 

^ permalink raw reply	[flat|nested] 65+ messages in thread

* Re: [this_cpu_xx V4 11/20] RCU: Use this_cpu operations
  2009-10-01 21:25 ` [this_cpu_xx V4 11/20] RCU: Use this_cpu operations cl
@ 2009-10-03 10:52   ` Tejun Heo
  0 siblings, 0 replies; 65+ messages in thread
From: Tejun Heo @ 2009-10-03 10:52 UTC (permalink / raw)
  To: cl; +Cc: akpm, linux-kernel, Paul E. McKenney, mingo, rusty, Pekka Enberg

cl@linux-foundation.org wrote:
> RCU does not do dynamic allocations but it increments per cpu variables
> a lot. These instructions results in a move to a register and then back
> to memory. This patch will make it use the inc/dec instructions on x86
> that do not need a register.
> 
> Acked-by: Tejun Heo <tj@kernel.org>
> Acked-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
> Signed-off-by: Christoph Lameter <cl@linux-foundation.org>

Patch titles slightly cleaned up and 0001-0011 are committed and
published to percpu#for-next.

Thanks.

-- 
tejun

^ permalink raw reply	[flat|nested] 65+ messages in thread

* [this_cpu_xx V4 12/20] Move early initialization of pagesets out of zone_wait_table_init()
  2009-10-01 21:25 [this_cpu_xx V4 00/20] Introduce per cpu atomic operations and avoid per cpu address arithmetic cl
                   ` (10 preceding siblings ...)
  2009-10-01 21:25 ` [this_cpu_xx V4 11/20] RCU: Use this_cpu operations cl
@ 2009-10-01 21:25 ` cl
  2009-10-02 14:16   ` Mel Gorman
  2009-10-03 10:29   ` Tejun Heo
  2009-10-01 21:25 ` [this_cpu_xx V4 13/20] this_cpu_ops: page allocator conversion cl
                   ` (8 subsequent siblings)
  20 siblings, 2 replies; 65+ messages in thread
From: cl @ 2009-10-01 21:25 UTC (permalink / raw)
  To: akpm; +Cc: linux-kernel, Mel Gorman, Tejun Heo, mingo, rusty, Pekka Enberg

[-- Attachment #1: this_cpu_move_initialization --]
[-- Type: text/plain, Size: 5564 bytes --]

Explicitly initialize the pagesets after the per cpu areas have been
initialized. This is necessary in order to be able to use per cpu
operations in later patches.

Cc: Mel Gorman <mel@csn.ul.ie>
Signed-off-by: Christoph Lameter <cl@linux-foundation.org>

---
 arch/ia64/kernel/setup.c       |    1 +
 arch/powerpc/kernel/setup_64.c |    1 +
 arch/sparc/kernel/smp_64.c     |    1 +
 arch/x86/kernel/setup_percpu.c |    2 ++
 include/linux/mm.h             |    1 +
 mm/page_alloc.c                |   40 +++++++++++++++++++++++++++++-----------
 mm/percpu.c                    |    2 ++
 7 files changed, 37 insertions(+), 11 deletions(-)

Index: linux-2.6/mm/page_alloc.c
===================================================================
--- linux-2.6.orig/mm/page_alloc.c	2009-10-01 08:54:19.000000000 -0500
+++ linux-2.6/mm/page_alloc.c	2009-10-01 09:36:19.000000000 -0500
@@ -3270,23 +3270,42 @@ void zone_pcp_update(struct zone *zone)
 	stop_machine(__zone_pcp_update, zone, NULL);
 }
 
-static __meminit void zone_pcp_init(struct zone *zone)
+/*
+ * Early setup of pagesets.
+ *
+ * In the NUMA case the pageset setup simply results in all zones pcp
+ * pointer being directed at a per cpu pageset with zero batchsize.
+ *
+ * This means that every free and every allocation occurs directly from
+ * the buddy allocator tables.
+ *
+ * The pageset never queues pages during early boot and is therefore usable
+ * for every type of zone.
+ */
+__meminit void setup_pagesets(void)
 {
 	int cpu;
-	unsigned long batch = zone_batchsize(zone);
+	struct zone *zone;
 
-	for (cpu = 0; cpu < NR_CPUS; cpu++) {
+	for_each_zone(zone) {
 #ifdef CONFIG_NUMA
-		/* Early boot. Slab allocator not functional yet */
-		zone_pcp(zone, cpu) = &boot_pageset[cpu];
-		setup_pageset(&boot_pageset[cpu],0);
+		unsigned long batch = 0;
+
+		for (cpu = 0; cpu < NR_CPUS; cpu++) {
+			/* Early boot. Slab allocator not functional yet */
+			zone_pcp(zone, cpu) = &boot_pageset[cpu];
+		}
 #else
-		setup_pageset(zone_pcp(zone,cpu), batch);
+		unsigned long batch = zone_batchsize(zone);
 #endif
+
+		for_each_possible_cpu(cpu)
+			setup_pageset(zone_pcp(zone, cpu), batch);
+
+		if (zone->present_pages)
+			printk(KERN_DEBUG "  %s zone: %lu pages, LIFO batch:%lu\n",
+				zone->name, zone->present_pages, batch);
 	}
-	if (zone->present_pages)
-		printk(KERN_DEBUG "  %s zone: %lu pages, LIFO batch:%lu\n",
-			zone->name, zone->present_pages, batch);
 }
 
 __meminit int init_currently_empty_zone(struct zone *zone,
@@ -3841,7 +3860,6 @@ static void __paginginit free_area_init_
 
 		zone->prev_priority = DEF_PRIORITY;
 
-		zone_pcp_init(zone);
 		for_each_lru(l) {
 			INIT_LIST_HEAD(&zone->lru[l].list);
 			zone->reclaim_stat.nr_saved_scan[l] = 0;
Index: linux-2.6/include/linux/mm.h
===================================================================
--- linux-2.6.orig/include/linux/mm.h	2009-10-01 08:54:19.000000000 -0500
+++ linux-2.6/include/linux/mm.h	2009-10-01 09:36:19.000000000 -0500
@@ -1060,6 +1060,7 @@ extern void show_mem(void);
 extern void si_meminfo(struct sysinfo * val);
 extern void si_meminfo_node(struct sysinfo *val, int nid);
 extern int after_bootmem;
+extern void setup_pagesets(void);
 
 #ifdef CONFIG_NUMA
 extern void setup_per_cpu_pageset(void);
Index: linux-2.6/arch/ia64/kernel/setup.c
===================================================================
--- linux-2.6.orig/arch/ia64/kernel/setup.c	2009-10-01 08:54:19.000000000 -0500
+++ linux-2.6/arch/ia64/kernel/setup.c	2009-10-01 09:35:39.000000000 -0500
@@ -864,6 +864,7 @@ void __init
 setup_per_cpu_areas (void)
 {
 	/* start_kernel() requires this... */
+	setup_pagesets();
 }
 #endif
 
Index: linux-2.6/arch/powerpc/kernel/setup_64.c
===================================================================
--- linux-2.6.orig/arch/powerpc/kernel/setup_64.c	2009-10-01 08:54:19.000000000 -0500
+++ linux-2.6/arch/powerpc/kernel/setup_64.c	2009-10-01 09:35:39.000000000 -0500
@@ -578,6 +578,7 @@ static void ppc64_do_msg(unsigned int sr
 		snprintf(buf, 128, "%s", msg);
 		ppc_md.progress(buf, 0);
 	}
+	setup_pagesets();
 }
 
 /* Print a boot progress message. */
Index: linux-2.6/arch/sparc/kernel/smp_64.c
===================================================================
--- linux-2.6.orig/arch/sparc/kernel/smp_64.c	2009-10-01 08:54:19.000000000 -0500
+++ linux-2.6/arch/sparc/kernel/smp_64.c	2009-10-01 09:35:39.000000000 -0500
@@ -1486,4 +1486,5 @@ void __init setup_per_cpu_areas(void)
 	of_fill_in_cpu_data();
 	if (tlb_type == hypervisor)
 		mdesc_fill_in_cpu_data(cpu_all_mask);
+	setup_pagesets();
 }
Index: linux-2.6/arch/x86/kernel/setup_percpu.c
===================================================================
--- linux-2.6.orig/arch/x86/kernel/setup_percpu.c	2009-10-01 08:54:19.000000000 -0500
+++ linux-2.6/arch/x86/kernel/setup_percpu.c	2009-10-01 09:35:39.000000000 -0500
@@ -269,4 +269,6 @@ void __init setup_per_cpu_areas(void)
 
 	/* Setup cpu initialized, callin, callout masks */
 	setup_cpu_local_masks();
+
+	setup_pagesets();
 }
Index: linux-2.6/mm/percpu.c
===================================================================
--- linux-2.6.orig/mm/percpu.c	2009-10-01 08:54:19.000000000 -0500
+++ linux-2.6/mm/percpu.c	2009-10-01 09:35:39.000000000 -0500
@@ -2062,5 +2062,7 @@ void __init setup_per_cpu_areas(void)
 	delta = (unsigned long)pcpu_base_addr - (unsigned long)__per_cpu_start;
 	for_each_possible_cpu(cpu)
 		__per_cpu_offset[cpu] = delta + pcpu_unit_offsets[cpu];
+
+	setup_pagesets();
 }
 #endif /* CONFIG_HAVE_SETUP_PER_CPU_AREA */

-- 

^ permalink raw reply	[flat|nested] 65+ messages in thread

* Re: [this_cpu_xx V4 12/20] Move early initialization of pagesets out of zone_wait_table_init()
  2009-10-01 21:25 ` [this_cpu_xx V4 12/20] Move early initialization of pagesets out of zone_wait_table_init() cl
@ 2009-10-02 14:16   ` Mel Gorman
  2009-10-02 17:30     ` Christoph Lameter
  2009-10-03 10:29   ` Tejun Heo
  1 sibling, 1 reply; 65+ messages in thread
From: Mel Gorman @ 2009-10-02 14:16 UTC (permalink / raw)
  To: cl; +Cc: akpm, linux-kernel, Tejun Heo, mingo, rusty, Pekka Enberg

On Thu, Oct 01, 2009 at 05:25:33PM -0400, cl@linux-foundation.org wrote:
> Explicitly initialize the pagesets after the per cpu areas have been
> initialized. This is necessary in order to be able to use per cpu
> operations in later patches.
> 

Can you be more explicit about this? I think the reasoning is as follows

A later patch will use DEFINE_PER_CPU which allocates memory later in
the boot-cycle after zones have already been initialised. Without this
patch, use of DEFINE_PER_CPU would result in invalid memory accesses
during pageset initialisation.

Have another question below as well.

> Cc: Mel Gorman <mel@csn.ul.ie>
> Signed-off-by: Christoph Lameter <cl@linux-foundation.org>
> 
> ---
>  arch/ia64/kernel/setup.c       |    1 +
>  arch/powerpc/kernel/setup_64.c |    1 +
>  arch/sparc/kernel/smp_64.c     |    1 +
>  arch/x86/kernel/setup_percpu.c |    2 ++
>  include/linux/mm.h             |    1 +
>  mm/page_alloc.c                |   40 +++++++++++++++++++++++++++++-----------
>  mm/percpu.c                    |    2 ++
>  7 files changed, 37 insertions(+), 11 deletions(-)
> 
> Index: linux-2.6/mm/page_alloc.c
> ===================================================================
> --- linux-2.6.orig/mm/page_alloc.c	2009-10-01 08:54:19.000000000 -0500
> +++ linux-2.6/mm/page_alloc.c	2009-10-01 09:36:19.000000000 -0500
> @@ -3270,23 +3270,42 @@ void zone_pcp_update(struct zone *zone)
>  	stop_machine(__zone_pcp_update, zone, NULL);
>  }
>  
> -static __meminit void zone_pcp_init(struct zone *zone)
> +/*
> + * Early setup of pagesets.
> + *
> + * In the NUMA case the pageset setup simply results in all zones pcp
> + * pointer being directed at a per cpu pageset with zero batchsize.
> + *

The batchsize becomes 1, not 0 if you look at setup_pageset() but that aside,
it's unclear from the comment *why* the batchsize is 1 in the NUMA case.
Maybe something like the following?

=====
In the NUMA case, the boot_pageset is used until the slab allocator is
available to allocate per-zone pagesets as each CPU is brought up. At
this point, the batchsize is set to 1 to prevent pages "leaking" onto the
boot_pageset freelists.
=====

Otherwise, nothing in the patch jumped out at me other than to double
check CPU-up events actually result in process_zones() being called and
that boot_pageset is not being accidentally used in the long term.

> + * This means that every free and every allocation occurs directly from
> + * the buddy allocator tables.
> + *
> + * The pageset never queues pages during early boot and is therefore usable
> + * for every type of zone.
> + */
> +__meminit void setup_pagesets(void)
>  {
>  	int cpu;
> -	unsigned long batch = zone_batchsize(zone);
> +	struct zone *zone;
>  
> -	for (cpu = 0; cpu < NR_CPUS; cpu++) {
> +	for_each_zone(zone) {
>  #ifdef CONFIG_NUMA
> -		/* Early boot. Slab allocator not functional yet */
> -		zone_pcp(zone, cpu) = &boot_pageset[cpu];
> -		setup_pageset(&boot_pageset[cpu],0);
> +		unsigned long batch = 0;
> +
> +		for (cpu = 0; cpu < NR_CPUS; cpu++) {
> +			/* Early boot. Slab allocator not functional yet */
> +			zone_pcp(zone, cpu) = &boot_pageset[cpu];
> +		}
>  #else
> -		setup_pageset(zone_pcp(zone,cpu), batch);
> +		unsigned long batch = zone_batchsize(zone);
>  #endif
> +
> +		for_each_possible_cpu(cpu)
> +			setup_pageset(zone_pcp(zone, cpu), batch);
> +
> +		if (zone->present_pages)
> +			printk(KERN_DEBUG "  %s zone: %lu pages, LIFO batch:%lu\n",
> +				zone->name, zone->present_pages, batch);
>  	}
> -	if (zone->present_pages)
> -		printk(KERN_DEBUG "  %s zone: %lu pages, LIFO batch:%lu\n",
> -			zone->name, zone->present_pages, batch);
>  }
>  
>  __meminit int init_currently_empty_zone(struct zone *zone,
> @@ -3841,7 +3860,6 @@ static void __paginginit free_area_init_
>  
>  		zone->prev_priority = DEF_PRIORITY;
>  
> -		zone_pcp_init(zone);
>  		for_each_lru(l) {
>  			INIT_LIST_HEAD(&zone->lru[l].list);
>  			zone->reclaim_stat.nr_saved_scan[l] = 0;
> Index: linux-2.6/include/linux/mm.h
> ===================================================================
> --- linux-2.6.orig/include/linux/mm.h	2009-10-01 08:54:19.000000000 -0500
> +++ linux-2.6/include/linux/mm.h	2009-10-01 09:36:19.000000000 -0500
> @@ -1060,6 +1060,7 @@ extern void show_mem(void);
>  extern void si_meminfo(struct sysinfo * val);
>  extern void si_meminfo_node(struct sysinfo *val, int nid);
>  extern int after_bootmem;
> +extern void setup_pagesets(void);
>  
>  #ifdef CONFIG_NUMA
>  extern void setup_per_cpu_pageset(void);
> Index: linux-2.6/arch/ia64/kernel/setup.c
> ===================================================================
> --- linux-2.6.orig/arch/ia64/kernel/setup.c	2009-10-01 08:54:19.000000000 -0500
> +++ linux-2.6/arch/ia64/kernel/setup.c	2009-10-01 09:35:39.000000000 -0500
> @@ -864,6 +864,7 @@ void __init
>  setup_per_cpu_areas (void)
>  {
>  	/* start_kernel() requires this... */
> +	setup_pagesets();
>  }
>  #endif
>  
> Index: linux-2.6/arch/powerpc/kernel/setup_64.c
> ===================================================================
> --- linux-2.6.orig/arch/powerpc/kernel/setup_64.c	2009-10-01 08:54:19.000000000 -0500
> +++ linux-2.6/arch/powerpc/kernel/setup_64.c	2009-10-01 09:35:39.000000000 -0500
> @@ -578,6 +578,7 @@ static void ppc64_do_msg(unsigned int sr
>  		snprintf(buf, 128, "%s", msg);
>  		ppc_md.progress(buf, 0);
>  	}
> +	setup_pagesets();
>  }
>  
>  /* Print a boot progress message. */
> Index: linux-2.6/arch/sparc/kernel/smp_64.c
> ===================================================================
> --- linux-2.6.orig/arch/sparc/kernel/smp_64.c	2009-10-01 08:54:19.000000000 -0500
> +++ linux-2.6/arch/sparc/kernel/smp_64.c	2009-10-01 09:35:39.000000000 -0500
> @@ -1486,4 +1486,5 @@ void __init setup_per_cpu_areas(void)
>  	of_fill_in_cpu_data();
>  	if (tlb_type == hypervisor)
>  		mdesc_fill_in_cpu_data(cpu_all_mask);
> +	setup_pagesets();
>  }
> Index: linux-2.6/arch/x86/kernel/setup_percpu.c
> ===================================================================
> --- linux-2.6.orig/arch/x86/kernel/setup_percpu.c	2009-10-01 08:54:19.000000000 -0500
> +++ linux-2.6/arch/x86/kernel/setup_percpu.c	2009-10-01 09:35:39.000000000 -0500
> @@ -269,4 +269,6 @@ void __init setup_per_cpu_areas(void)
>  
>  	/* Setup cpu initialized, callin, callout masks */
>  	setup_cpu_local_masks();
> +
> +	setup_pagesets();
>  }
> Index: linux-2.6/mm/percpu.c
> ===================================================================
> --- linux-2.6.orig/mm/percpu.c	2009-10-01 08:54:19.000000000 -0500
> +++ linux-2.6/mm/percpu.c	2009-10-01 09:35:39.000000000 -0500
> @@ -2062,5 +2062,7 @@ void __init setup_per_cpu_areas(void)
>  	delta = (unsigned long)pcpu_base_addr - (unsigned long)__per_cpu_start;
>  	for_each_possible_cpu(cpu)
>  		__per_cpu_offset[cpu] = delta + pcpu_unit_offsets[cpu];
> +
> +	setup_pagesets();
>  }
>  #endif /* CONFIG_HAVE_SETUP_PER_CPU_AREA */
> 
> -- 
> --
> To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
> the body of a message to majordomo@vger.kernel.org
> More majordomo info at  http://vger.kernel.org/majordomo-info.html
> Please read the FAQ at  http://www.tux.org/lkml/
> 

-- 
Mel Gorman
Part-time Phd Student                          Linux Technology Center
University of Limerick                         IBM Dublin Software Lab

^ permalink raw reply	[flat|nested] 65+ messages in thread

* Re: [this_cpu_xx V4 12/20] Move early initialization of pagesets out of zone_wait_table_init()
  2009-10-02 14:16   ` Mel Gorman
@ 2009-10-02 17:30     ` Christoph Lameter
  2009-10-05  9:35       ` Mel Gorman
  0 siblings, 1 reply; 65+ messages in thread
From: Christoph Lameter @ 2009-10-02 17:30 UTC (permalink / raw)
  To: Mel Gorman; +Cc: akpm, linux-kernel, Tejun Heo, mingo, rusty, Pekka Enberg

On Fri, 2 Oct 2009, Mel Gorman wrote:

> On Thu, Oct 01, 2009 at 05:25:33PM -0400, cl@linux-foundation.org wrote:
> > Explicitly initialize the pagesets after the per cpu areas have been
> > initialized. This is necessary in order to be able to use per cpu
> > operations in later patches.
> >
>
> Can you be more explicit about this? I think the reasoning is as follows
>
> A later patch will use DEFINE_PER_CPU which allocates memory later in
> the boot-cycle after zones have already been initialised. Without this
> patch, use of DEFINE_PER_CPU would result in invalid memory accesses
> during pageset initialisation.

Nope. Pagesets are not statically allocated per cpu data. They are
allocated with the per cpu allocator.

The per cpu allocator is not initialized that early in boot. We cannot
allocate the pagesets then. Therefore we use a fake single item pageset
(like used now for NUMA boot) to take its place until the slab and percpu
allocators are up. Then we allocate the real pagesets.

> > -static __meminit void zone_pcp_init(struct zone *zone)
> > +/*
> > + * Early setup of pagesets.
> > + *
> > + * In the NUMA case the pageset setup simply results in all zones pcp
> > + * pointer being directed at a per cpu pageset with zero batchsize.
> > + *
>
> The batchsize becomes 1, not 0 if you look at setup_pageset() but that aside,
> it's unclear from the comment *why* the batchsize is 1 in the NUMA case.
> Maybe something like the following?
>
> =====
> In the NUMA case, the boot_pageset is used until the slab allocator is
> available to allocate per-zone pagesets as each CPU is brought up. At
> this point, the batchsize is set to 1 to prevent pages "leaking" onto the
> boot_pageset freelists.
> =====
>
> Otherwise, nothing in the patch jumped out at me other than to double
> check CPU-up events actually result in process_zones() being called and
> that boot_pageset is not being accidentally used in the long term.

This is already explained in a commment where boot_pageset is defined.
Should we add some more elaborate comments to zone_pcp_init()?



^ permalink raw reply	[flat|nested] 65+ messages in thread

* Re: [this_cpu_xx V4 12/20] Move early initialization of pagesets out of zone_wait_table_init()
  2009-10-02 17:30     ` Christoph Lameter
@ 2009-10-05  9:35       ` Mel Gorman
  0 siblings, 0 replies; 65+ messages in thread
From: Mel Gorman @ 2009-10-05  9:35 UTC (permalink / raw)
  To: Christoph Lameter
  Cc: akpm, linux-kernel, Tejun Heo, mingo, rusty, Pekka Enberg

On Fri, Oct 02, 2009 at 01:30:58PM -0400, Christoph Lameter wrote:
> On Fri, 2 Oct 2009, Mel Gorman wrote:
> 
> > On Thu, Oct 01, 2009 at 05:25:33PM -0400, cl@linux-foundation.org wrote:
> > > Explicitly initialize the pagesets after the per cpu areas have been
> > > initialized. This is necessary in order to be able to use per cpu
> > > operations in later patches.
> > >
> >
> > Can you be more explicit about this? I think the reasoning is as follows
> >
> > A later patch will use DEFINE_PER_CPU which allocates memory later in
> > the boot-cycle after zones have already been initialised. Without this
> > patch, use of DEFINE_PER_CPU would result in invalid memory accesses
> > during pageset initialisation.
> 
> Nope. Pagesets are not statically allocated per cpu data. They are
> allocated with the per cpu allocator.
> 

I don't think I said they were statically allocated.

> The per cpu allocator is not initialized that early in boot. We cannot
> allocate the pagesets then. Therefore we use a fake single item pageset
> (like used now for NUMA boot) to take its place until the slab and percpu
> allocators are up. Then we allocate the real pagesets.
> 

Ok, that explanation matches my expectations. Thanks.

> > > -static __meminit void zone_pcp_init(struct zone *zone)
> > > +/*
> > > + * Early setup of pagesets.
> > > + *
> > > + * In the NUMA case the pageset setup simply results in all zones pcp
> > > + * pointer being directed at a per cpu pageset with zero batchsize.
> > > + *
> >
> > The batchsize becomes 1, not 0 if you look at setup_pageset() but that aside,
> > it's unclear from the comment *why* the batchsize is 1 in the NUMA case.
> > Maybe something like the following?
> >
> > =====
> > In the NUMA case, the boot_pageset is used until the slab allocator is
> > available to allocate per-zone pagesets as each CPU is brought up. At
> > this point, the batchsize is set to 1 to prevent pages "leaking" onto the
> > boot_pageset freelists.
> > =====
> >
> > Otherwise, nothing in the patch jumped out at me other than to double
> > check CPU-up events actually result in process_zones() being called and
> > that boot_pageset is not being accidentally used in the long term.
> 
> This is already explained in a commment where boot_pageset is defined.
> Should we add some more elaborate comments to zone_pcp_init()?
> 

I suppose not. Point them to the comment in boot_pageset so there is a
chance the comment stays up to date.

-- 
Mel Gorman
Part-time Phd Student                          Linux Technology Center
University of Limerick                         IBM Dublin Software Lab

^ permalink raw reply	[flat|nested] 65+ messages in thread

* Re: [this_cpu_xx V4 12/20] Move early initialization of pagesets out of zone_wait_table_init()
  2009-10-01 21:25 ` [this_cpu_xx V4 12/20] Move early initialization of pagesets out of zone_wait_table_init() cl
  2009-10-02 14:16   ` Mel Gorman
@ 2009-10-03 10:29   ` Tejun Heo
  2009-10-05 14:39     ` Christoph Lameter
  1 sibling, 1 reply; 65+ messages in thread
From: Tejun Heo @ 2009-10-03 10:29 UTC (permalink / raw)
  To: cl; +Cc: akpm, linux-kernel, Mel Gorman, mingo, rusty, Pekka Enberg

cl@linux-foundation.org wrote:
> Explicitly initialize the pagesets after the per cpu areas have been
> initialized. This is necessary in order to be able to use per cpu
> operations in later patches.
> 
> Cc: Mel Gorman <mel@csn.ul.ie>
> Signed-off-by: Christoph Lameter <cl@linux-foundation.org>
> 
> ---
>  arch/ia64/kernel/setup.c       |    1 +
>  arch/powerpc/kernel/setup_64.c |    1 +
>  arch/sparc/kernel/smp_64.c     |    1 +
>  arch/x86/kernel/setup_percpu.c |    2 ++
>  include/linux/mm.h             |    1 +
>  mm/page_alloc.c                |   40 +++++++++++++++++++++++++++++-----------
>  mm/percpu.c                    |    2 ++

Hmmm... why not call this function from start_kernel() after calling
setup_per_cpu_areas() instead of modifying every implementation of
setup_per_cpu_areas()?

Thanks.

-- 
tejun

^ permalink raw reply	[flat|nested] 65+ messages in thread

* Re: [this_cpu_xx V4 12/20] Move early initialization of pagesets out of zone_wait_table_init()
  2009-10-03 10:29   ` Tejun Heo
@ 2009-10-05 14:39     ` Christoph Lameter
  2009-10-05 15:01       ` Tejun Heo
  0 siblings, 1 reply; 65+ messages in thread
From: Christoph Lameter @ 2009-10-05 14:39 UTC (permalink / raw)
  To: Tejun Heo; +Cc: akpm, linux-kernel, Mel Gorman, mingo, rusty, Pekka Enberg

On Sat, 3 Oct 2009, Tejun Heo wrote:

> >  arch/ia64/kernel/setup.c       |    1 +
> >  arch/powerpc/kernel/setup_64.c |    1 +
> >  arch/sparc/kernel/smp_64.c     |    1 +
> >  arch/x86/kernel/setup_percpu.c |    2 ++
> >  include/linux/mm.h             |    1 +
> >  mm/page_alloc.c                |   40 +++++++++++++++++++++++++++++-----------
> >  mm/percpu.c                    |    2 ++
>
> Hmmm... why not call this function from start_kernel() after calling
> setup_per_cpu_areas() instead of modifying every implementation of
> setup_per_cpu_areas()?

Because it has to be called immediately after per cpu areas become
available. Otherwise page allocator uses will fail.


^ permalink raw reply	[flat|nested] 65+ messages in thread

* Re: [this_cpu_xx V4 12/20] Move early initialization of pagesets out of zone_wait_table_init()
  2009-10-05 14:39     ` Christoph Lameter
@ 2009-10-05 15:01       ` Tejun Heo
  2009-10-05 15:06         ` Christoph Lameter
  0 siblings, 1 reply; 65+ messages in thread
From: Tejun Heo @ 2009-10-05 15:01 UTC (permalink / raw)
  To: Christoph Lameter
  Cc: akpm, linux-kernel, Mel Gorman, mingo, rusty, Pekka Enberg

Christoph Lameter wrote:
> On Sat, 3 Oct 2009, Tejun Heo wrote:
> 
>>>  arch/ia64/kernel/setup.c       |    1 +
>>>  arch/powerpc/kernel/setup_64.c |    1 +
>>>  arch/sparc/kernel/smp_64.c     |    1 +
>>>  arch/x86/kernel/setup_percpu.c |    2 ++
>>>  include/linux/mm.h             |    1 +
>>>  mm/page_alloc.c                |   40 +++++++++++++++++++++++++++++-----------
>>>  mm/percpu.c                    |    2 ++
>> Hmmm... why not call this function from start_kernel() after calling
>> setup_per_cpu_areas() instead of modifying every implementation of
>> setup_per_cpu_areas()?
> 
> Because it has to be called immediately after per cpu areas become
> available. Otherwise page allocator uses will fail.

But... setup_per_cpu_area() isn't supposed to call page allocator and
start_kernel() is the only caller of setup_per_cpu_areas().

-- 
tejun

^ permalink raw reply	[flat|nested] 65+ messages in thread

* Re: [this_cpu_xx V4 12/20] Move early initialization of pagesets out of zone_wait_table_init()
  2009-10-05 15:01       ` Tejun Heo
@ 2009-10-05 15:06         ` Christoph Lameter
  2009-10-05 15:21           ` Tejun Heo
  0 siblings, 1 reply; 65+ messages in thread
From: Christoph Lameter @ 2009-10-05 15:06 UTC (permalink / raw)
  To: Tejun Heo; +Cc: akpm, linux-kernel, Mel Gorman, mingo, rusty, Pekka Enberg

On Tue, 6 Oct 2009, Tejun Heo wrote:

> > Because it has to be called immediately after per cpu areas become
> > available. Otherwise page allocator uses will fail.
>
> But... setup_per_cpu_area() isn't supposed to call page allocator and
> start_kernel() is the only caller of setup_per_cpu_areas().

setup_per_cpu_areas() is not calling the page allocator. However, any
caller after that can call the page allocator.

There are various arch implementations that do their own implementation of
setup_per_cpu_areas() at their own time (Check sparc and ia64 for
example).



^ permalink raw reply	[flat|nested] 65+ messages in thread

* Re: [this_cpu_xx V4 12/20] Move early initialization of pagesets out of zone_wait_table_init()
  2009-10-05 15:06         ` Christoph Lameter
@ 2009-10-05 15:21           ` Tejun Heo
  2009-10-05 15:28             ` Christoph Lameter
  0 siblings, 1 reply; 65+ messages in thread
From: Tejun Heo @ 2009-10-05 15:21 UTC (permalink / raw)
  To: Christoph Lameter
  Cc: akpm, linux-kernel, Mel Gorman, mingo, rusty, Pekka Enberg

Christoph Lameter wrote:
> On Tue, 6 Oct 2009, Tejun Heo wrote:
> 
>>> Because it has to be called immediately after per cpu areas become
>>> available. Otherwise page allocator uses will fail.
>> But... setup_per_cpu_area() isn't supposed to call page allocator and
>> start_kernel() is the only caller of setup_per_cpu_areas().
> 
> setup_per_cpu_areas() is not calling the page allocator. However, any
> caller after that can call the page allocator.
> 
> There are various arch implementations that do their own implementation of
> setup_per_cpu_areas() at their own time (Check sparc and ia64 for
> example).

sparc is doing everything on setup_per_cpu_areas().  ia64 is an
exception.  Hmmm... I really don't like scattering mostly unrelated
init call to every setup_per_cpu_areas() implementation.  How about
adding "static bool initialized __initdata" to the function and allow
it to be called earlier when necessary (only ia64 at the moment)?

Thanks.

-- 
tejun

^ permalink raw reply	[flat|nested] 65+ messages in thread

* Re: [this_cpu_xx V4 12/20] Move early initialization of pagesets out of zone_wait_table_init()
  2009-10-05 15:21           ` Tejun Heo
@ 2009-10-05 15:28             ` Christoph Lameter
  2009-10-05 15:41               ` Tejun Heo
  0 siblings, 1 reply; 65+ messages in thread
From: Christoph Lameter @ 2009-10-05 15:28 UTC (permalink / raw)
  To: Tejun Heo; +Cc: akpm, linux-kernel, Mel Gorman, mingo, rusty, Pekka Enberg

On Tue, 6 Oct 2009, Tejun Heo wrote:

> > There are various arch implementations that do their own implementation of
> > setup_per_cpu_areas() at their own time (Check sparc and ia64 for
> > example).
>
> sparc is doing everything on setup_per_cpu_areas().  ia64 is an
> exception.  Hmmm... I really don't like scattering mostly unrelated
> init call to every setup_per_cpu_areas() implementation.  How about
> adding "static bool initialized __initdata" to the function and allow
> it to be called earlier when necessary (only ia64 at the moment)?

It would be best to consolidate all setup_per_cpu_areas() to work at the
same time during bootup. How about having a single setup_per_cpu_areas()
function and then within the function allow arch specific processing. Try
to share as much code as possible?


^ permalink raw reply	[flat|nested] 65+ messages in thread

* Re: [this_cpu_xx V4 12/20] Move early initialization of pagesets out of zone_wait_table_init()
  2009-10-05 15:28             ` Christoph Lameter
@ 2009-10-05 15:41               ` Tejun Heo
  2009-10-05 15:39                 ` Christoph Lameter
  0 siblings, 1 reply; 65+ messages in thread
From: Tejun Heo @ 2009-10-05 15:41 UTC (permalink / raw)
  To: Christoph Lameter
  Cc: akpm, linux-kernel, Mel Gorman, mingo, rusty, Pekka Enberg

Christoph Lameter wrote:
> On Tue, 6 Oct 2009, Tejun Heo wrote:
> 
>>> There are various arch implementations that do their own implementation of
>>> setup_per_cpu_areas() at their own time (Check sparc and ia64 for
>>> example).
>> sparc is doing everything on setup_per_cpu_areas().  ia64 is an
>> exception.  Hmmm... I really don't like scattering mostly unrelated
>> init call to every setup_per_cpu_areas() implementation.  How about
>> adding "static bool initialized __initdata" to the function and allow
>> it to be called earlier when necessary (only ia64 at the moment)?
> 
> It would be best to consolidate all setup_per_cpu_areas() to work at the
> same time during bootup. How about having a single setup_per_cpu_areas()
> function and then within the function allow arch specific processing. Try
> to share as much code as possible?

I'm under the impression that ia64 needs its percpu areas setup
earlier during the boot so I'm not sure what you have in mind but as
long as the call isn't scattered over different setup_per_cpu_areas()
implementations, I'm okay.

Thanks.

-- 
tejun


^ permalink raw reply	[flat|nested] 65+ messages in thread

* Re: [this_cpu_xx V4 12/20] Move early initialization of pagesets out of zone_wait_table_init()
  2009-10-05 15:41               ` Tejun Heo
@ 2009-10-05 15:39                 ` Christoph Lameter
  0 siblings, 0 replies; 65+ messages in thread
From: Christoph Lameter @ 2009-10-05 15:39 UTC (permalink / raw)
  To: Tejun Heo; +Cc: akpm, linux-kernel, Mel Gorman, mingo, rusty, Pekka Enberg

On Tue, 6 Oct 2009, Tejun Heo wrote:

> I'm under the impression that ia64 needs its percpu areas setup
> earlier during the boot so I'm not sure what you have in mind but as
> long as the call isn't scattered over different setup_per_cpu_areas()
> implementations, I'm okay.

Currently it does. But there must be some way to untangle that mess.


^ permalink raw reply	[flat|nested] 65+ messages in thread

* [this_cpu_xx V4 13/20] this_cpu_ops: page allocator conversion
  2009-10-01 21:25 [this_cpu_xx V4 00/20] Introduce per cpu atomic operations and avoid per cpu address arithmetic cl
                   ` (11 preceding siblings ...)
  2009-10-01 21:25 ` [this_cpu_xx V4 12/20] Move early initialization of pagesets out of zone_wait_table_init() cl
@ 2009-10-01 21:25 ` cl
  2009-10-02 15:14   ` Mel Gorman
  2009-10-01 21:25 ` [this_cpu_xx V4 14/20] this_cpu ops: Remove pageset_notifier cl
                   ` (7 subsequent siblings)
  20 siblings, 1 reply; 65+ messages in thread
From: cl @ 2009-10-01 21:25 UTC (permalink / raw)
  To: akpm; +Cc: linux-kernel, Mel Gorman, Tejun Heo, mingo, rusty, Pekka Enberg

[-- Attachment #1: this_cpu_page_allocator --]
[-- Type: text/plain, Size: 12766 bytes --]

Use the per cpu allocator functionality to avoid per cpu arrays in struct zone.

This drastically reduces the size of struct zone for systems with large
amounts of processors and allows placement of critical variables of struct
zone in one cacheline even on very large systems.

Another effect is that the pagesets of one processor are placed near one
another. If multiple pagesets from different zones fit into one cacheline
then additional cacheline fetches can be avoided on the hot paths when
allocating memory from multiple zones.

Bootstrap becomes simpler if we use the same scheme for UP, SMP, NUMA. #ifdefs
are reduced and we can drop the zone_pcp macro.

Hotplug handling is also simplified since cpu alloc can bring up and
shut down cpu areas for a specific cpu as a whole. So there is no need to
allocate or free individual pagesets.

Cc: Mel Gorman <mel@csn.ul.ie>
Signed-off-by: Christoph Lameter <cl@linux-foundation.org>

---
 include/linux/mm.h     |    4 -
 include/linux/mmzone.h |   12 ---
 mm/page_alloc.c        |  156 ++++++++++++++-----------------------------------
 mm/vmstat.c            |   14 ++--
 4 files changed, 55 insertions(+), 131 deletions(-)

Index: linux-2.6/include/linux/mm.h
===================================================================
--- linux-2.6.orig/include/linux/mm.h	2009-09-29 09:30:37.000000000 -0500
+++ linux-2.6/include/linux/mm.h	2009-09-29 09:30:39.000000000 -0500
@@ -1062,11 +1062,7 @@ extern void si_meminfo_node(struct sysin
 extern int after_bootmem;
 extern void setup_pagesets(void);
 
-#ifdef CONFIG_NUMA
 extern void setup_per_cpu_pageset(void);
-#else
-static inline void setup_per_cpu_pageset(void) {}
-#endif
 
 extern void zone_pcp_update(struct zone *zone);
 
Index: linux-2.6/include/linux/mmzone.h
===================================================================
--- linux-2.6.orig/include/linux/mmzone.h	2009-09-29 09:30:25.000000000 -0500
+++ linux-2.6/include/linux/mmzone.h	2009-09-29 09:30:39.000000000 -0500
@@ -184,13 +184,7 @@ struct per_cpu_pageset {
 	s8 stat_threshold;
 	s8 vm_stat_diff[NR_VM_ZONE_STAT_ITEMS];
 #endif
-} ____cacheline_aligned_in_smp;
-
-#ifdef CONFIG_NUMA
-#define zone_pcp(__z, __cpu) ((__z)->pageset[(__cpu)])
-#else
-#define zone_pcp(__z, __cpu) (&(__z)->pageset[(__cpu)])
-#endif
+};
 
 #endif /* !__GENERATING_BOUNDS.H */
 
@@ -306,10 +300,8 @@ struct zone {
 	 */
 	unsigned long		min_unmapped_pages;
 	unsigned long		min_slab_pages;
-	struct per_cpu_pageset	*pageset[NR_CPUS];
-#else
-	struct per_cpu_pageset	pageset[NR_CPUS];
 #endif
+	struct per_cpu_pageset	*pageset;
 	/*
 	 * free areas of different sizes
 	 */
Index: linux-2.6/mm/page_alloc.c
===================================================================
--- linux-2.6.orig/mm/page_alloc.c	2009-09-29 09:30:37.000000000 -0500
+++ linux-2.6/mm/page_alloc.c	2009-09-29 09:30:50.000000000 -0500
@@ -1011,7 +1011,7 @@ static void drain_pages(unsigned int cpu
 		struct per_cpu_pageset *pset;
 		struct per_cpu_pages *pcp;
 
-		pset = zone_pcp(zone, cpu);
+		pset = per_cpu_ptr(zone->pageset, cpu);
 
 		pcp = &pset->pcp;
 		local_irq_save(flags);
@@ -1098,7 +1098,7 @@ static void free_hot_cold_page(struct pa
 	arch_free_page(page, 0);
 	kernel_map_pages(page, 1, 0);
 
-	pcp = &zone_pcp(zone, get_cpu())->pcp;
+	pcp = &this_cpu_ptr(zone->pageset)->pcp;
 	migratetype = get_pageblock_migratetype(page);
 	set_page_private(page, migratetype);
 	local_irq_save(flags);
@@ -1133,7 +1133,6 @@ static void free_hot_cold_page(struct pa
 
 out:
 	local_irq_restore(flags);
-	put_cpu();
 }
 
 void free_hot_page(struct page *page)
@@ -1183,15 +1182,13 @@ struct page *buffered_rmqueue(struct zon
 	unsigned long flags;
 	struct page *page;
 	int cold = !!(gfp_flags & __GFP_COLD);
-	int cpu;
 
 again:
-	cpu  = get_cpu();
 	if (likely(order == 0)) {
 		struct per_cpu_pages *pcp;
 		struct list_head *list;
 
-		pcp = &zone_pcp(zone, cpu)->pcp;
+		pcp = &this_cpu_ptr(zone->pageset)->pcp;
 		list = &pcp->lists[migratetype];
 		local_irq_save(flags);
 		if (list_empty(list)) {
@@ -1234,7 +1231,6 @@ again:
 	__count_zone_vm_events(PGALLOC, zone, 1 << order);
 	zone_statistics(preferred_zone, zone);
 	local_irq_restore(flags);
-	put_cpu();
 
 	VM_BUG_ON(bad_range(zone, page));
 	if (prep_new_page(page, order, gfp_flags))
@@ -1243,7 +1239,6 @@ again:
 
 failed:
 	local_irq_restore(flags);
-	put_cpu();
 	return NULL;
 }
 
@@ -2172,7 +2167,7 @@ void show_free_areas(void)
 		for_each_online_cpu(cpu) {
 			struct per_cpu_pageset *pageset;
 
-			pageset = zone_pcp(zone, cpu);
+			pageset = per_cpu_ptr(zone->pageset, cpu);
 
 			printk("CPU %4d: hi:%5d, btch:%4d usd:%4d\n",
 			       cpu, pageset->pcp.high,
@@ -3087,7 +3082,6 @@ static void setup_pagelist_highmark(stru
 }
 
 
-#ifdef CONFIG_NUMA
 /*
  * Boot pageset table. One per cpu which is going to be used for all
  * zones and all nodes. The parameters will be set in such a way
@@ -3095,112 +3089,67 @@ static void setup_pagelist_highmark(stru
  * the buddy list. This is safe since pageset manipulation is done
  * with interrupts disabled.
  *
- * Some NUMA counter updates may also be caught by the boot pagesets.
- *
- * The boot_pagesets must be kept even after bootup is complete for
- * unused processors and/or zones. They do play a role for bootstrapping
- * hotplugged processors.
+ * Some counter updates may also be caught by the boot pagesets.
  *
  * zoneinfo_show() and maybe other functions do
  * not check if the processor is online before following the pageset pointer.
  * Other parts of the kernel may not check if the zone is available.
  */
-static struct per_cpu_pageset boot_pageset[NR_CPUS];
-
-/*
- * Dynamically allocate memory for the
- * per cpu pageset array in struct zone.
- */
-static int __cpuinit process_zones(int cpu)
-{
-	struct zone *zone, *dzone;
-	int node = cpu_to_node(cpu);
-
-	node_set_state(node, N_CPU);	/* this node has a cpu */
-
-	for_each_populated_zone(zone) {
-		zone_pcp(zone, cpu) = kmalloc_node(sizeof(struct per_cpu_pageset),
-					 GFP_KERNEL, node);
-		if (!zone_pcp(zone, cpu))
-			goto bad;
-
-		setup_pageset(zone_pcp(zone, cpu), zone_batchsize(zone));
-
-		if (percpu_pagelist_fraction)
-			setup_pagelist_highmark(zone_pcp(zone, cpu),
-			 	(zone->present_pages / percpu_pagelist_fraction));
-	}
-
-	return 0;
-bad:
-	for_each_zone(dzone) {
-		if (!populated_zone(dzone))
-			continue;
-		if (dzone == zone)
-			break;
-		kfree(zone_pcp(dzone, cpu));
-		zone_pcp(dzone, cpu) = &boot_pageset[cpu];
-	}
-	return -ENOMEM;
-}
-
-static inline void free_zone_pagesets(int cpu)
-{
-	struct zone *zone;
-
-	for_each_zone(zone) {
-		struct per_cpu_pageset *pset = zone_pcp(zone, cpu);
-
-		/* Free per_cpu_pageset if it is slab allocated */
-		if (pset != &boot_pageset[cpu])
-			kfree(pset);
-		zone_pcp(zone, cpu) = &boot_pageset[cpu];
-	}
-}
+static DEFINE_PER_CPU(struct per_cpu_pageset, boot_pageset);
 
 static int __cpuinit pageset_cpuup_callback(struct notifier_block *nfb,
 		unsigned long action,
 		void *hcpu)
 {
 	int cpu = (long)hcpu;
-	int ret = NOTIFY_OK;
 
 	switch (action) {
 	case CPU_UP_PREPARE:
 	case CPU_UP_PREPARE_FROZEN:
-		if (process_zones(cpu))
-			ret = NOTIFY_BAD;
-		break;
-	case CPU_UP_CANCELED:
-	case CPU_UP_CANCELED_FROZEN:
-	case CPU_DEAD:
-	case CPU_DEAD_FROZEN:
-		free_zone_pagesets(cpu);
+		node_set_state(cpu_to_node(cpu), N_CPU);
 		break;
 	default:
 		break;
 	}
-	return ret;
+	return NOTIFY_OK;
 }
 
 static struct notifier_block __cpuinitdata pageset_notifier =
 	{ &pageset_cpuup_callback, NULL, 0 };
 
+/*
+ * Allocate per cpu pagesets and initialize them.
+ * Before this call only boot pagesets were available.
+ * Boot pagesets will no longer be used after this call is complete.
+ */
 void __init setup_per_cpu_pageset(void)
 {
-	int err;
+	struct zone *zone;
+	int cpu;
+
+	for_each_populated_zone(zone) {
+		zone->pageset = alloc_percpu(struct per_cpu_pageset);
 
-	/* Initialize per_cpu_pageset for cpu 0.
-	 * A cpuup callback will do this for every cpu
-	 * as it comes online
+		for_each_possible_cpu(cpu) {
+			struct per_cpu_pageset *pcp = per_cpu_ptr(zone->pageset, cpu);
+
+			setup_pageset(pcp, zone_batchsize(zone));
+
+			if (percpu_pagelist_fraction)
+				setup_pagelist_highmark(pcp,
+					(zone->present_pages /
+						percpu_pagelist_fraction));
+		}
+	}
+
+	/*
+	 * The boot cpu is always the first active.
+	 * The boot node has a processor
 	 */
-	err = process_zones(smp_processor_id());
-	BUG_ON(err);
+	node_set_state(cpu_to_node(smp_processor_id()), N_CPU);
 	register_cpu_notifier(&pageset_notifier);
 }
 
-#endif
-
 static noinline __init_refok
 int zone_wait_table_init(struct zone *zone, unsigned long zone_size_pages)
 {
@@ -3254,7 +3203,7 @@ static int __zone_pcp_update(void *data)
 		struct per_cpu_pageset *pset;
 		struct per_cpu_pages *pcp;
 
-		pset = zone_pcp(zone, cpu);
+		pset = per_cpu_ptr(zone->pageset, cpu);
 		pcp = &pset->pcp;
 
 		local_irq_save(flags);
@@ -3272,15 +3221,7 @@ void zone_pcp_update(struct zone *zone)
 
 /*
  * Early setup of pagesets.
- *
- * In the NUMA case the pageset setup simply results in all zones pcp
- * pointer being directed at a per cpu pageset with zero batchsize.
- *
- * This means that every free and every allocation occurs directly from
- * the buddy allocator tables.
- *
- * The pageset never queues pages during early boot and is therefore usable
- * for every type of zone.
+ * At this point various allocators are not operational yet.
  */
 __meminit void setup_pagesets(void)
 {
@@ -3288,23 +3229,15 @@ __meminit void setup_pagesets(void)
 	struct zone *zone;
 
 	for_each_zone(zone) {
-#ifdef CONFIG_NUMA
-		unsigned long batch = 0;
-
-		for (cpu = 0; cpu < NR_CPUS; cpu++) {
-			/* Early boot. Slab allocator not functional yet */
-			zone_pcp(zone, cpu) = &boot_pageset[cpu];
-		}
-#else
-		unsigned long batch = zone_batchsize(zone);
-#endif
+		zone->pageset = &per_cpu_var(boot_pageset);
 
+		/*
+		 * Special pagesets with zero elements so that frees
+		 * and allocations are not buffered at all.
+		 */
 		for_each_possible_cpu(cpu)
-			setup_pageset(zone_pcp(zone, cpu), batch);
+			setup_pageset(per_cpu_ptr(zone->pageset, cpu), 0);
 
-		if (zone->present_pages)
-			printk(KERN_DEBUG "  %s zone: %lu pages, LIFO batch:%lu\n",
-				zone->name, zone->present_pages, batch);
 	}
 }
 
@@ -4818,10 +4751,11 @@ int percpu_pagelist_fraction_sysctl_hand
 	if (!write || (ret == -EINVAL))
 		return ret;
 	for_each_populated_zone(zone) {
-		for_each_online_cpu(cpu) {
+		for_each_possible_cpu(cpu) {
 			unsigned long  high;
 			high = zone->present_pages / percpu_pagelist_fraction;
-			setup_pagelist_highmark(zone_pcp(zone, cpu), high);
+			setup_pagelist_highmark(
+				per_cpu_ptr(zone->pageset, cpu), high);
 		}
 	}
 	return 0;
Index: linux-2.6/mm/vmstat.c
===================================================================
--- linux-2.6.orig/mm/vmstat.c	2009-09-29 09:30:25.000000000 -0500
+++ linux-2.6/mm/vmstat.c	2009-09-29 09:30:43.000000000 -0500
@@ -139,7 +139,8 @@ static void refresh_zone_stat_thresholds
 		threshold = calculate_threshold(zone);
 
 		for_each_online_cpu(cpu)
-			zone_pcp(zone, cpu)->stat_threshold = threshold;
+			per_cpu_ptr(zone->pageset, cpu)->stat_threshold
+							= threshold;
 	}
 }
 
@@ -149,7 +150,8 @@ static void refresh_zone_stat_thresholds
 void __mod_zone_page_state(struct zone *zone, enum zone_stat_item item,
 				int delta)
 {
-	struct per_cpu_pageset *pcp = zone_pcp(zone, smp_processor_id());
+	struct per_cpu_pageset *pcp = this_cpu_ptr(zone->pageset);
+
 	s8 *p = pcp->vm_stat_diff + item;
 	long x;
 
@@ -202,7 +204,7 @@ EXPORT_SYMBOL(mod_zone_page_state);
  */
 void __inc_zone_state(struct zone *zone, enum zone_stat_item item)
 {
-	struct per_cpu_pageset *pcp = zone_pcp(zone, smp_processor_id());
+	struct per_cpu_pageset *pcp = this_cpu_ptr(zone->pageset);
 	s8 *p = pcp->vm_stat_diff + item;
 
 	(*p)++;
@@ -223,7 +225,7 @@ EXPORT_SYMBOL(__inc_zone_page_state);
 
 void __dec_zone_state(struct zone *zone, enum zone_stat_item item)
 {
-	struct per_cpu_pageset *pcp = zone_pcp(zone, smp_processor_id());
+	struct per_cpu_pageset *pcp = this_cpu_ptr(zone->pageset);
 	s8 *p = pcp->vm_stat_diff + item;
 
 	(*p)--;
@@ -300,7 +302,7 @@ void refresh_cpu_vm_stats(int cpu)
 	for_each_populated_zone(zone) {
 		struct per_cpu_pageset *p;
 
-		p = zone_pcp(zone, cpu);
+		p = per_cpu_ptr(zone->pageset, cpu);
 
 		for (i = 0; i < NR_VM_ZONE_STAT_ITEMS; i++)
 			if (p->vm_stat_diff[i]) {
@@ -738,7 +740,7 @@ static void zoneinfo_show_print(struct s
 	for_each_online_cpu(i) {
 		struct per_cpu_pageset *pageset;
 
-		pageset = zone_pcp(zone, i);
+		pageset = per_cpu_ptr(zone->pageset, i);
 		seq_printf(m,
 			   "\n    cpu: %i"
 			   "\n              count: %i"

-- 

^ permalink raw reply	[flat|nested] 65+ messages in thread

* Re: [this_cpu_xx V4 13/20] this_cpu_ops: page allocator conversion
  2009-10-01 21:25 ` [this_cpu_xx V4 13/20] this_cpu_ops: page allocator conversion cl
@ 2009-10-02 15:14   ` Mel Gorman
  2009-10-02 17:39     ` Christoph Lameter
  0 siblings, 1 reply; 65+ messages in thread
From: Mel Gorman @ 2009-10-02 15:14 UTC (permalink / raw)
  To: cl; +Cc: akpm, linux-kernel, Tejun Heo, mingo, rusty, Pekka Enberg

On Thu, Oct 01, 2009 at 05:25:34PM -0400, cl@linux-foundation.org wrote:
> Use the per cpu allocator functionality to avoid per cpu arrays in struct zone.
> 
> This drastically reduces the size of struct zone for systems with large
> amounts of processors and allows placement of critical variables of struct
> zone in one cacheline even on very large systems.
> 

This seems reasonably accurate. The largest shrink is on !NUMA configured
systems but the NUMA case deletes a lot of pointers.

> Another effect is that the pagesets of one processor are placed near one
> another. If multiple pagesets from different zones fit into one cacheline
> then additional cacheline fetches can be avoided on the hot paths when
> allocating memory from multiple zones.
> 

Out of curiousity, how common an occurance is it that a CPU allocate from
multiple zones? I would have thought it was rare but I never checked
either.

> Bootstrap becomes simpler if we use the same scheme for UP, SMP, NUMA. #ifdefs
> are reduced and we can drop the zone_pcp macro.
> 
> Hotplug handling is also simplified since cpu alloc can bring up and
> shut down cpu areas for a specific cpu as a whole. So there is no need to
> allocate or free individual pagesets.
> 
> Cc: Mel Gorman <mel@csn.ul.ie>
> Signed-off-by: Christoph Lameter <cl@linux-foundation.org>
> 
> ---
>  include/linux/mm.h     |    4 -
>  include/linux/mmzone.h |   12 ---
>  mm/page_alloc.c        |  156 ++++++++++++++-----------------------------------
>  mm/vmstat.c            |   14 ++--
>  4 files changed, 55 insertions(+), 131 deletions(-)
> 
> Index: linux-2.6/include/linux/mm.h
> ===================================================================
> --- linux-2.6.orig/include/linux/mm.h	2009-09-29 09:30:37.000000000 -0500
> +++ linux-2.6/include/linux/mm.h	2009-09-29 09:30:39.000000000 -0500
> @@ -1062,11 +1062,7 @@ extern void si_meminfo_node(struct sysin
>  extern int after_bootmem;
>  extern void setup_pagesets(void);
>  
> -#ifdef CONFIG_NUMA
>  extern void setup_per_cpu_pageset(void);
> -#else
> -static inline void setup_per_cpu_pageset(void) {}
> -#endif
>  
>  extern void zone_pcp_update(struct zone *zone);
>  
> Index: linux-2.6/include/linux/mmzone.h
> ===================================================================
> --- linux-2.6.orig/include/linux/mmzone.h	2009-09-29 09:30:25.000000000 -0500
> +++ linux-2.6/include/linux/mmzone.h	2009-09-29 09:30:39.000000000 -0500
> @@ -184,13 +184,7 @@ struct per_cpu_pageset {
>  	s8 stat_threshold;
>  	s8 vm_stat_diff[NR_VM_ZONE_STAT_ITEMS];
>  #endif
> -} ____cacheline_aligned_in_smp;
> -
> -#ifdef CONFIG_NUMA
> -#define zone_pcp(__z, __cpu) ((__z)->pageset[(__cpu)])
> -#else
> -#define zone_pcp(__z, __cpu) (&(__z)->pageset[(__cpu)])
> -#endif
> +};
>  
>  #endif /* !__GENERATING_BOUNDS.H */
>  
> @@ -306,10 +300,8 @@ struct zone {
>  	 */
>  	unsigned long		min_unmapped_pages;
>  	unsigned long		min_slab_pages;
> -	struct per_cpu_pageset	*pageset[NR_CPUS];
> -#else
> -	struct per_cpu_pageset	pageset[NR_CPUS];
>  #endif
> +	struct per_cpu_pageset	*pageset;
>  	/*
>  	 * free areas of different sizes
>  	 */
> Index: linux-2.6/mm/page_alloc.c
> ===================================================================
> --- linux-2.6.orig/mm/page_alloc.c	2009-09-29 09:30:37.000000000 -0500
> +++ linux-2.6/mm/page_alloc.c	2009-09-29 09:30:50.000000000 -0500
> @@ -1011,7 +1011,7 @@ static void drain_pages(unsigned int cpu
>  		struct per_cpu_pageset *pset;
>  		struct per_cpu_pages *pcp;
>  
> -		pset = zone_pcp(zone, cpu);
> +		pset = per_cpu_ptr(zone->pageset, cpu);
>  
>  		pcp = &pset->pcp;
>  		local_irq_save(flags);
> @@ -1098,7 +1098,7 @@ static void free_hot_cold_page(struct pa
>  	arch_free_page(page, 0);
>  	kernel_map_pages(page, 1, 0);
>  
> -	pcp = &zone_pcp(zone, get_cpu())->pcp;
> +	pcp = &this_cpu_ptr(zone->pageset)->pcp;
>  	migratetype = get_pageblock_migratetype(page);
>  	set_page_private(page, migratetype);
>  	local_irq_save(flags);
> @@ -1133,7 +1133,6 @@ static void free_hot_cold_page(struct pa
>  
>  out:
>  	local_irq_restore(flags);
> -	put_cpu();

Previously we get_cpu() to be preemption safe. We then disable
interrupts and potentially take a spinlock later.

Is the point we disable interrupts a pre-emption point? Even if it's not
on normal kernels, is it a preemption point on the RT kernel? 

If it is a pre-emption point, what stops us getting rescheduled on another
CPU after PCP has been looked up? Sorry if this has been brought up and
resolved already. This is my first proper look at this patchset. The same
query applies to any section that was

get_cpu()
look up PCP structure
disable interrupts
stuff
enable interrupts
put_cpu()

is converted to

this_cpu_ptr() looks up PCP
disable interrupts
enable interrupts

>  }
>  
>  void free_hot_page(struct page *page)
> @@ -1183,15 +1182,13 @@ struct page *buffered_rmqueue(struct zon
>  	unsigned long flags;
>  	struct page *page;
>  	int cold = !!(gfp_flags & __GFP_COLD);
> -	int cpu;
>  
>  again:
> -	cpu  = get_cpu();
>  	if (likely(order == 0)) {
>  		struct per_cpu_pages *pcp;
>  		struct list_head *list;
>  
> -		pcp = &zone_pcp(zone, cpu)->pcp;
> +		pcp = &this_cpu_ptr(zone->pageset)->pcp;
>  		list = &pcp->lists[migratetype];
>  		local_irq_save(flags);
>  		if (list_empty(list)) {
> @@ -1234,7 +1231,6 @@ again:
>  	__count_zone_vm_events(PGALLOC, zone, 1 << order);
>  	zone_statistics(preferred_zone, zone);
>  	local_irq_restore(flags);
> -	put_cpu();
>  
>  	VM_BUG_ON(bad_range(zone, page));
>  	if (prep_new_page(page, order, gfp_flags))
> @@ -1243,7 +1239,6 @@ again:
>  
>  failed:
>  	local_irq_restore(flags);
> -	put_cpu();
>  	return NULL;
>  }
>  
> @@ -2172,7 +2167,7 @@ void show_free_areas(void)
>  		for_each_online_cpu(cpu) {
>  			struct per_cpu_pageset *pageset;
>  
> -			pageset = zone_pcp(zone, cpu);
> +			pageset = per_cpu_ptr(zone->pageset, cpu);
>  
>  			printk("CPU %4d: hi:%5d, btch:%4d usd:%4d\n",
>  			       cpu, pageset->pcp.high,
> @@ -3087,7 +3082,6 @@ static void setup_pagelist_highmark(stru
>  }
>  
>  
> -#ifdef CONFIG_NUMA
>  /*
>   * Boot pageset table. One per cpu which is going to be used for all
>   * zones and all nodes. The parameters will be set in such a way
> @@ -3095,112 +3089,67 @@ static void setup_pagelist_highmark(stru
>   * the buddy list. This is safe since pageset manipulation is done
>   * with interrupts disabled.
>   *
> - * Some NUMA counter updates may also be caught by the boot pagesets.
> - *
> - * The boot_pagesets must be kept even after bootup is complete for
> - * unused processors and/or zones. They do play a role for bootstrapping
> - * hotplugged processors.
> + * Some counter updates may also be caught by the boot pagesets.
>   *
>   * zoneinfo_show() and maybe other functions do
>   * not check if the processor is online before following the pageset pointer.
>   * Other parts of the kernel may not check if the zone is available.
>   */
> -static struct per_cpu_pageset boot_pageset[NR_CPUS];
> -
> -/*
> - * Dynamically allocate memory for the
> - * per cpu pageset array in struct zone.
> - */
> -static int __cpuinit process_zones(int cpu)
> -{
> -	struct zone *zone, *dzone;
> -	int node = cpu_to_node(cpu);
> -
> -	node_set_state(node, N_CPU);	/* this node has a cpu */
> -
> -	for_each_populated_zone(zone) {
> -		zone_pcp(zone, cpu) = kmalloc_node(sizeof(struct per_cpu_pageset),
> -					 GFP_KERNEL, node);
> -		if (!zone_pcp(zone, cpu))
> -			goto bad;
> -
> -		setup_pageset(zone_pcp(zone, cpu), zone_batchsize(zone));
> -
> -		if (percpu_pagelist_fraction)
> -			setup_pagelist_highmark(zone_pcp(zone, cpu),
> -			 	(zone->present_pages / percpu_pagelist_fraction));
> -	}
> -
> -	return 0;
> -bad:
> -	for_each_zone(dzone) {
> -		if (!populated_zone(dzone))
> -			continue;
> -		if (dzone == zone)
> -			break;
> -		kfree(zone_pcp(dzone, cpu));
> -		zone_pcp(dzone, cpu) = &boot_pageset[cpu];
> -	}
> -	return -ENOMEM;
> -}
> -
> -static inline void free_zone_pagesets(int cpu)
> -{
> -	struct zone *zone;
> -
> -	for_each_zone(zone) {
> -		struct per_cpu_pageset *pset = zone_pcp(zone, cpu);
> -
> -		/* Free per_cpu_pageset if it is slab allocated */
> -		if (pset != &boot_pageset[cpu])
> -			kfree(pset);
> -		zone_pcp(zone, cpu) = &boot_pageset[cpu];
> -	}
> -}
> +static DEFINE_PER_CPU(struct per_cpu_pageset, boot_pageset);
>  
>  static int __cpuinit pageset_cpuup_callback(struct notifier_block *nfb,
>  		unsigned long action,
>  		void *hcpu)
>  {
>  	int cpu = (long)hcpu;
> -	int ret = NOTIFY_OK;
>  
>  	switch (action) {
>  	case CPU_UP_PREPARE:
>  	case CPU_UP_PREPARE_FROZEN:
> -		if (process_zones(cpu))
> -			ret = NOTIFY_BAD;
> -		break;
> -	case CPU_UP_CANCELED:
> -	case CPU_UP_CANCELED_FROZEN:
> -	case CPU_DEAD:
> -	case CPU_DEAD_FROZEN:
> -		free_zone_pagesets(cpu);
> +		node_set_state(cpu_to_node(cpu), N_CPU);
>  		break;
>  	default:
>  		break;
>  	}
> -	return ret;
> +	return NOTIFY_OK;
>  }
>  
>  static struct notifier_block __cpuinitdata pageset_notifier =
>  	{ &pageset_cpuup_callback, NULL, 0 };
>  
> +/*
> + * Allocate per cpu pagesets and initialize them.
> + * Before this call only boot pagesets were available.
> + * Boot pagesets will no longer be used after this call is complete.

If they are no longer used, do we get the memory back?

> + */
>  void __init setup_per_cpu_pageset(void)
>  {
> -	int err;
> +	struct zone *zone;
> +	int cpu;
> +
> +	for_each_populated_zone(zone) {
> +		zone->pageset = alloc_percpu(struct per_cpu_pageset);
>  
> -	/* Initialize per_cpu_pageset for cpu 0.
> -	 * A cpuup callback will do this for every cpu
> -	 * as it comes online
> +		for_each_possible_cpu(cpu) {
> +			struct per_cpu_pageset *pcp = per_cpu_ptr(zone->pageset, cpu);
> +
> +			setup_pageset(pcp, zone_batchsize(zone));
> +
> +			if (percpu_pagelist_fraction)
> +				setup_pagelist_highmark(pcp,
> +					(zone->present_pages /
> +						percpu_pagelist_fraction));
> +		}
> +	}

This would have been easier to review if you left process_zones() where it
was and converted it to the new API. I'm assuming this is just shuffling
code around.

> +
> +	/*
> +	 * The boot cpu is always the first active.
> +	 * The boot node has a processor
>  	 */
> -	err = process_zones(smp_processor_id());
> -	BUG_ON(err);
> +	node_set_state(cpu_to_node(smp_processor_id()), N_CPU);
>  	register_cpu_notifier(&pageset_notifier);
>  }
>  
> -#endif
> -
>  static noinline __init_refok
>  int zone_wait_table_init(struct zone *zone, unsigned long zone_size_pages)
>  {
> @@ -3254,7 +3203,7 @@ static int __zone_pcp_update(void *data)
>  		struct per_cpu_pageset *pset;
>  		struct per_cpu_pages *pcp;
>  
> -		pset = zone_pcp(zone, cpu);
> +		pset = per_cpu_ptr(zone->pageset, cpu);
>  		pcp = &pset->pcp;
>  
>  		local_irq_save(flags);
> @@ -3272,15 +3221,7 @@ void zone_pcp_update(struct zone *zone)
>  
>  /*
>   * Early setup of pagesets.
> - *
> - * In the NUMA case the pageset setup simply results in all zones pcp
> - * pointer being directed at a per cpu pageset with zero batchsize.
> - *
> - * This means that every free and every allocation occurs directly from
> - * the buddy allocator tables.
> - *
> - * The pageset never queues pages during early boot and is therefore usable
> - * for every type of zone.
> + * At this point various allocators are not operational yet.
>   */
>  __meminit void setup_pagesets(void)
>  {
> @@ -3288,23 +3229,15 @@ __meminit void setup_pagesets(void)
>  	struct zone *zone;
>  
>  	for_each_zone(zone) {
> -#ifdef CONFIG_NUMA
> -		unsigned long batch = 0;
> -
> -		for (cpu = 0; cpu < NR_CPUS; cpu++) {
> -			/* Early boot. Slab allocator not functional yet */
> -			zone_pcp(zone, cpu) = &boot_pageset[cpu];
> -		}
> -#else
> -		unsigned long batch = zone_batchsize(zone);
> -#endif
> +		zone->pageset = &per_cpu_var(boot_pageset);
>  
> +		/*
> +		 * Special pagesets with zero elements so that frees
> +		 * and allocations are not buffered at all.
> +		 */
>  		for_each_possible_cpu(cpu)
> -			setup_pageset(zone_pcp(zone, cpu), batch);
> +			setup_pageset(per_cpu_ptr(zone->pageset, cpu), 0);
>  
> -		if (zone->present_pages)
> -			printk(KERN_DEBUG "  %s zone: %lu pages, LIFO batch:%lu\n",
> -				zone->name, zone->present_pages, batch);
>  	}
>  }
>  
> @@ -4818,10 +4751,11 @@ int percpu_pagelist_fraction_sysctl_hand
>  	if (!write || (ret == -EINVAL))
>  		return ret;
>  	for_each_populated_zone(zone) {
> -		for_each_online_cpu(cpu) {
> +		for_each_possible_cpu(cpu) {
>  			unsigned long  high;
>  			high = zone->present_pages / percpu_pagelist_fraction;
> -			setup_pagelist_highmark(zone_pcp(zone, cpu), high);
> +			setup_pagelist_highmark(
> +				per_cpu_ptr(zone->pageset, cpu), high);
>  		}
>  	}
>  	return 0;
> Index: linux-2.6/mm/vmstat.c
> ===================================================================
> --- linux-2.6.orig/mm/vmstat.c	2009-09-29 09:30:25.000000000 -0500
> +++ linux-2.6/mm/vmstat.c	2009-09-29 09:30:43.000000000 -0500
> @@ -139,7 +139,8 @@ static void refresh_zone_stat_thresholds
>  		threshold = calculate_threshold(zone);
>  
>  		for_each_online_cpu(cpu)
> -			zone_pcp(zone, cpu)->stat_threshold = threshold;
> +			per_cpu_ptr(zone->pageset, cpu)->stat_threshold
> +							= threshold;
>  	}
>  }
>  
> @@ -149,7 +150,8 @@ static void refresh_zone_stat_thresholds
>  void __mod_zone_page_state(struct zone *zone, enum zone_stat_item item,
>  				int delta)
>  {
> -	struct per_cpu_pageset *pcp = zone_pcp(zone, smp_processor_id());
> +	struct per_cpu_pageset *pcp = this_cpu_ptr(zone->pageset);
> +
>  	s8 *p = pcp->vm_stat_diff + item;
>  	long x;
>  
> @@ -202,7 +204,7 @@ EXPORT_SYMBOL(mod_zone_page_state);
>   */
>  void __inc_zone_state(struct zone *zone, enum zone_stat_item item)
>  {
> -	struct per_cpu_pageset *pcp = zone_pcp(zone, smp_processor_id());
> +	struct per_cpu_pageset *pcp = this_cpu_ptr(zone->pageset);
>  	s8 *p = pcp->vm_stat_diff + item;
>  
>  	(*p)++;
> @@ -223,7 +225,7 @@ EXPORT_SYMBOL(__inc_zone_page_state);
>  
>  void __dec_zone_state(struct zone *zone, enum zone_stat_item item)
>  {
> -	struct per_cpu_pageset *pcp = zone_pcp(zone, smp_processor_id());
> +	struct per_cpu_pageset *pcp = this_cpu_ptr(zone->pageset);
>  	s8 *p = pcp->vm_stat_diff + item;
>  
>  	(*p)--;
> @@ -300,7 +302,7 @@ void refresh_cpu_vm_stats(int cpu)
>  	for_each_populated_zone(zone) {
>  		struct per_cpu_pageset *p;
>  
> -		p = zone_pcp(zone, cpu);
> +		p = per_cpu_ptr(zone->pageset, cpu);
>  
>  		for (i = 0; i < NR_VM_ZONE_STAT_ITEMS; i++)
>  			if (p->vm_stat_diff[i]) {
> @@ -738,7 +740,7 @@ static void zoneinfo_show_print(struct s
>  	for_each_online_cpu(i) {
>  		struct per_cpu_pageset *pageset;
>  
> -		pageset = zone_pcp(zone, i);
> +		pageset = per_cpu_ptr(zone->pageset, i);
>  		seq_printf(m,
>  			   "\n    cpu: %i"
>  			   "\n              count: %i"
> 
> -- 
> 

-- 
Mel Gorman
Part-time Phd Student                          Linux Technology Center
University of Limerick                         IBM Dublin Software Lab

^ permalink raw reply	[flat|nested] 65+ messages in thread

* Re: [this_cpu_xx V4 13/20] this_cpu_ops: page allocator conversion
  2009-10-02 15:14   ` Mel Gorman
@ 2009-10-02 17:39     ` Christoph Lameter
  2009-10-05  9:45       ` Mel Gorman
  0 siblings, 1 reply; 65+ messages in thread
From: Christoph Lameter @ 2009-10-02 17:39 UTC (permalink / raw)
  To: Mel Gorman; +Cc: akpm, linux-kernel, Tejun Heo, mingo, rusty, Pekka Enberg

On Fri, 2 Oct 2009, Mel Gorman wrote:

> On Thu, Oct 01, 2009 at 05:25:34PM -0400, cl@linux-foundation.org wrote:
> > Use the per cpu allocator functionality to avoid per cpu arrays in struct zone.
> >
> > This drastically reduces the size of struct zone for systems with large
> > amounts of processors and allows placement of critical variables of struct
> > zone in one cacheline even on very large systems.
> >
>
> This seems reasonably accurate. The largest shrink is on !NUMA configured
> systems but the NUMA case deletes a lot of pointers.

True, the !NUMA case will then avoid allocating pagesets for unused
zones. But the NUMA case will have the most benefit since the large arrays
in struct zone are gone. Removing the pagesets from struct zone also
increases the cacheability of struct zone information. This is
particularly useful since the size of the pagesets grew with the addition
of the various types of allocation queues.

> > Another effect is that the pagesets of one processor are placed near one
> > another. If multiple pagesets from different zones fit into one cacheline
> > then additional cacheline fetches can be avoided on the hot paths when
> > allocating memory from multiple zones.
> >
>
> Out of curiousity, how common an occurance is it that a CPU allocate from
> multiple zones? I would have thought it was rare but I never checked
> either.

zone allocations are determined by their use. GFP_KERNEL allocs come from
ZONE_NORMAL whereas typical application pages may come from ZONE_HIGHMEM.
The mix depends on what the kernel and the application are doing.

> >  		pcp = &pset->pcp;
> > +	pcp = &this_cpu_ptr(zone->pageset)->pcp;
> >  	migratetype = get_pageblock_migratetype(page);
> >  	set_page_private(page, migratetype);
> >  	local_irq_save(flags);
> > @@ -1133,7 +1133,6 @@ static void free_hot_cold_page(struct pa
> >
> >  out:
> >  	local_irq_restore(flags);
> > -	put_cpu();
>
> Previously we get_cpu() to be preemption safe. We then disable
> interrupts and potentially take a spinlock later.

Right. WE need to move the local_irq_save() up two lines. Why disable
preempt and two instructions later disable interrupts? Isnt this bloating
the code?

> this_cpu_ptr() looks up PCP
> disable interrupts
> enable interrupts

Move disable interrupts before the this_cpu_ptr?

> > +/*
> > + * Allocate per cpu pagesets and initialize them.
> > + * Before this call only boot pagesets were available.
> > + * Boot pagesets will no longer be used after this call is complete.
>
> If they are no longer used, do we get the memory back?

No we need to keep them for onlining new processors.

> > -	 * as it comes online
> > +		for_each_possible_cpu(cpu) {
> > +			struct per_cpu_pageset *pcp = per_cpu_ptr(zone->pageset, cpu);
> > +
> > +			setup_pageset(pcp, zone_batchsize(zone));
> > +
> > +			if (percpu_pagelist_fraction)
> > +				setup_pagelist_highmark(pcp,
> > +					(zone->present_pages /
> > +						percpu_pagelist_fraction));
> > +		}
> > +	}
>
> This would have been easier to review if you left process_zones() where it
> was and converted it to the new API. I'm assuming this is just shuffling
> code around.

Yes I think this was the result of reducing #ifdefs.


^ permalink raw reply	[flat|nested] 65+ messages in thread

* Re: [this_cpu_xx V4 13/20] this_cpu_ops: page allocator conversion
  2009-10-02 17:39     ` Christoph Lameter
@ 2009-10-05  9:45       ` Mel Gorman
  2009-10-05 14:43         ` Christoph Lameter
  0 siblings, 1 reply; 65+ messages in thread
From: Mel Gorman @ 2009-10-05  9:45 UTC (permalink / raw)
  To: Christoph Lameter
  Cc: akpm, linux-kernel, Tejun Heo, mingo, rusty, Pekka Enberg

On Fri, Oct 02, 2009 at 01:39:28PM -0400, Christoph Lameter wrote:
> On Fri, 2 Oct 2009, Mel Gorman wrote:
> 
> > On Thu, Oct 01, 2009 at 05:25:34PM -0400, cl@linux-foundation.org wrote:
> > > Use the per cpu allocator functionality to avoid per cpu arrays in struct zone.
> > >
> > > This drastically reduces the size of struct zone for systems with large
> > > amounts of processors and allows placement of critical variables of struct
> > > zone in one cacheline even on very large systems.
> > >
> >
> > This seems reasonably accurate. The largest shrink is on !NUMA configured
> > systems but the NUMA case deletes a lot of pointers.
> 
> True, the !NUMA case will then avoid allocating pagesets for unused
> zones. But the NUMA case will have the most benefit since the large arrays
> in struct zone are gone.

Indeed. Out of curiousity, has this patchset been performance tested? I
would expect there to be a small but measurable improvement. If there is
a regression, it might point to poor placement of read/write fields in
the zone.

> Removing the pagesets from struct zone also
> increases the cacheability of struct zone information. This is
> particularly useful since the size of the pagesets grew with the addition
> of the various types of allocation queues.
> 
> > > Another effect is that the pagesets of one processor are placed near one
> > > another. If multiple pagesets from different zones fit into one cacheline
> > > then additional cacheline fetches can be avoided on the hot paths when
> > > allocating memory from multiple zones.
> > >
> >
> > Out of curiousity, how common an occurance is it that a CPU allocate from
> > multiple zones? I would have thought it was rare but I never checked
> > either.
> 
> zone allocations are determined by their use. GFP_KERNEL allocs come from
> ZONE_NORMAL whereas typical application pages may come from ZONE_HIGHMEM.
> The mix depends on what the kernel and the application are doing.
> 

I just wouldn't have expected a significant enough mix to make a measurable
performance difference. It's no biggie.

> > >  		pcp = &pset->pcp;
> > > +	pcp = &this_cpu_ptr(zone->pageset)->pcp;
> > >  	migratetype = get_pageblock_migratetype(page);
> > >  	set_page_private(page, migratetype);
> > >  	local_irq_save(flags);
> > > @@ -1133,7 +1133,6 @@ static void free_hot_cold_page(struct pa
> > >
> > >  out:
> > >  	local_irq_restore(flags);
> > > -	put_cpu();
> >
> > Previously we get_cpu() to be preemption safe. We then disable
> > interrupts and potentially take a spinlock later.
> 
> Right. WE need to move the local_irq_save() up two lines.

Just so I'm 100% clear, IRQ disabling is considered a preemption point?

> Why disable
> preempt and two instructions later disable interrupts? Isnt this bloating
> the code?
> 

By and large, IRQs are disabled at the last possible moment with the minimum
amount of code in between. While the current location does not make perfect
sense, it was probably many small changes that placed it like this each
person avoiding IRQ-disabling for too long without considering what the cost
of get_cpu() was.

Similar care needs to be taken with the other removals of get_cpu() in
this patch to ensure it's still preemption-safe.

> > this_cpu_ptr() looks up PCP
> > disable interrupts
> > enable interrupts
> 
> Move disable interrupts before the this_cpu_ptr?
> 

In this case, why not move this_cpu_ptr() down until its first use just
before the if (cold) check? It'll still be within the IRQ disabling but
without significantly increasing the amount of time the IRQ is disabled.

> > > +/*
> > > + * Allocate per cpu pagesets and initialize them.
> > > + * Before this call only boot pagesets were available.
> > > + * Boot pagesets will no longer be used after this call is complete.
> >
> > If they are no longer used, do we get the memory back?
> 
> No we need to keep them for onlining new processors.
> 

That comment would appear to disagree.

> > > -	 * as it comes online
> > > +		for_each_possible_cpu(cpu) {
> > > +			struct per_cpu_pageset *pcp = per_cpu_ptr(zone->pageset, cpu);
> > > +
> > > +			setup_pageset(pcp, zone_batchsize(zone));
> > > +
> > > +			if (percpu_pagelist_fraction)
> > > +				setup_pagelist_highmark(pcp,
> > > +					(zone->present_pages /
> > > +						percpu_pagelist_fraction));
> > > +		}
> > > +	}
> >
> > This would have been easier to review if you left process_zones() where it
> > was and converted it to the new API. I'm assuming this is just shuffling
> > code around.
> 
> Yes I think this was the result of reducing #ifdefs.
> 

Ok.

-- 
Mel Gorman
Part-time Phd Student                          Linux Technology Center
University of Limerick                         IBM Dublin Software Lab

^ permalink raw reply	[flat|nested] 65+ messages in thread

* Re: [this_cpu_xx V4 13/20] this_cpu_ops: page allocator conversion
  2009-10-05  9:45       ` Mel Gorman
@ 2009-10-05 14:43         ` Christoph Lameter
  2009-10-05 14:55           ` Christoph Lameter
  0 siblings, 1 reply; 65+ messages in thread
From: Christoph Lameter @ 2009-10-05 14:43 UTC (permalink / raw)
  To: Mel Gorman; +Cc: akpm, linux-kernel, Tejun Heo, mingo, rusty, Pekka Enberg

On Mon, 5 Oct 2009, Mel Gorman wrote:

> > Right. WE need to move the local_irq_save() up two lines.
>
> Just so I'm 100% clear, IRQ disabling is considered a preemption point?

Yes.

> > Move disable interrupts before the this_cpu_ptr?
> >
>
> In this case, why not move this_cpu_ptr() down until its first use just
> before the if (cold) check? It'll still be within the IRQ disabling but
> without significantly increasing the amount of time the IRQ is disabled.

Good idea. Ill put that into the next release.

> > > > + * Before this call only boot pagesets were available.
> > > > + * Boot pagesets will no longer be used after this call is complete.
> > >
> > > If they are no longer used, do we get the memory back?
> >
> > No we need to keep them for onlining new processors.
> >
>
> That comment would appear to disagree.

The comment is accurate for a processor. Once the pagesets are allocated
for a processor then the boot pageset is no longer used.

^ permalink raw reply	[flat|nested] 65+ messages in thread

* Re: [this_cpu_xx V4 13/20] this_cpu_ops: page allocator conversion
  2009-10-05 14:43         ` Christoph Lameter
@ 2009-10-05 14:55           ` Christoph Lameter
  2009-10-06  9:45             ` Mel Gorman
  0 siblings, 1 reply; 65+ messages in thread
From: Christoph Lameter @ 2009-10-05 14:55 UTC (permalink / raw)
  To: Mel Gorman; +Cc: akpm, linux-kernel, Tejun Heo, mingo, rusty, Pekka Enberg


Changes to this patch so far:

Index: linux-2.6/mm/page_alloc.c
===================================================================
--- linux-2.6.orig/mm/page_alloc.c	2009-10-05 09:49:07.000000000 -0500
+++ linux-2.6/mm/page_alloc.c	2009-10-05 09:48:43.000000000 -0500
@@ -1098,8 +1098,6 @@ static void free_hot_cold_page(struct pa
 	arch_free_page(page, 0);
 	kernel_map_pages(page, 1, 0);

-	local_irq_save(flags);
-	pcp = &this_cpu_ptr(zone->pageset)->pcp;
 	migratetype = get_pageblock_migratetype(page);
 	set_page_private(page, migratetype);
 	if (unlikely(wasMlocked))
@@ -1121,6 +1119,8 @@ static void free_hot_cold_page(struct pa
 		migratetype = MIGRATE_MOVABLE;
 	}

+	local_irq_save(flags);
+	pcp = &this_cpu_ptr(zone->pageset)->pcp;
 	if (cold)
 		list_add_tail(&page->lru, &pcp->lists[migratetype]);
 	else
@@ -3120,7 +3120,8 @@ static struct notifier_block __cpuinitda
 /*
  * Allocate per cpu pagesets and initialize them.
  * Before this call only boot pagesets were available.
- * Boot pagesets will no longer be used after this call is complete.
+ * Boot pagesets will no longer be used by this processorr
+ * after setup_per_cpu_pageset().
  */
 void __init setup_per_cpu_pageset(void)
 {
@@ -3232,11 +3233,11 @@ __meminit void setup_pagesets(void)
 		zone->pageset = &per_cpu_var(boot_pageset);

 		/*
-		 * Special pagesets with zero elements so that frees
+		 * Special pagesets with one element so that frees
 		 * and allocations are not buffered at all.
 		 */
 		for_each_possible_cpu(cpu)
-			setup_pageset(per_cpu_ptr(zone->pageset, cpu), 0);
+			setup_pageset(per_cpu_ptr(zone->pageset, cpu), 1);

 	}
 }


^ permalink raw reply	[flat|nested] 65+ messages in thread

* Re: [this_cpu_xx V4 13/20] this_cpu_ops: page allocator conversion
  2009-10-05 14:55           ` Christoph Lameter
@ 2009-10-06  9:45             ` Mel Gorman
  2009-10-06 16:34               ` Christoph Lameter
  0 siblings, 1 reply; 65+ messages in thread
From: Mel Gorman @ 2009-10-06  9:45 UTC (permalink / raw)
  To: Christoph Lameter
  Cc: akpm, linux-kernel, Tejun Heo, mingo, rusty, Pekka Enberg

On Mon, Oct 05, 2009 at 10:55:49AM -0400, Christoph Lameter wrote:
> 
> Changes to this patch so far:
> 
> Index: linux-2.6/mm/page_alloc.c
> ===================================================================
> --- linux-2.6.orig/mm/page_alloc.c	2009-10-05 09:49:07.000000000 -0500
> +++ linux-2.6/mm/page_alloc.c	2009-10-05 09:48:43.000000000 -0500
> @@ -1098,8 +1098,6 @@ static void free_hot_cold_page(struct pa
>  	arch_free_page(page, 0);
>  	kernel_map_pages(page, 1, 0);
> 
> -	local_irq_save(flags);
> -	pcp = &this_cpu_ptr(zone->pageset)->pcp;
>  	migratetype = get_pageblock_migratetype(page);
>  	set_page_private(page, migratetype);
>  	if (unlikely(wasMlocked))

Why did you move local_irq_save() ? It should have stayed where it was
because VM counters are updated under the lock. Only the this_cpu_ptr
should be moving.

> @@ -1121,6 +1119,8 @@ static void free_hot_cold_page(struct pa
>  		migratetype = MIGRATE_MOVABLE;
>  	}
> 
> +	local_irq_save(flags);
> +	pcp = &this_cpu_ptr(zone->pageset)->pcp;
>  	if (cold)
>  		list_add_tail(&page->lru, &pcp->lists[migratetype]);
>  	else
> @@ -3120,7 +3120,8 @@ static struct notifier_block __cpuinitda
>  /*
>   * Allocate per cpu pagesets and initialize them.
>   * Before this call only boot pagesets were available.
> - * Boot pagesets will no longer be used after this call is complete.
> + * Boot pagesets will no longer be used by this processorr
> + * after setup_per_cpu_pageset().
>   */
>  void __init setup_per_cpu_pageset(void)
>  {
> @@ -3232,11 +3233,11 @@ __meminit void setup_pagesets(void)
>  		zone->pageset = &per_cpu_var(boot_pageset);
> 
>  		/*
> -		 * Special pagesets with zero elements so that frees
> +		 * Special pagesets with one element so that frees
>  		 * and allocations are not buffered at all.
>  		 */
>  		for_each_possible_cpu(cpu)
> -			setup_pageset(per_cpu_ptr(zone->pageset, cpu), 0);
> +			setup_pageset(per_cpu_ptr(zone->pageset, cpu), 1);
> 
>  	}
>  }
> 

-- 
Mel Gorman
Part-time Phd Student                          Linux Technology Center
University of Limerick                         IBM Dublin Software Lab

^ permalink raw reply	[flat|nested] 65+ messages in thread

* Re: [this_cpu_xx V4 13/20] this_cpu_ops: page allocator conversion
  2009-10-06  9:45             ` Mel Gorman
@ 2009-10-06 16:34               ` Christoph Lameter
  2009-10-06 17:03                 ` Mel Gorman
  0 siblings, 1 reply; 65+ messages in thread
From: Christoph Lameter @ 2009-10-06 16:34 UTC (permalink / raw)
  To: Mel Gorman; +Cc: akpm, linux-kernel, Tejun Heo, mingo, rusty, Pekka Enberg

On Tue, 6 Oct 2009, Mel Gorman wrote:

> > -	local_irq_save(flags);
> > -	pcp = &this_cpu_ptr(zone->pageset)->pcp;
> >  	migratetype = get_pageblock_migratetype(page);
> >  	set_page_private(page, migratetype);
> >  	if (unlikely(wasMlocked))
>
> Why did you move local_irq_save() ? It should have stayed where it was
> because VM counters are updated under the lock. Only the this_cpu_ptr
> should be moving.

The __count_vm_event()? VM counters may be incremented in a racy way if
convenient. x86 usually produces non racy code (and with this patchset
will always produce non racy code) but f.e. IA64 has always had racy
updates. I'd rather shorted the irq off section.

See the comment in vmstat.h.


^ permalink raw reply	[flat|nested] 65+ messages in thread

* Re: [this_cpu_xx V4 13/20] this_cpu_ops: page allocator conversion
  2009-10-06 16:34               ` Christoph Lameter
@ 2009-10-06 17:03                 ` Mel Gorman
  2009-10-06 17:51                   ` Christoph Lameter
  0 siblings, 1 reply; 65+ messages in thread
From: Mel Gorman @ 2009-10-06 17:03 UTC (permalink / raw)
  To: Christoph Lameter
  Cc: akpm, linux-kernel, Tejun Heo, mingo, rusty, Pekka Enberg

On Tue, Oct 06, 2009 at 12:34:56PM -0400, Christoph Lameter wrote:
> On Tue, 6 Oct 2009, Mel Gorman wrote:
> 
> > > -	local_irq_save(flags);
> > > -	pcp = &this_cpu_ptr(zone->pageset)->pcp;
> > >  	migratetype = get_pageblock_migratetype(page);
> > >  	set_page_private(page, migratetype);
> > >  	if (unlikely(wasMlocked))
> >
> > Why did you move local_irq_save() ? It should have stayed where it was
> > because VM counters are updated under the lock. Only the this_cpu_ptr
> > should be moving.
> 
> The __count_vm_event()?

and the __dec_zone_page_state within free_page_mlock(). However, it's already
atomic so it shouldn't be a problem.

> VM counters may be incremented in a racy way if
> convenient. x86 usually produces non racy code (and with this patchset
> will always produce non racy code) but f.e. IA64 has always had racy
> updates. I'd rather shorted the irq off section.
> 

The count_vm_event is now racier than it was and no longer symmetric with
the PGALLOC counting which still happens with IRQs disabled. The assymetry
could look very strange if there are a lot more frees than allocs for example
because the raciness between the counters is difference.

While I have no problem as such with the local_irq_save() moving (although
I would like PGFREE and PGALLOC to be accounted both with or without IRQs
enabled), I think it deserves to be in a patch all to itself and not hidden
in an apparently unrelated change.

> See the comment in vmstat.h.
> 

-- 
Mel Gorman
Part-time Phd Student                          Linux Technology Center
University of Limerick                         IBM Dublin Software Lab

^ permalink raw reply	[flat|nested] 65+ messages in thread

* Re: [this_cpu_xx V4 13/20] this_cpu_ops: page allocator conversion
  2009-10-06 17:03                 ` Mel Gorman
@ 2009-10-06 17:51                   ` Christoph Lameter
  2009-10-06 18:36                     ` Mel Gorman
  0 siblings, 1 reply; 65+ messages in thread
From: Christoph Lameter @ 2009-10-06 17:51 UTC (permalink / raw)
  To: Mel Gorman; +Cc: akpm, linux-kernel, Tejun Heo, mingo, rusty, Pekka Enberg

On Tue, 6 Oct 2009, Mel Gorman wrote:

> While I have no problem as such with the local_irq_save() moving (although
> I would like PGFREE and PGALLOC to be accounted both with or without IRQs
> enabled), I think it deserves to be in a patch all to itself and not hidden
> in an apparently unrelated change.

Ok I have moved the local_irq_save back.

Full patch (will push it back into  my git tree if you approve)

From: Christoph Lameter <cl@linux-foundation.org>
Subject: this_cpu_ops: page allocator conversion

Use the per cpu allocator functionality to avoid per cpu arrays in struct zone.

This drastically reduces the size of struct zone for systems with large
amounts of processors and allows placement of critical variables of struct
zone in one cacheline even on very large systems.

Another effect is that the pagesets of one processor are placed near one
another. If multiple pagesets from different zones fit into one cacheline
then additional cacheline fetches can be avoided on the hot paths when
allocating memory from multiple zones.

Bootstrap becomes simpler if we use the same scheme for UP, SMP, NUMA. #ifdefs
are reduced and we can drop the zone_pcp macro.

Hotplug handling is also simplified since cpu alloc can bring up and
shut down cpu areas for a specific cpu as a whole. So there is no need to
allocate or free individual pagesets.

Cc: Mel Gorman <mel@csn.ul.ie>
Signed-off-by: Christoph Lameter <cl@linux-foundation.org>

---
 include/linux/mm.h     |    4 -
 include/linux/mmzone.h |   12 ---
 mm/page_alloc.c        |  157 ++++++++++++++-----------------------------------
 mm/vmstat.c            |   14 ++--
 4 files changed, 56 insertions(+), 131 deletions(-)

Index: linux-2.6/include/linux/mm.h
===================================================================
--- linux-2.6.orig/include/linux/mm.h	2009-10-06 12:41:19.000000000 -0500
+++ linux-2.6/include/linux/mm.h	2009-10-06 12:41:19.000000000 -0500
@@ -1062,11 +1062,7 @@ extern void si_meminfo_node(struct sysin
 extern int after_bootmem;
 extern void setup_pagesets(void);

-#ifdef CONFIG_NUMA
 extern void setup_per_cpu_pageset(void);
-#else
-static inline void setup_per_cpu_pageset(void) {}
-#endif

 extern void zone_pcp_update(struct zone *zone);

Index: linux-2.6/include/linux/mmzone.h
===================================================================
--- linux-2.6.orig/include/linux/mmzone.h	2009-10-05 15:33:08.000000000 -0500
+++ linux-2.6/include/linux/mmzone.h	2009-10-06 12:41:19.000000000 -0500
@@ -184,13 +184,7 @@ struct per_cpu_pageset {
 	s8 stat_threshold;
 	s8 vm_stat_diff[NR_VM_ZONE_STAT_ITEMS];
 #endif
-} ____cacheline_aligned_in_smp;
-
-#ifdef CONFIG_NUMA
-#define zone_pcp(__z, __cpu) ((__z)->pageset[(__cpu)])
-#else
-#define zone_pcp(__z, __cpu) (&(__z)->pageset[(__cpu)])
-#endif
+};

 #endif /* !__GENERATING_BOUNDS.H */

@@ -306,10 +300,8 @@ struct zone {
 	 */
 	unsigned long		min_unmapped_pages;
 	unsigned long		min_slab_pages;
-	struct per_cpu_pageset	*pageset[NR_CPUS];
-#else
-	struct per_cpu_pageset	pageset[NR_CPUS];
 #endif
+	struct per_cpu_pageset	*pageset;
 	/*
 	 * free areas of different sizes
 	 */
Index: linux-2.6/mm/page_alloc.c
===================================================================
--- linux-2.6.orig/mm/page_alloc.c	2009-10-06 12:41:19.000000000 -0500
+++ linux-2.6/mm/page_alloc.c	2009-10-06 12:43:27.000000000 -0500
@@ -1011,7 +1011,7 @@ static void drain_pages(unsigned int cpu
 		struct per_cpu_pageset *pset;
 		struct per_cpu_pages *pcp;

-		pset = zone_pcp(zone, cpu);
+		pset = per_cpu_ptr(zone->pageset, cpu);

 		pcp = &pset->pcp;
 		local_irq_save(flags);
@@ -1098,7 +1098,6 @@ static void free_hot_cold_page(struct pa
 	arch_free_page(page, 0);
 	kernel_map_pages(page, 1, 0);

-	pcp = &zone_pcp(zone, get_cpu())->pcp;
 	migratetype = get_pageblock_migratetype(page);
 	set_page_private(page, migratetype);
 	local_irq_save(flags);
@@ -1121,6 +1120,7 @@ static void free_hot_cold_page(struct pa
 		migratetype = MIGRATE_MOVABLE;
 	}

+	pcp = &this_cpu_ptr(zone->pageset)->pcp;
 	if (cold)
 		list_add_tail(&page->lru, &pcp->lists[migratetype]);
 	else
@@ -1133,7 +1133,6 @@ static void free_hot_cold_page(struct pa

 out:
 	local_irq_restore(flags);
-	put_cpu();
 }

 void free_hot_page(struct page *page)
@@ -1183,15 +1182,13 @@ struct page *buffered_rmqueue(struct zon
 	unsigned long flags;
 	struct page *page;
 	int cold = !!(gfp_flags & __GFP_COLD);
-	int cpu;

 again:
-	cpu  = get_cpu();
 	if (likely(order == 0)) {
 		struct per_cpu_pages *pcp;
 		struct list_head *list;

-		pcp = &zone_pcp(zone, cpu)->pcp;
+		pcp = &this_cpu_ptr(zone->pageset)->pcp;
 		list = &pcp->lists[migratetype];
 		local_irq_save(flags);
 		if (list_empty(list)) {
@@ -1234,7 +1231,6 @@ again:
 	__count_zone_vm_events(PGALLOC, zone, 1 << order);
 	zone_statistics(preferred_zone, zone);
 	local_irq_restore(flags);
-	put_cpu();

 	VM_BUG_ON(bad_range(zone, page));
 	if (prep_new_page(page, order, gfp_flags))
@@ -1243,7 +1239,6 @@ again:

 failed:
 	local_irq_restore(flags);
-	put_cpu();
 	return NULL;
 }

@@ -2172,7 +2167,7 @@ void show_free_areas(void)
 		for_each_online_cpu(cpu) {
 			struct per_cpu_pageset *pageset;

-			pageset = zone_pcp(zone, cpu);
+			pageset = per_cpu_ptr(zone->pageset, cpu);

 			printk("CPU %4d: hi:%5d, btch:%4d usd:%4d\n",
 			       cpu, pageset->pcp.high,
@@ -3087,7 +3082,6 @@ static void setup_pagelist_highmark(stru
 }


-#ifdef CONFIG_NUMA
 /*
  * Boot pageset table. One per cpu which is going to be used for all
  * zones and all nodes. The parameters will be set in such a way
@@ -3095,112 +3089,68 @@ static void setup_pagelist_highmark(stru
  * the buddy list. This is safe since pageset manipulation is done
  * with interrupts disabled.
  *
- * Some NUMA counter updates may also be caught by the boot pagesets.
- *
- * The boot_pagesets must be kept even after bootup is complete for
- * unused processors and/or zones. They do play a role for bootstrapping
- * hotplugged processors.
+ * Some counter updates may also be caught by the boot pagesets.
  *
  * zoneinfo_show() and maybe other functions do
  * not check if the processor is online before following the pageset pointer.
  * Other parts of the kernel may not check if the zone is available.
  */
-static struct per_cpu_pageset boot_pageset[NR_CPUS];
-
-/*
- * Dynamically allocate memory for the
- * per cpu pageset array in struct zone.
- */
-static int __cpuinit process_zones(int cpu)
-{
-	struct zone *zone, *dzone;
-	int node = cpu_to_node(cpu);
-
-	node_set_state(node, N_CPU);	/* this node has a cpu */
-
-	for_each_populated_zone(zone) {
-		zone_pcp(zone, cpu) = kmalloc_node(sizeof(struct per_cpu_pageset),
-					 GFP_KERNEL, node);
-		if (!zone_pcp(zone, cpu))
-			goto bad;
-
-		setup_pageset(zone_pcp(zone, cpu), zone_batchsize(zone));
-
-		if (percpu_pagelist_fraction)
-			setup_pagelist_highmark(zone_pcp(zone, cpu),
-			 	(zone->present_pages / percpu_pagelist_fraction));
-	}
-
-	return 0;
-bad:
-	for_each_zone(dzone) {
-		if (!populated_zone(dzone))
-			continue;
-		if (dzone == zone)
-			break;
-		kfree(zone_pcp(dzone, cpu));
-		zone_pcp(dzone, cpu) = &boot_pageset[cpu];
-	}
-	return -ENOMEM;
-}
-
-static inline void free_zone_pagesets(int cpu)
-{
-	struct zone *zone;
-
-	for_each_zone(zone) {
-		struct per_cpu_pageset *pset = zone_pcp(zone, cpu);
-
-		/* Free per_cpu_pageset if it is slab allocated */
-		if (pset != &boot_pageset[cpu])
-			kfree(pset);
-		zone_pcp(zone, cpu) = &boot_pageset[cpu];
-	}
-}
+static DEFINE_PER_CPU(struct per_cpu_pageset, boot_pageset);

 static int __cpuinit pageset_cpuup_callback(struct notifier_block *nfb,
 		unsigned long action,
 		void *hcpu)
 {
 	int cpu = (long)hcpu;
-	int ret = NOTIFY_OK;

 	switch (action) {
 	case CPU_UP_PREPARE:
 	case CPU_UP_PREPARE_FROZEN:
-		if (process_zones(cpu))
-			ret = NOTIFY_BAD;
-		break;
-	case CPU_UP_CANCELED:
-	case CPU_UP_CANCELED_FROZEN:
-	case CPU_DEAD:
-	case CPU_DEAD_FROZEN:
-		free_zone_pagesets(cpu);
+		node_set_state(cpu_to_node(cpu), N_CPU);
 		break;
 	default:
 		break;
 	}
-	return ret;
+	return NOTIFY_OK;
 }

 static struct notifier_block __cpuinitdata pageset_notifier =
 	{ &pageset_cpuup_callback, NULL, 0 };

+/*
+ * Allocate per cpu pagesets and initialize them.
+ * Before this call only boot pagesets were available.
+ * Boot pagesets will no longer be used by this processorr
+ * after setup_per_cpu_pageset().
+ */
 void __init setup_per_cpu_pageset(void)
 {
-	int err;
+	struct zone *zone;
+	int cpu;
+
+	for_each_populated_zone(zone) {
+		zone->pageset = alloc_percpu(struct per_cpu_pageset);

-	/* Initialize per_cpu_pageset for cpu 0.
-	 * A cpuup callback will do this for every cpu
-	 * as it comes online
+		for_each_possible_cpu(cpu) {
+			struct per_cpu_pageset *pcp = per_cpu_ptr(zone->pageset, cpu);
+
+			setup_pageset(pcp, zone_batchsize(zone));
+
+			if (percpu_pagelist_fraction)
+				setup_pagelist_highmark(pcp,
+					(zone->present_pages /
+						percpu_pagelist_fraction));
+		}
+	}
+
+	/*
+	 * The boot cpu is always the first active.
+	 * The boot node has a processor
 	 */
-	err = process_zones(smp_processor_id());
-	BUG_ON(err);
+	node_set_state(cpu_to_node(smp_processor_id()), N_CPU);
 	register_cpu_notifier(&pageset_notifier);
 }

-#endif
-
 static noinline __init_refok
 int zone_wait_table_init(struct zone *zone, unsigned long zone_size_pages)
 {
@@ -3254,7 +3204,7 @@ static int __zone_pcp_update(void *data)
 		struct per_cpu_pageset *pset;
 		struct per_cpu_pages *pcp;

-		pset = zone_pcp(zone, cpu);
+		pset = per_cpu_ptr(zone->pageset, cpu);
 		pcp = &pset->pcp;

 		local_irq_save(flags);
@@ -3272,15 +3222,7 @@ void zone_pcp_update(struct zone *zone)

 /*
  * Early setup of pagesets.
- *
- * In the NUMA case the pageset setup simply results in all zones pcp
- * pointer being directed at a per cpu pageset with zero batchsize.
- *
- * This means that every free and every allocation occurs directly from
- * the buddy allocator tables.
- *
- * The pageset never queues pages during early boot and is therefore usable
- * for every type of zone.
+ * At this point various allocators are not operational yet.
  */
 __meminit void setup_pagesets(void)
 {
@@ -3288,23 +3230,15 @@ __meminit void setup_pagesets(void)
 	struct zone *zone;

 	for_each_zone(zone) {
-#ifdef CONFIG_NUMA
-		unsigned long batch = 0;
-
-		for (cpu = 0; cpu < NR_CPUS; cpu++) {
-			/* Early boot. Slab allocator not functional yet */
-			zone_pcp(zone, cpu) = &boot_pageset[cpu];
-		}
-#else
-		unsigned long batch = zone_batchsize(zone);
-#endif
+		zone->pageset = &per_cpu_var(boot_pageset);

+		/*
+		 * Special pagesets with one element so that frees
+		 * and allocations are not buffered at all.
+		 */
 		for_each_possible_cpu(cpu)
-			setup_pageset(zone_pcp(zone, cpu), batch);
+			setup_pageset(per_cpu_ptr(zone->pageset, cpu), 1);

-		if (zone->present_pages)
-			printk(KERN_DEBUG "  %s zone: %lu pages, LIFO batch:%lu\n",
-				zone->name, zone->present_pages, batch);
 	}
 }

@@ -4818,10 +4752,11 @@ int percpu_pagelist_fraction_sysctl_hand
 	if (!write || (ret == -EINVAL))
 		return ret;
 	for_each_populated_zone(zone) {
-		for_each_online_cpu(cpu) {
+		for_each_possible_cpu(cpu) {
 			unsigned long  high;
 			high = zone->present_pages / percpu_pagelist_fraction;
-			setup_pagelist_highmark(zone_pcp(zone, cpu), high);
+			setup_pagelist_highmark(
+				per_cpu_ptr(zone->pageset, cpu), high);
 		}
 	}
 	return 0;
Index: linux-2.6/mm/vmstat.c
===================================================================
--- linux-2.6.orig/mm/vmstat.c	2009-10-05 15:33:08.000000000 -0500
+++ linux-2.6/mm/vmstat.c	2009-10-06 12:43:22.000000000 -0500
@@ -139,7 +139,8 @@ static void refresh_zone_stat_thresholds
 		threshold = calculate_threshold(zone);

 		for_each_online_cpu(cpu)
-			zone_pcp(zone, cpu)->stat_threshold = threshold;
+			per_cpu_ptr(zone->pageset, cpu)->stat_threshold
+							= threshold;
 	}
 }

@@ -149,7 +150,8 @@ static void refresh_zone_stat_thresholds
 void __mod_zone_page_state(struct zone *zone, enum zone_stat_item item,
 				int delta)
 {
-	struct per_cpu_pageset *pcp = zone_pcp(zone, smp_processor_id());
+	struct per_cpu_pageset *pcp = this_cpu_ptr(zone->pageset);
+
 	s8 *p = pcp->vm_stat_diff + item;
 	long x;

@@ -202,7 +204,7 @@ EXPORT_SYMBOL(mod_zone_page_state);
  */
 void __inc_zone_state(struct zone *zone, enum zone_stat_item item)
 {
-	struct per_cpu_pageset *pcp = zone_pcp(zone, smp_processor_id());
+	struct per_cpu_pageset *pcp = this_cpu_ptr(zone->pageset);
 	s8 *p = pcp->vm_stat_diff + item;

 	(*p)++;
@@ -223,7 +225,7 @@ EXPORT_SYMBOL(__inc_zone_page_state);

 void __dec_zone_state(struct zone *zone, enum zone_stat_item item)
 {
-	struct per_cpu_pageset *pcp = zone_pcp(zone, smp_processor_id());
+	struct per_cpu_pageset *pcp = this_cpu_ptr(zone->pageset);
 	s8 *p = pcp->vm_stat_diff + item;

 	(*p)--;
@@ -300,7 +302,7 @@ void refresh_cpu_vm_stats(int cpu)
 	for_each_populated_zone(zone) {
 		struct per_cpu_pageset *p;

-		p = zone_pcp(zone, cpu);
+		p = per_cpu_ptr(zone->pageset, cpu);

 		for (i = 0; i < NR_VM_ZONE_STAT_ITEMS; i++)
 			if (p->vm_stat_diff[i]) {
@@ -738,7 +740,7 @@ static void zoneinfo_show_print(struct s
 	for_each_online_cpu(i) {
 		struct per_cpu_pageset *pageset;

-		pageset = zone_pcp(zone, i);
+		pageset = per_cpu_ptr(zone->pageset, i);
 		seq_printf(m,
 			   "\n    cpu: %i"
 			   "\n              count: %i"

^ permalink raw reply	[flat|nested] 65+ messages in thread

* Re: [this_cpu_xx V4 13/20] this_cpu_ops: page allocator conversion
  2009-10-06 17:51                   ` Christoph Lameter
@ 2009-10-06 18:36                     ` Mel Gorman
  2009-10-06 19:06                       ` Christoph Lameter
  0 siblings, 1 reply; 65+ messages in thread
From: Mel Gorman @ 2009-10-06 18:36 UTC (permalink / raw)
  To: Christoph Lameter
  Cc: akpm, linux-kernel, Tejun Heo, mingo, rusty, Pekka Enberg

On Tue, Oct 06, 2009 at 01:51:38PM -0400, Christoph Lameter wrote:
> On Tue, 6 Oct 2009, Mel Gorman wrote:
> 
> > While I have no problem as such with the local_irq_save() moving (although
> > I would like PGFREE and PGALLOC to be accounted both with or without IRQs
> > enabled), I think it deserves to be in a patch all to itself and not hidden
> > in an apparently unrelated change.
> 
> Ok I have moved the local_irq_save back.
> 

Thanks.

> Full patch (will push it back into  my git tree if you approve)
> 

Few more comments I'm afraid :(

> From: Christoph Lameter <cl@linux-foundation.org>
> Subject: this_cpu_ops: page allocator conversion
> 
> Use the per cpu allocator functionality to avoid per cpu arrays in struct zone.
> 
> This drastically reduces the size of struct zone for systems with large
> amounts of processors and allows placement of critical variables of struct
> zone in one cacheline even on very large systems.
> 
> Another effect is that the pagesets of one processor are placed near one
> another. If multiple pagesets from different zones fit into one cacheline
> then additional cacheline fetches can be avoided on the hot paths when
> allocating memory from multiple zones.
> 
> Bootstrap becomes simpler if we use the same scheme for UP, SMP, NUMA. #ifdefs
> are reduced and we can drop the zone_pcp macro.
> 
> Hotplug handling is also simplified since cpu alloc can bring up and
> shut down cpu areas for a specific cpu as a whole. So there is no need to
> allocate or free individual pagesets.
> 
> Cc: Mel Gorman <mel@csn.ul.ie>
> Signed-off-by: Christoph Lameter <cl@linux-foundation.org>
> 
> ---
>  include/linux/mm.h     |    4 -
>  include/linux/mmzone.h |   12 ---
>  mm/page_alloc.c        |  157 ++++++++++++++-----------------------------------
>  mm/vmstat.c            |   14 ++--
>  4 files changed, 56 insertions(+), 131 deletions(-)
> 
> Index: linux-2.6/include/linux/mm.h
> ===================================================================
> --- linux-2.6.orig/include/linux/mm.h	2009-10-06 12:41:19.000000000 -0500
> +++ linux-2.6/include/linux/mm.h	2009-10-06 12:41:19.000000000 -0500
> @@ -1062,11 +1062,7 @@ extern void si_meminfo_node(struct sysin
>  extern int after_bootmem;
>  extern void setup_pagesets(void);
> 
> -#ifdef CONFIG_NUMA
>  extern void setup_per_cpu_pageset(void);
> -#else
> -static inline void setup_per_cpu_pageset(void) {}
> -#endif
> 
>  extern void zone_pcp_update(struct zone *zone);
> 
> Index: linux-2.6/include/linux/mmzone.h
> ===================================================================
> --- linux-2.6.orig/include/linux/mmzone.h	2009-10-05 15:33:08.000000000 -0500
> +++ linux-2.6/include/linux/mmzone.h	2009-10-06 12:41:19.000000000 -0500
> @@ -184,13 +184,7 @@ struct per_cpu_pageset {
>  	s8 stat_threshold;
>  	s8 vm_stat_diff[NR_VM_ZONE_STAT_ITEMS];
>  #endif
> -} ____cacheline_aligned_in_smp;
> -
> -#ifdef CONFIG_NUMA
> -#define zone_pcp(__z, __cpu) ((__z)->pageset[(__cpu)])
> -#else
> -#define zone_pcp(__z, __cpu) (&(__z)->pageset[(__cpu)])
> -#endif
> +};
> 
>  #endif /* !__GENERATING_BOUNDS.H */
> 
> @@ -306,10 +300,8 @@ struct zone {
>  	 */
>  	unsigned long		min_unmapped_pages;
>  	unsigned long		min_slab_pages;
> -	struct per_cpu_pageset	*pageset[NR_CPUS];
> -#else
> -	struct per_cpu_pageset	pageset[NR_CPUS];
>  #endif
> +	struct per_cpu_pageset	*pageset;
>  	/*
>  	 * free areas of different sizes
>  	 */
> Index: linux-2.6/mm/page_alloc.c
> ===================================================================
> --- linux-2.6.orig/mm/page_alloc.c	2009-10-06 12:41:19.000000000 -0500
> +++ linux-2.6/mm/page_alloc.c	2009-10-06 12:43:27.000000000 -0500
> @@ -1011,7 +1011,7 @@ static void drain_pages(unsigned int cpu
>  		struct per_cpu_pageset *pset;
>  		struct per_cpu_pages *pcp;
> 
> -		pset = zone_pcp(zone, cpu);
> +		pset = per_cpu_ptr(zone->pageset, cpu);
> 
>  		pcp = &pset->pcp;
>  		local_irq_save(flags);

It's not your fault and it doesn't actually matter to the current callers
of drain_pages, but you might as well move the per_cpu_ptr inside the
local_irq_save() here as well while you're changing here.

> @@ -1098,7 +1098,6 @@ static void free_hot_cold_page(struct pa
>  	arch_free_page(page, 0);
>  	kernel_map_pages(page, 1, 0);
> 
> -	pcp = &zone_pcp(zone, get_cpu())->pcp;
>  	migratetype = get_pageblock_migratetype(page);
>  	set_page_private(page, migratetype);
>  	local_irq_save(flags);
> @@ -1121,6 +1120,7 @@ static void free_hot_cold_page(struct pa
>  		migratetype = MIGRATE_MOVABLE;
>  	}
> 
> +	pcp = &this_cpu_ptr(zone->pageset)->pcp;
>  	if (cold)
>  		list_add_tail(&page->lru, &pcp->lists[migratetype]);
>  	else
> @@ -1133,7 +1133,6 @@ static void free_hot_cold_page(struct pa
> 
>  out:
>  	local_irq_restore(flags);
> -	put_cpu();
>  }
> 
>  void free_hot_page(struct page *page)
> @@ -1183,15 +1182,13 @@ struct page *buffered_rmqueue(struct zon
>  	unsigned long flags;
>  	struct page *page;
>  	int cold = !!(gfp_flags & __GFP_COLD);
> -	int cpu;
> 
>  again:
> -	cpu  = get_cpu();
>  	if (likely(order == 0)) {
>  		struct per_cpu_pages *pcp;
>  		struct list_head *list;
> 
> -		pcp = &zone_pcp(zone, cpu)->pcp;
> +		pcp = &this_cpu_ptr(zone->pageset)->pcp;
>  		list = &pcp->lists[migratetype];
>  		local_irq_save(flags);

I believe this falls foul of the same problem as in the free path. We
are no longer preempt safe and this_cpu_ptr() needs to move within the
local_irq_save().

I didn't spot anything out of the ordinary after this but I haven't tested
the series.

>  		if (list_empty(list)) {
> @@ -1234,7 +1231,6 @@ again:
>  	__count_zone_vm_events(PGALLOC, zone, 1 << order);
>  	zone_statistics(preferred_zone, zone);
>  	local_irq_restore(flags);
> -	put_cpu();
> 
>  	VM_BUG_ON(bad_range(zone, page));
>  	if (prep_new_page(page, order, gfp_flags))
> @@ -1243,7 +1239,6 @@ again:
> 
>  failed:
>  	local_irq_restore(flags);
> -	put_cpu();
>  	return NULL;
>  }
> 
> @@ -2172,7 +2167,7 @@ void show_free_areas(void)
>  		for_each_online_cpu(cpu) {
>  			struct per_cpu_pageset *pageset;
> 
> -			pageset = zone_pcp(zone, cpu);
> +			pageset = per_cpu_ptr(zone->pageset, cpu);
> 
>  			printk("CPU %4d: hi:%5d, btch:%4d usd:%4d\n",
>  			       cpu, pageset->pcp.high,
> @@ -3087,7 +3082,6 @@ static void setup_pagelist_highmark(stru
>  }
> 
> 
> -#ifdef CONFIG_NUMA
>  /*
>   * Boot pageset table. One per cpu which is going to be used for all
>   * zones and all nodes. The parameters will be set in such a way
> @@ -3095,112 +3089,68 @@ static void setup_pagelist_highmark(stru
>   * the buddy list. This is safe since pageset manipulation is done
>   * with interrupts disabled.
>   *
> - * Some NUMA counter updates may also be caught by the boot pagesets.
> - *
> - * The boot_pagesets must be kept even after bootup is complete for
> - * unused processors and/or zones. They do play a role for bootstrapping
> - * hotplugged processors.
> + * Some counter updates may also be caught by the boot pagesets.
>   *
>   * zoneinfo_show() and maybe other functions do
>   * not check if the processor is online before following the pageset pointer.
>   * Other parts of the kernel may not check if the zone is available.
>   */
> -static struct per_cpu_pageset boot_pageset[NR_CPUS];
> -
> -/*
> - * Dynamically allocate memory for the
> - * per cpu pageset array in struct zone.
> - */
> -static int __cpuinit process_zones(int cpu)
> -{
> -	struct zone *zone, *dzone;
> -	int node = cpu_to_node(cpu);
> -
> -	node_set_state(node, N_CPU);	/* this node has a cpu */
> -
> -	for_each_populated_zone(zone) {
> -		zone_pcp(zone, cpu) = kmalloc_node(sizeof(struct per_cpu_pageset),
> -					 GFP_KERNEL, node);
> -		if (!zone_pcp(zone, cpu))
> -			goto bad;
> -
> -		setup_pageset(zone_pcp(zone, cpu), zone_batchsize(zone));
> -
> -		if (percpu_pagelist_fraction)
> -			setup_pagelist_highmark(zone_pcp(zone, cpu),
> -			 	(zone->present_pages / percpu_pagelist_fraction));
> -	}
> -
> -	return 0;
> -bad:
> -	for_each_zone(dzone) {
> -		if (!populated_zone(dzone))
> -			continue;
> -		if (dzone == zone)
> -			break;
> -		kfree(zone_pcp(dzone, cpu));
> -		zone_pcp(dzone, cpu) = &boot_pageset[cpu];
> -	}
> -	return -ENOMEM;
> -}
> -
> -static inline void free_zone_pagesets(int cpu)
> -{
> -	struct zone *zone;
> -
> -	for_each_zone(zone) {
> -		struct per_cpu_pageset *pset = zone_pcp(zone, cpu);
> -
> -		/* Free per_cpu_pageset if it is slab allocated */
> -		if (pset != &boot_pageset[cpu])
> -			kfree(pset);
> -		zone_pcp(zone, cpu) = &boot_pageset[cpu];
> -	}
> -}
> +static DEFINE_PER_CPU(struct per_cpu_pageset, boot_pageset);
> 
>  static int __cpuinit pageset_cpuup_callback(struct notifier_block *nfb,
>  		unsigned long action,
>  		void *hcpu)
>  {
>  	int cpu = (long)hcpu;
> -	int ret = NOTIFY_OK;
> 
>  	switch (action) {
>  	case CPU_UP_PREPARE:
>  	case CPU_UP_PREPARE_FROZEN:
> -		if (process_zones(cpu))
> -			ret = NOTIFY_BAD;
> -		break;
> -	case CPU_UP_CANCELED:
> -	case CPU_UP_CANCELED_FROZEN:
> -	case CPU_DEAD:
> -	case CPU_DEAD_FROZEN:
> -		free_zone_pagesets(cpu);
> +		node_set_state(cpu_to_node(cpu), N_CPU);
>  		break;
>  	default:
>  		break;
>  	}
> -	return ret;
> +	return NOTIFY_OK;
>  }
> 
>  static struct notifier_block __cpuinitdata pageset_notifier =
>  	{ &pageset_cpuup_callback, NULL, 0 };
> 
> +/*
> + * Allocate per cpu pagesets and initialize them.
> + * Before this call only boot pagesets were available.
> + * Boot pagesets will no longer be used by this processorr
> + * after setup_per_cpu_pageset().
> + */
>  void __init setup_per_cpu_pageset(void)
>  {
> -	int err;
> +	struct zone *zone;
> +	int cpu;
> +
> +	for_each_populated_zone(zone) {
> +		zone->pageset = alloc_percpu(struct per_cpu_pageset);
> 
> -	/* Initialize per_cpu_pageset for cpu 0.
> -	 * A cpuup callback will do this for every cpu
> -	 * as it comes online
> +		for_each_possible_cpu(cpu) {
> +			struct per_cpu_pageset *pcp = per_cpu_ptr(zone->pageset, cpu);
> +
> +			setup_pageset(pcp, zone_batchsize(zone));
> +
> +			if (percpu_pagelist_fraction)
> +				setup_pagelist_highmark(pcp,
> +					(zone->present_pages /
> +						percpu_pagelist_fraction));
> +		}
> +	}
> +
> +	/*
> +	 * The boot cpu is always the first active.
> +	 * The boot node has a processor
>  	 */
> -	err = process_zones(smp_processor_id());
> -	BUG_ON(err);
> +	node_set_state(cpu_to_node(smp_processor_id()), N_CPU);
>  	register_cpu_notifier(&pageset_notifier);
>  }
> 
> -#endif
> -
>  static noinline __init_refok
>  int zone_wait_table_init(struct zone *zone, unsigned long zone_size_pages)
>  {
> @@ -3254,7 +3204,7 @@ static int __zone_pcp_update(void *data)
>  		struct per_cpu_pageset *pset;
>  		struct per_cpu_pages *pcp;
> 
> -		pset = zone_pcp(zone, cpu);
> +		pset = per_cpu_ptr(zone->pageset, cpu);
>  		pcp = &pset->pcp;
> 
>  		local_irq_save(flags);
> @@ -3272,15 +3222,7 @@ void zone_pcp_update(struct zone *zone)
> 
>  /*
>   * Early setup of pagesets.
> - *
> - * In the NUMA case the pageset setup simply results in all zones pcp
> - * pointer being directed at a per cpu pageset with zero batchsize.
> - *
> - * This means that every free and every allocation occurs directly from
> - * the buddy allocator tables.
> - *
> - * The pageset never queues pages during early boot and is therefore usable
> - * for every type of zone.
> + * At this point various allocators are not operational yet.
>   */
>  __meminit void setup_pagesets(void)
>  {
> @@ -3288,23 +3230,15 @@ __meminit void setup_pagesets(void)
>  	struct zone *zone;
> 
>  	for_each_zone(zone) {
> -#ifdef CONFIG_NUMA
> -		unsigned long batch = 0;
> -
> -		for (cpu = 0; cpu < NR_CPUS; cpu++) {
> -			/* Early boot. Slab allocator not functional yet */
> -			zone_pcp(zone, cpu) = &boot_pageset[cpu];
> -		}
> -#else
> -		unsigned long batch = zone_batchsize(zone);
> -#endif
> +		zone->pageset = &per_cpu_var(boot_pageset);
> 
> +		/*
> +		 * Special pagesets with one element so that frees
> +		 * and allocations are not buffered at all.
> +		 */
>  		for_each_possible_cpu(cpu)
> -			setup_pageset(zone_pcp(zone, cpu), batch);
> +			setup_pageset(per_cpu_ptr(zone->pageset, cpu), 1);
> 
> -		if (zone->present_pages)
> -			printk(KERN_DEBUG "  %s zone: %lu pages, LIFO batch:%lu\n",
> -				zone->name, zone->present_pages, batch);
>  	}
>  }
> 
> @@ -4818,10 +4752,11 @@ int percpu_pagelist_fraction_sysctl_hand
>  	if (!write || (ret == -EINVAL))
>  		return ret;
>  	for_each_populated_zone(zone) {
> -		for_each_online_cpu(cpu) {
> +		for_each_possible_cpu(cpu) {
>  			unsigned long  high;
>  			high = zone->present_pages / percpu_pagelist_fraction;
> -			setup_pagelist_highmark(zone_pcp(zone, cpu), high);
> +			setup_pagelist_highmark(
> +				per_cpu_ptr(zone->pageset, cpu), high);
>  		}
>  	}
>  	return 0;
> Index: linux-2.6/mm/vmstat.c
> ===================================================================
> --- linux-2.6.orig/mm/vmstat.c	2009-10-05 15:33:08.000000000 -0500
> +++ linux-2.6/mm/vmstat.c	2009-10-06 12:43:22.000000000 -0500
> @@ -139,7 +139,8 @@ static void refresh_zone_stat_thresholds
>  		threshold = calculate_threshold(zone);
> 
>  		for_each_online_cpu(cpu)
> -			zone_pcp(zone, cpu)->stat_threshold = threshold;
> +			per_cpu_ptr(zone->pageset, cpu)->stat_threshold
> +							= threshold;
>  	}
>  }
> 
> @@ -149,7 +150,8 @@ static void refresh_zone_stat_thresholds
>  void __mod_zone_page_state(struct zone *zone, enum zone_stat_item item,
>  				int delta)
>  {
> -	struct per_cpu_pageset *pcp = zone_pcp(zone, smp_processor_id());
> +	struct per_cpu_pageset *pcp = this_cpu_ptr(zone->pageset);
> +
>  	s8 *p = pcp->vm_stat_diff + item;
>  	long x;
> 
> @@ -202,7 +204,7 @@ EXPORT_SYMBOL(mod_zone_page_state);
>   */
>  void __inc_zone_state(struct zone *zone, enum zone_stat_item item)
>  {
> -	struct per_cpu_pageset *pcp = zone_pcp(zone, smp_processor_id());
> +	struct per_cpu_pageset *pcp = this_cpu_ptr(zone->pageset);
>  	s8 *p = pcp->vm_stat_diff + item;
> 
>  	(*p)++;
> @@ -223,7 +225,7 @@ EXPORT_SYMBOL(__inc_zone_page_state);
> 
>  void __dec_zone_state(struct zone *zone, enum zone_stat_item item)
>  {
> -	struct per_cpu_pageset *pcp = zone_pcp(zone, smp_processor_id());
> +	struct per_cpu_pageset *pcp = this_cpu_ptr(zone->pageset);
>  	s8 *p = pcp->vm_stat_diff + item;
> 
>  	(*p)--;
> @@ -300,7 +302,7 @@ void refresh_cpu_vm_stats(int cpu)
>  	for_each_populated_zone(zone) {
>  		struct per_cpu_pageset *p;
> 
> -		p = zone_pcp(zone, cpu);
> +		p = per_cpu_ptr(zone->pageset, cpu);
> 
>  		for (i = 0; i < NR_VM_ZONE_STAT_ITEMS; i++)
>  			if (p->vm_stat_diff[i]) {
> @@ -738,7 +740,7 @@ static void zoneinfo_show_print(struct s
>  	for_each_online_cpu(i) {
>  		struct per_cpu_pageset *pageset;
> 
> -		pageset = zone_pcp(zone, i);
> +		pageset = per_cpu_ptr(zone->pageset, i);
>  		seq_printf(m,
>  			   "\n    cpu: %i"
>  			   "\n              count: %i"
> 

-- 
Mel Gorman
Part-time Phd Student                          Linux Technology Center
University of Limerick                         IBM Dublin Software Lab

^ permalink raw reply	[flat|nested] 65+ messages in thread

* Re: [this_cpu_xx V4 13/20] this_cpu_ops: page allocator conversion
  2009-10-06 18:36                     ` Mel Gorman
@ 2009-10-06 19:06                       ` Christoph Lameter
  2009-10-07 10:42                         ` Mel Gorman
  0 siblings, 1 reply; 65+ messages in thread
From: Christoph Lameter @ 2009-10-06 19:06 UTC (permalink / raw)
  To: Mel Gorman; +Cc: akpm, linux-kernel, Tejun Heo, mingo, rusty, Pekka Enberg

On Tue, 6 Oct 2009, Mel Gorman wrote:

> > --- linux-2.6.orig/mm/page_alloc.c	2009-10-06 12:41:19.000000000 -0500
> > +++ linux-2.6/mm/page_alloc.c	2009-10-06 12:43:27.000000000 -0500
> > @@ -1011,7 +1011,7 @@ static void drain_pages(unsigned int cpu
> >  		struct per_cpu_pageset *pset;
> >  		struct per_cpu_pages *pcp;
> >
> > -		pset = zone_pcp(zone, cpu);
> > +		pset = per_cpu_ptr(zone->pageset, cpu);
> >
> >  		pcp = &pset->pcp;
> >  		local_irq_save(flags);
>
> It's not your fault and it doesn't actually matter to the current callers
> of drain_pages, but you might as well move the per_cpu_ptr inside the
> local_irq_save() here as well while you're changing here.

The comments before drain_pages() clearly state that the caller must be
pinned to a processor. But lets change it for consistencies sake.

> > -	cpu  = get_cpu();
> >  	if (likely(order == 0)) {
> >  		struct per_cpu_pages *pcp;
> >  		struct list_head *list;
> >
> > -		pcp = &zone_pcp(zone, cpu)->pcp;
> > +		pcp = &this_cpu_ptr(zone->pageset)->pcp;
> >  		list = &pcp->lists[migratetype];
> >  		local_irq_save(flags);
>
> I believe this falls foul of the same problem as in the free path. We
> are no longer preempt safe and this_cpu_ptr() needs to move within the
> local_irq_save().

Ok.

From: Christoph Lameter <cl@linux-foundation.org>
Subject: this_cpu_ops: page allocator conversion

Use the per cpu allocator functionality to avoid per cpu arrays in struct zone.

This drastically reduces the size of struct zone for systems with large
amounts of processors and allows placement of critical variables of struct
zone in one cacheline even on very large systems.

Another effect is that the pagesets of one processor are placed near one
another. If multiple pagesets from different zones fit into one cacheline
then additional cacheline fetches can be avoided on the hot paths when
allocating memory from multiple zones.

Bootstrap becomes simpler if we use the same scheme for UP, SMP, NUMA. #ifdefs
are reduced and we can drop the zone_pcp macro.

Hotplug handling is also simplified since cpu alloc can bring up and
shut down cpu areas for a specific cpu as a whole. So there is no need to
allocate or free individual pagesets.

Cc: Mel Gorman <mel@csn.ul.ie>
Signed-off-by: Christoph Lameter <cl@linux-foundation.org>

---
 include/linux/mm.h     |    4 -
 include/linux/mmzone.h |   12 ---
 mm/page_alloc.c        |  161 ++++++++++++++-----------------------------------
 mm/vmstat.c            |   14 ++--
 4 files changed, 58 insertions(+), 133 deletions(-)

Index: linux-2.6/include/linux/mm.h
===================================================================
--- linux-2.6.orig/include/linux/mm.h	2009-10-06 13:54:25.000000000 -0500
+++ linux-2.6/include/linux/mm.h	2009-10-06 13:54:25.000000000 -0500
@@ -1062,11 +1062,7 @@ extern void si_meminfo_node(struct sysin
 extern int after_bootmem;
 extern void setup_pagesets(void);

-#ifdef CONFIG_NUMA
 extern void setup_per_cpu_pageset(void);
-#else
-static inline void setup_per_cpu_pageset(void) {}
-#endif

 extern void zone_pcp_update(struct zone *zone);

Index: linux-2.6/include/linux/mmzone.h
===================================================================
--- linux-2.6.orig/include/linux/mmzone.h	2009-10-06 12:48:46.000000000 -0500
+++ linux-2.6/include/linux/mmzone.h	2009-10-06 13:54:25.000000000 -0500
@@ -184,13 +184,7 @@ struct per_cpu_pageset {
 	s8 stat_threshold;
 	s8 vm_stat_diff[NR_VM_ZONE_STAT_ITEMS];
 #endif
-} ____cacheline_aligned_in_smp;
-
-#ifdef CONFIG_NUMA
-#define zone_pcp(__z, __cpu) ((__z)->pageset[(__cpu)])
-#else
-#define zone_pcp(__z, __cpu) (&(__z)->pageset[(__cpu)])
-#endif
+};

 #endif /* !__GENERATING_BOUNDS.H */

@@ -306,10 +300,8 @@ struct zone {
 	 */
 	unsigned long		min_unmapped_pages;
 	unsigned long		min_slab_pages;
-	struct per_cpu_pageset	*pageset[NR_CPUS];
-#else
-	struct per_cpu_pageset	pageset[NR_CPUS];
 #endif
+	struct per_cpu_pageset	*pageset;
 	/*
 	 * free areas of different sizes
 	 */
Index: linux-2.6/mm/page_alloc.c
===================================================================
--- linux-2.6.orig/mm/page_alloc.c	2009-10-06 13:54:25.000000000 -0500
+++ linux-2.6/mm/page_alloc.c	2009-10-06 13:59:27.000000000 -0500
@@ -1011,10 +1011,10 @@ static void drain_pages(unsigned int cpu
 		struct per_cpu_pageset *pset;
 		struct per_cpu_pages *pcp;

-		pset = zone_pcp(zone, cpu);
+		local_irq_save(flags);
+		pset = per_cpu_ptr(zone->pageset, cpu);

 		pcp = &pset->pcp;
-		local_irq_save(flags);
 		free_pcppages_bulk(zone, pcp->count, pcp);
 		pcp->count = 0;
 		local_irq_restore(flags);
@@ -1098,7 +1098,6 @@ static void free_hot_cold_page(struct pa
 	arch_free_page(page, 0);
 	kernel_map_pages(page, 1, 0);

-	pcp = &zone_pcp(zone, get_cpu())->pcp;
 	migratetype = get_pageblock_migratetype(page);
 	set_page_private(page, migratetype);
 	local_irq_save(flags);
@@ -1121,6 +1120,7 @@ static void free_hot_cold_page(struct pa
 		migratetype = MIGRATE_MOVABLE;
 	}

+	pcp = &this_cpu_ptr(zone->pageset)->pcp;
 	if (cold)
 		list_add_tail(&page->lru, &pcp->lists[migratetype]);
 	else
@@ -1133,7 +1133,6 @@ static void free_hot_cold_page(struct pa

 out:
 	local_irq_restore(flags);
-	put_cpu();
 }

 void free_hot_page(struct page *page)
@@ -1183,17 +1182,15 @@ struct page *buffered_rmqueue(struct zon
 	unsigned long flags;
 	struct page *page;
 	int cold = !!(gfp_flags & __GFP_COLD);
-	int cpu;

 again:
-	cpu  = get_cpu();
 	if (likely(order == 0)) {
 		struct per_cpu_pages *pcp;
 		struct list_head *list;

-		pcp = &zone_pcp(zone, cpu)->pcp;
-		list = &pcp->lists[migratetype];
 		local_irq_save(flags);
+		pcp = &this_cpu_ptr(zone->pageset)->pcp;
+		list = &pcp->lists[migratetype];
 		if (list_empty(list)) {
 			pcp->count += rmqueue_bulk(zone, 0,
 					pcp->batch, list,
@@ -1234,7 +1231,6 @@ again:
 	__count_zone_vm_events(PGALLOC, zone, 1 << order);
 	zone_statistics(preferred_zone, zone);
 	local_irq_restore(flags);
-	put_cpu();

 	VM_BUG_ON(bad_range(zone, page));
 	if (prep_new_page(page, order, gfp_flags))
@@ -1243,7 +1239,6 @@ again:

 failed:
 	local_irq_restore(flags);
-	put_cpu();
 	return NULL;
 }

@@ -2172,7 +2167,7 @@ void show_free_areas(void)
 		for_each_online_cpu(cpu) {
 			struct per_cpu_pageset *pageset;

-			pageset = zone_pcp(zone, cpu);
+			pageset = per_cpu_ptr(zone->pageset, cpu);

 			printk("CPU %4d: hi:%5d, btch:%4d usd:%4d\n",
 			       cpu, pageset->pcp.high,
@@ -3087,7 +3082,6 @@ static void setup_pagelist_highmark(stru
 }


-#ifdef CONFIG_NUMA
 /*
  * Boot pageset table. One per cpu which is going to be used for all
  * zones and all nodes. The parameters will be set in such a way
@@ -3095,112 +3089,68 @@ static void setup_pagelist_highmark(stru
  * the buddy list. This is safe since pageset manipulation is done
  * with interrupts disabled.
  *
- * Some NUMA counter updates may also be caught by the boot pagesets.
- *
- * The boot_pagesets must be kept even after bootup is complete for
- * unused processors and/or zones. They do play a role for bootstrapping
- * hotplugged processors.
+ * Some counter updates may also be caught by the boot pagesets.
  *
  * zoneinfo_show() and maybe other functions do
  * not check if the processor is online before following the pageset pointer.
  * Other parts of the kernel may not check if the zone is available.
  */
-static struct per_cpu_pageset boot_pageset[NR_CPUS];
-
-/*
- * Dynamically allocate memory for the
- * per cpu pageset array in struct zone.
- */
-static int __cpuinit process_zones(int cpu)
-{
-	struct zone *zone, *dzone;
-	int node = cpu_to_node(cpu);
-
-	node_set_state(node, N_CPU);	/* this node has a cpu */
-
-	for_each_populated_zone(zone) {
-		zone_pcp(zone, cpu) = kmalloc_node(sizeof(struct per_cpu_pageset),
-					 GFP_KERNEL, node);
-		if (!zone_pcp(zone, cpu))
-			goto bad;
-
-		setup_pageset(zone_pcp(zone, cpu), zone_batchsize(zone));
-
-		if (percpu_pagelist_fraction)
-			setup_pagelist_highmark(zone_pcp(zone, cpu),
-			 	(zone->present_pages / percpu_pagelist_fraction));
-	}
-
-	return 0;
-bad:
-	for_each_zone(dzone) {
-		if (!populated_zone(dzone))
-			continue;
-		if (dzone == zone)
-			break;
-		kfree(zone_pcp(dzone, cpu));
-		zone_pcp(dzone, cpu) = &boot_pageset[cpu];
-	}
-	return -ENOMEM;
-}
-
-static inline void free_zone_pagesets(int cpu)
-{
-	struct zone *zone;
-
-	for_each_zone(zone) {
-		struct per_cpu_pageset *pset = zone_pcp(zone, cpu);
-
-		/* Free per_cpu_pageset if it is slab allocated */
-		if (pset != &boot_pageset[cpu])
-			kfree(pset);
-		zone_pcp(zone, cpu) = &boot_pageset[cpu];
-	}
-}
+static DEFINE_PER_CPU(struct per_cpu_pageset, boot_pageset);

 static int __cpuinit pageset_cpuup_callback(struct notifier_block *nfb,
 		unsigned long action,
 		void *hcpu)
 {
 	int cpu = (long)hcpu;
-	int ret = NOTIFY_OK;

 	switch (action) {
 	case CPU_UP_PREPARE:
 	case CPU_UP_PREPARE_FROZEN:
-		if (process_zones(cpu))
-			ret = NOTIFY_BAD;
-		break;
-	case CPU_UP_CANCELED:
-	case CPU_UP_CANCELED_FROZEN:
-	case CPU_DEAD:
-	case CPU_DEAD_FROZEN:
-		free_zone_pagesets(cpu);
+		node_set_state(cpu_to_node(cpu), N_CPU);
 		break;
 	default:
 		break;
 	}
-	return ret;
+	return NOTIFY_OK;
 }

 static struct notifier_block __cpuinitdata pageset_notifier =
 	{ &pageset_cpuup_callback, NULL, 0 };

+/*
+ * Allocate per cpu pagesets and initialize them.
+ * Before this call only boot pagesets were available.
+ * Boot pagesets will no longer be used by this processorr
+ * after setup_per_cpu_pageset().
+ */
 void __init setup_per_cpu_pageset(void)
 {
-	int err;
+	struct zone *zone;
+	int cpu;

-	/* Initialize per_cpu_pageset for cpu 0.
-	 * A cpuup callback will do this for every cpu
-	 * as it comes online
+	for_each_populated_zone(zone) {
+		zone->pageset = alloc_percpu(struct per_cpu_pageset);
+
+		for_each_possible_cpu(cpu) {
+			struct per_cpu_pageset *pcp = per_cpu_ptr(zone->pageset, cpu);
+
+			setup_pageset(pcp, zone_batchsize(zone));
+
+			if (percpu_pagelist_fraction)
+				setup_pagelist_highmark(pcp,
+					(zone->present_pages /
+						percpu_pagelist_fraction));
+		}
+	}
+
+	/*
+	 * The boot cpu is always the first active.
+	 * The boot node has a processor
 	 */
-	err = process_zones(smp_processor_id());
-	BUG_ON(err);
+	node_set_state(cpu_to_node(smp_processor_id()), N_CPU);
 	register_cpu_notifier(&pageset_notifier);
 }

-#endif
-
 static noinline __init_refok
 int zone_wait_table_init(struct zone *zone, unsigned long zone_size_pages)
 {
@@ -3254,7 +3204,7 @@ static int __zone_pcp_update(void *data)
 		struct per_cpu_pageset *pset;
 		struct per_cpu_pages *pcp;

-		pset = zone_pcp(zone, cpu);
+		pset = per_cpu_ptr(zone->pageset, cpu);
 		pcp = &pset->pcp;

 		local_irq_save(flags);
@@ -3272,15 +3222,7 @@ void zone_pcp_update(struct zone *zone)

 /*
  * Early setup of pagesets.
- *
- * In the NUMA case the pageset setup simply results in all zones pcp
- * pointer being directed at a per cpu pageset with zero batchsize.
- *
- * This means that every free and every allocation occurs directly from
- * the buddy allocator tables.
- *
- * The pageset never queues pages during early boot and is therefore usable
- * for every type of zone.
+ * At this point various allocators are not operational yet.
  */
 __meminit void setup_pagesets(void)
 {
@@ -3288,23 +3230,15 @@ __meminit void setup_pagesets(void)
 	struct zone *zone;

 	for_each_zone(zone) {
-#ifdef CONFIG_NUMA
-		unsigned long batch = 0;
-
-		for (cpu = 0; cpu < NR_CPUS; cpu++) {
-			/* Early boot. Slab allocator not functional yet */
-			zone_pcp(zone, cpu) = &boot_pageset[cpu];
-		}
-#else
-		unsigned long batch = zone_batchsize(zone);
-#endif
+		zone->pageset = &per_cpu_var(boot_pageset);

+		/*
+		 * Special pagesets with one element so that frees
+		 * and allocations are not buffered at all.
+		 */
 		for_each_possible_cpu(cpu)
-			setup_pageset(zone_pcp(zone, cpu), batch);
+			setup_pageset(per_cpu_ptr(zone->pageset, cpu), 1);

-		if (zone->present_pages)
-			printk(KERN_DEBUG "  %s zone: %lu pages, LIFO batch:%lu\n",
-				zone->name, zone->present_pages, batch);
 	}
 }

@@ -4818,10 +4752,11 @@ int percpu_pagelist_fraction_sysctl_hand
 	if (!write || (ret == -EINVAL))
 		return ret;
 	for_each_populated_zone(zone) {
-		for_each_online_cpu(cpu) {
+		for_each_possible_cpu(cpu) {
 			unsigned long  high;
 			high = zone->present_pages / percpu_pagelist_fraction;
-			setup_pagelist_highmark(zone_pcp(zone, cpu), high);
+			setup_pagelist_highmark(
+				per_cpu_ptr(zone->pageset, cpu), high);
 		}
 	}
 	return 0;
Index: linux-2.6/mm/vmstat.c
===================================================================
--- linux-2.6.orig/mm/vmstat.c	2009-10-06 12:48:46.000000000 -0500
+++ linux-2.6/mm/vmstat.c	2009-10-06 13:59:23.000000000 -0500
@@ -139,7 +139,8 @@ static void refresh_zone_stat_thresholds
 		threshold = calculate_threshold(zone);

 		for_each_online_cpu(cpu)
-			zone_pcp(zone, cpu)->stat_threshold = threshold;
+			per_cpu_ptr(zone->pageset, cpu)->stat_threshold
+							= threshold;
 	}
 }

@@ -149,7 +150,8 @@ static void refresh_zone_stat_thresholds
 void __mod_zone_page_state(struct zone *zone, enum zone_stat_item item,
 				int delta)
 {
-	struct per_cpu_pageset *pcp = zone_pcp(zone, smp_processor_id());
+	struct per_cpu_pageset *pcp = this_cpu_ptr(zone->pageset);
+
 	s8 *p = pcp->vm_stat_diff + item;
 	long x;

@@ -202,7 +204,7 @@ EXPORT_SYMBOL(mod_zone_page_state);
  */
 void __inc_zone_state(struct zone *zone, enum zone_stat_item item)
 {
-	struct per_cpu_pageset *pcp = zone_pcp(zone, smp_processor_id());
+	struct per_cpu_pageset *pcp = this_cpu_ptr(zone->pageset);
 	s8 *p = pcp->vm_stat_diff + item;

 	(*p)++;
@@ -223,7 +225,7 @@ EXPORT_SYMBOL(__inc_zone_page_state);

 void __dec_zone_state(struct zone *zone, enum zone_stat_item item)
 {
-	struct per_cpu_pageset *pcp = zone_pcp(zone, smp_processor_id());
+	struct per_cpu_pageset *pcp = this_cpu_ptr(zone->pageset);
 	s8 *p = pcp->vm_stat_diff + item;

 	(*p)--;
@@ -300,7 +302,7 @@ void refresh_cpu_vm_stats(int cpu)
 	for_each_populated_zone(zone) {
 		struct per_cpu_pageset *p;

-		p = zone_pcp(zone, cpu);
+		p = per_cpu_ptr(zone->pageset, cpu);

 		for (i = 0; i < NR_VM_ZONE_STAT_ITEMS; i++)
 			if (p->vm_stat_diff[i]) {
@@ -738,7 +740,7 @@ static void zoneinfo_show_print(struct s
 	for_each_online_cpu(i) {
 		struct per_cpu_pageset *pageset;

-		pageset = zone_pcp(zone, i);
+		pageset = per_cpu_ptr(zone->pageset, i);
 		seq_printf(m,
 			   "\n    cpu: %i"
 			   "\n              count: %i"

^ permalink raw reply	[flat|nested] 65+ messages in thread

* Re: [this_cpu_xx V4 13/20] this_cpu_ops: page allocator conversion
  2009-10-06 19:06                       ` Christoph Lameter
@ 2009-10-07 10:42                         ` Mel Gorman
  0 siblings, 0 replies; 65+ messages in thread
From: Mel Gorman @ 2009-10-07 10:42 UTC (permalink / raw)
  To: Christoph Lameter
  Cc: akpm, linux-kernel, Tejun Heo, mingo, rusty, Pekka Enberg

On Tue, Oct 06, 2009 at 03:06:27PM -0400, Christoph Lameter wrote:
> On Tue, 6 Oct 2009, Mel Gorman wrote:
> 
> > > --- linux-2.6.orig/mm/page_alloc.c	2009-10-06 12:41:19.000000000 -0500
> > > +++ linux-2.6/mm/page_alloc.c	2009-10-06 12:43:27.000000000 -0500
> > > @@ -1011,7 +1011,7 @@ static void drain_pages(unsigned int cpu
> > >  		struct per_cpu_pageset *pset;
> > >  		struct per_cpu_pages *pcp;
> > >
> > > -		pset = zone_pcp(zone, cpu);
> > > +		pset = per_cpu_ptr(zone->pageset, cpu);
> > >
> > >  		pcp = &pset->pcp;
> > >  		local_irq_save(flags);
> >
> > It's not your fault and it doesn't actually matter to the current callers
> > of drain_pages, but you might as well move the per_cpu_ptr inside the
> > local_irq_save() here as well while you're changing here.
> 
> The comments before drain_pages() clearly state that the caller must be
> pinned to a processor. But lets change it for consistencies sake.
> 

I noted the comment all right hence me saying that it doesn't matter to
the current callers because they obey the rules.

It was consistency I was looking at but I should have kept quiet because
there are a few oddities like this. It doesn't hurt to fix though.

> > > -	cpu  = get_cpu();
> > >  	if (likely(order == 0)) {
> > >  		struct per_cpu_pages *pcp;
> > >  		struct list_head *list;
> > >
> > > -		pcp = &zone_pcp(zone, cpu)->pcp;
> > > +		pcp = &this_cpu_ptr(zone->pageset)->pcp;
> > >  		list = &pcp->lists[migratetype];
> > >  		local_irq_save(flags);
> >
> > I believe this falls foul of the same problem as in the free path. We
> > are no longer preempt safe and this_cpu_ptr() needs to move within the
> > local_irq_save().
> 
> Ok.
> 
> From: Christoph Lameter <cl@linux-foundation.org>
> Subject: this_cpu_ops: page allocator conversion
> 
> Use the per cpu allocator functionality to avoid per cpu arrays in struct zone.
> 
> This drastically reduces the size of struct zone for systems with large
> amounts of processors and allows placement of critical variables of struct
> zone in one cacheline even on very large systems.
> 
> Another effect is that the pagesets of one processor are placed near one
> another. If multiple pagesets from different zones fit into one cacheline
> then additional cacheline fetches can be avoided on the hot paths when
> allocating memory from multiple zones.
> 
> Bootstrap becomes simpler if we use the same scheme for UP, SMP, NUMA. #ifdefs
> are reduced and we can drop the zone_pcp macro.
> 
> Hotplug handling is also simplified since cpu alloc can bring up and
> shut down cpu areas for a specific cpu as a whole. So there is no need to
> allocate or free individual pagesets.
> 
> Cc: Mel Gorman <mel@csn.ul.ie>
> Signed-off-by: Christoph Lameter <cl@linux-foundation.org>
> 

I can't see anything else to complain about. Performance figures would
be nice but otherwise

Reviewed-by: Mel Gorman <mel@csn.ul.ie>

Thanks

> ---
>  include/linux/mm.h     |    4 -
>  include/linux/mmzone.h |   12 ---
>  mm/page_alloc.c        |  161 ++++++++++++++-----------------------------------
>  mm/vmstat.c            |   14 ++--
>  4 files changed, 58 insertions(+), 133 deletions(-)
> 
> Index: linux-2.6/include/linux/mm.h
> ===================================================================
> --- linux-2.6.orig/include/linux/mm.h	2009-10-06 13:54:25.000000000 -0500
> +++ linux-2.6/include/linux/mm.h	2009-10-06 13:54:25.000000000 -0500
> @@ -1062,11 +1062,7 @@ extern void si_meminfo_node(struct sysin
>  extern int after_bootmem;
>  extern void setup_pagesets(void);
> 
> -#ifdef CONFIG_NUMA
>  extern void setup_per_cpu_pageset(void);
> -#else
> -static inline void setup_per_cpu_pageset(void) {}
> -#endif
> 
>  extern void zone_pcp_update(struct zone *zone);
> 
> Index: linux-2.6/include/linux/mmzone.h
> ===================================================================
> --- linux-2.6.orig/include/linux/mmzone.h	2009-10-06 12:48:46.000000000 -0500
> +++ linux-2.6/include/linux/mmzone.h	2009-10-06 13:54:25.000000000 -0500
> @@ -184,13 +184,7 @@ struct per_cpu_pageset {
>  	s8 stat_threshold;
>  	s8 vm_stat_diff[NR_VM_ZONE_STAT_ITEMS];
>  #endif
> -} ____cacheline_aligned_in_smp;
> -
> -#ifdef CONFIG_NUMA
> -#define zone_pcp(__z, __cpu) ((__z)->pageset[(__cpu)])
> -#else
> -#define zone_pcp(__z, __cpu) (&(__z)->pageset[(__cpu)])
> -#endif
> +};
> 
>  #endif /* !__GENERATING_BOUNDS.H */
> 
> @@ -306,10 +300,8 @@ struct zone {
>  	 */
>  	unsigned long		min_unmapped_pages;
>  	unsigned long		min_slab_pages;
> -	struct per_cpu_pageset	*pageset[NR_CPUS];
> -#else
> -	struct per_cpu_pageset	pageset[NR_CPUS];
>  #endif
> +	struct per_cpu_pageset	*pageset;
>  	/*
>  	 * free areas of different sizes
>  	 */
> Index: linux-2.6/mm/page_alloc.c
> ===================================================================
> --- linux-2.6.orig/mm/page_alloc.c	2009-10-06 13:54:25.000000000 -0500
> +++ linux-2.6/mm/page_alloc.c	2009-10-06 13:59:27.000000000 -0500
> @@ -1011,10 +1011,10 @@ static void drain_pages(unsigned int cpu
>  		struct per_cpu_pageset *pset;
>  		struct per_cpu_pages *pcp;
> 
> -		pset = zone_pcp(zone, cpu);
> +		local_irq_save(flags);
> +		pset = per_cpu_ptr(zone->pageset, cpu);
> 
>  		pcp = &pset->pcp;
> -		local_irq_save(flags);
>  		free_pcppages_bulk(zone, pcp->count, pcp);
>  		pcp->count = 0;
>  		local_irq_restore(flags);
> @@ -1098,7 +1098,6 @@ static void free_hot_cold_page(struct pa
>  	arch_free_page(page, 0);
>  	kernel_map_pages(page, 1, 0);
> 
> -	pcp = &zone_pcp(zone, get_cpu())->pcp;
>  	migratetype = get_pageblock_migratetype(page);
>  	set_page_private(page, migratetype);
>  	local_irq_save(flags);
> @@ -1121,6 +1120,7 @@ static void free_hot_cold_page(struct pa
>  		migratetype = MIGRATE_MOVABLE;
>  	}
> 
> +	pcp = &this_cpu_ptr(zone->pageset)->pcp;
>  	if (cold)
>  		list_add_tail(&page->lru, &pcp->lists[migratetype]);
>  	else
> @@ -1133,7 +1133,6 @@ static void free_hot_cold_page(struct pa
> 
>  out:
>  	local_irq_restore(flags);
> -	put_cpu();
>  }
> 
>  void free_hot_page(struct page *page)
> @@ -1183,17 +1182,15 @@ struct page *buffered_rmqueue(struct zon
>  	unsigned long flags;
>  	struct page *page;
>  	int cold = !!(gfp_flags & __GFP_COLD);
> -	int cpu;
> 
>  again:
> -	cpu  = get_cpu();
>  	if (likely(order == 0)) {
>  		struct per_cpu_pages *pcp;
>  		struct list_head *list;
> 
> -		pcp = &zone_pcp(zone, cpu)->pcp;
> -		list = &pcp->lists[migratetype];
>  		local_irq_save(flags);
> +		pcp = &this_cpu_ptr(zone->pageset)->pcp;
> +		list = &pcp->lists[migratetype];
>  		if (list_empty(list)) {
>  			pcp->count += rmqueue_bulk(zone, 0,
>  					pcp->batch, list,
> @@ -1234,7 +1231,6 @@ again:
>  	__count_zone_vm_events(PGALLOC, zone, 1 << order);
>  	zone_statistics(preferred_zone, zone);
>  	local_irq_restore(flags);
> -	put_cpu();
> 
>  	VM_BUG_ON(bad_range(zone, page));
>  	if (prep_new_page(page, order, gfp_flags))
> @@ -1243,7 +1239,6 @@ again:
> 
>  failed:
>  	local_irq_restore(flags);
> -	put_cpu();
>  	return NULL;
>  }
> 
> @@ -2172,7 +2167,7 @@ void show_free_areas(void)
>  		for_each_online_cpu(cpu) {
>  			struct per_cpu_pageset *pageset;
> 
> -			pageset = zone_pcp(zone, cpu);
> +			pageset = per_cpu_ptr(zone->pageset, cpu);
> 
>  			printk("CPU %4d: hi:%5d, btch:%4d usd:%4d\n",
>  			       cpu, pageset->pcp.high,
> @@ -3087,7 +3082,6 @@ static void setup_pagelist_highmark(stru
>  }
> 
> 
> -#ifdef CONFIG_NUMA
>  /*
>   * Boot pageset table. One per cpu which is going to be used for all
>   * zones and all nodes. The parameters will be set in such a way
> @@ -3095,112 +3089,68 @@ static void setup_pagelist_highmark(stru
>   * the buddy list. This is safe since pageset manipulation is done
>   * with interrupts disabled.
>   *
> - * Some NUMA counter updates may also be caught by the boot pagesets.
> - *
> - * The boot_pagesets must be kept even after bootup is complete for
> - * unused processors and/or zones. They do play a role for bootstrapping
> - * hotplugged processors.
> + * Some counter updates may also be caught by the boot pagesets.
>   *
>   * zoneinfo_show() and maybe other functions do
>   * not check if the processor is online before following the pageset pointer.
>   * Other parts of the kernel may not check if the zone is available.
>   */
> -static struct per_cpu_pageset boot_pageset[NR_CPUS];
> -
> -/*
> - * Dynamically allocate memory for the
> - * per cpu pageset array in struct zone.
> - */
> -static int __cpuinit process_zones(int cpu)
> -{
> -	struct zone *zone, *dzone;
> -	int node = cpu_to_node(cpu);
> -
> -	node_set_state(node, N_CPU);	/* this node has a cpu */
> -
> -	for_each_populated_zone(zone) {
> -		zone_pcp(zone, cpu) = kmalloc_node(sizeof(struct per_cpu_pageset),
> -					 GFP_KERNEL, node);
> -		if (!zone_pcp(zone, cpu))
> -			goto bad;
> -
> -		setup_pageset(zone_pcp(zone, cpu), zone_batchsize(zone));
> -
> -		if (percpu_pagelist_fraction)
> -			setup_pagelist_highmark(zone_pcp(zone, cpu),
> -			 	(zone->present_pages / percpu_pagelist_fraction));
> -	}
> -
> -	return 0;
> -bad:
> -	for_each_zone(dzone) {
> -		if (!populated_zone(dzone))
> -			continue;
> -		if (dzone == zone)
> -			break;
> -		kfree(zone_pcp(dzone, cpu));
> -		zone_pcp(dzone, cpu) = &boot_pageset[cpu];
> -	}
> -	return -ENOMEM;
> -}
> -
> -static inline void free_zone_pagesets(int cpu)
> -{
> -	struct zone *zone;
> -
> -	for_each_zone(zone) {
> -		struct per_cpu_pageset *pset = zone_pcp(zone, cpu);
> -
> -		/* Free per_cpu_pageset if it is slab allocated */
> -		if (pset != &boot_pageset[cpu])
> -			kfree(pset);
> -		zone_pcp(zone, cpu) = &boot_pageset[cpu];
> -	}
> -}
> +static DEFINE_PER_CPU(struct per_cpu_pageset, boot_pageset);
> 
>  static int __cpuinit pageset_cpuup_callback(struct notifier_block *nfb,
>  		unsigned long action,
>  		void *hcpu)
>  {
>  	int cpu = (long)hcpu;
> -	int ret = NOTIFY_OK;
> 
>  	switch (action) {
>  	case CPU_UP_PREPARE:
>  	case CPU_UP_PREPARE_FROZEN:
> -		if (process_zones(cpu))
> -			ret = NOTIFY_BAD;
> -		break;
> -	case CPU_UP_CANCELED:
> -	case CPU_UP_CANCELED_FROZEN:
> -	case CPU_DEAD:
> -	case CPU_DEAD_FROZEN:
> -		free_zone_pagesets(cpu);
> +		node_set_state(cpu_to_node(cpu), N_CPU);
>  		break;
>  	default:
>  		break;
>  	}
> -	return ret;
> +	return NOTIFY_OK;
>  }
> 
>  static struct notifier_block __cpuinitdata pageset_notifier =
>  	{ &pageset_cpuup_callback, NULL, 0 };
> 
> +/*
> + * Allocate per cpu pagesets and initialize them.
> + * Before this call only boot pagesets were available.
> + * Boot pagesets will no longer be used by this processorr
> + * after setup_per_cpu_pageset().
> + */
>  void __init setup_per_cpu_pageset(void)
>  {
> -	int err;
> +	struct zone *zone;
> +	int cpu;
> 
> -	/* Initialize per_cpu_pageset for cpu 0.
> -	 * A cpuup callback will do this for every cpu
> -	 * as it comes online
> +	for_each_populated_zone(zone) {
> +		zone->pageset = alloc_percpu(struct per_cpu_pageset);
> +
> +		for_each_possible_cpu(cpu) {
> +			struct per_cpu_pageset *pcp = per_cpu_ptr(zone->pageset, cpu);
> +
> +			setup_pageset(pcp, zone_batchsize(zone));
> +
> +			if (percpu_pagelist_fraction)
> +				setup_pagelist_highmark(pcp,
> +					(zone->present_pages /
> +						percpu_pagelist_fraction));
> +		}
> +	}
> +
> +	/*
> +	 * The boot cpu is always the first active.
> +	 * The boot node has a processor
>  	 */
> -	err = process_zones(smp_processor_id());
> -	BUG_ON(err);
> +	node_set_state(cpu_to_node(smp_processor_id()), N_CPU);
>  	register_cpu_notifier(&pageset_notifier);
>  }
> 
> -#endif
> -
>  static noinline __init_refok
>  int zone_wait_table_init(struct zone *zone, unsigned long zone_size_pages)
>  {
> @@ -3254,7 +3204,7 @@ static int __zone_pcp_update(void *data)
>  		struct per_cpu_pageset *pset;
>  		struct per_cpu_pages *pcp;
> 
> -		pset = zone_pcp(zone, cpu);
> +		pset = per_cpu_ptr(zone->pageset, cpu);
>  		pcp = &pset->pcp;
> 
>  		local_irq_save(flags);
> @@ -3272,15 +3222,7 @@ void zone_pcp_update(struct zone *zone)
> 
>  /*
>   * Early setup of pagesets.
> - *
> - * In the NUMA case the pageset setup simply results in all zones pcp
> - * pointer being directed at a per cpu pageset with zero batchsize.
> - *
> - * This means that every free and every allocation occurs directly from
> - * the buddy allocator tables.
> - *
> - * The pageset never queues pages during early boot and is therefore usable
> - * for every type of zone.
> + * At this point various allocators are not operational yet.
>   */
>  __meminit void setup_pagesets(void)
>  {
> @@ -3288,23 +3230,15 @@ __meminit void setup_pagesets(void)
>  	struct zone *zone;
> 
>  	for_each_zone(zone) {
> -#ifdef CONFIG_NUMA
> -		unsigned long batch = 0;
> -
> -		for (cpu = 0; cpu < NR_CPUS; cpu++) {
> -			/* Early boot. Slab allocator not functional yet */
> -			zone_pcp(zone, cpu) = &boot_pageset[cpu];
> -		}
> -#else
> -		unsigned long batch = zone_batchsize(zone);
> -#endif
> +		zone->pageset = &per_cpu_var(boot_pageset);
> 
> +		/*
> +		 * Special pagesets with one element so that frees
> +		 * and allocations are not buffered at all.
> +		 */
>  		for_each_possible_cpu(cpu)
> -			setup_pageset(zone_pcp(zone, cpu), batch);
> +			setup_pageset(per_cpu_ptr(zone->pageset, cpu), 1);
> 
> -		if (zone->present_pages)
> -			printk(KERN_DEBUG "  %s zone: %lu pages, LIFO batch:%lu\n",
> -				zone->name, zone->present_pages, batch);
>  	}
>  }
> 
> @@ -4818,10 +4752,11 @@ int percpu_pagelist_fraction_sysctl_hand
>  	if (!write || (ret == -EINVAL))
>  		return ret;
>  	for_each_populated_zone(zone) {
> -		for_each_online_cpu(cpu) {
> +		for_each_possible_cpu(cpu) {
>  			unsigned long  high;
>  			high = zone->present_pages / percpu_pagelist_fraction;
> -			setup_pagelist_highmark(zone_pcp(zone, cpu), high);
> +			setup_pagelist_highmark(
> +				per_cpu_ptr(zone->pageset, cpu), high);
>  		}
>  	}
>  	return 0;
> Index: linux-2.6/mm/vmstat.c
> ===================================================================
> --- linux-2.6.orig/mm/vmstat.c	2009-10-06 12:48:46.000000000 -0500
> +++ linux-2.6/mm/vmstat.c	2009-10-06 13:59:23.000000000 -0500
> @@ -139,7 +139,8 @@ static void refresh_zone_stat_thresholds
>  		threshold = calculate_threshold(zone);
> 
>  		for_each_online_cpu(cpu)
> -			zone_pcp(zone, cpu)->stat_threshold = threshold;
> +			per_cpu_ptr(zone->pageset, cpu)->stat_threshold
> +							= threshold;
>  	}
>  }
> 
> @@ -149,7 +150,8 @@ static void refresh_zone_stat_thresholds
>  void __mod_zone_page_state(struct zone *zone, enum zone_stat_item item,
>  				int delta)
>  {
> -	struct per_cpu_pageset *pcp = zone_pcp(zone, smp_processor_id());
> +	struct per_cpu_pageset *pcp = this_cpu_ptr(zone->pageset);
> +
>  	s8 *p = pcp->vm_stat_diff + item;
>  	long x;
> 
> @@ -202,7 +204,7 @@ EXPORT_SYMBOL(mod_zone_page_state);
>   */
>  void __inc_zone_state(struct zone *zone, enum zone_stat_item item)
>  {
> -	struct per_cpu_pageset *pcp = zone_pcp(zone, smp_processor_id());
> +	struct per_cpu_pageset *pcp = this_cpu_ptr(zone->pageset);
>  	s8 *p = pcp->vm_stat_diff + item;
> 
>  	(*p)++;
> @@ -223,7 +225,7 @@ EXPORT_SYMBOL(__inc_zone_page_state);
> 
>  void __dec_zone_state(struct zone *zone, enum zone_stat_item item)
>  {
> -	struct per_cpu_pageset *pcp = zone_pcp(zone, smp_processor_id());
> +	struct per_cpu_pageset *pcp = this_cpu_ptr(zone->pageset);
>  	s8 *p = pcp->vm_stat_diff + item;
> 
>  	(*p)--;
> @@ -300,7 +302,7 @@ void refresh_cpu_vm_stats(int cpu)
>  	for_each_populated_zone(zone) {
>  		struct per_cpu_pageset *p;
> 
> -		p = zone_pcp(zone, cpu);
> +		p = per_cpu_ptr(zone->pageset, cpu);
> 
>  		for (i = 0; i < NR_VM_ZONE_STAT_ITEMS; i++)
>  			if (p->vm_stat_diff[i]) {
> @@ -738,7 +740,7 @@ static void zoneinfo_show_print(struct s
>  	for_each_online_cpu(i) {
>  		struct per_cpu_pageset *pageset;
> 
> -		pageset = zone_pcp(zone, i);
> +		pageset = per_cpu_ptr(zone->pageset, i);
>  		seq_printf(m,
>  			   "\n    cpu: %i"
>  			   "\n              count: %i"
> 

-- 
Mel Gorman
Part-time Phd Student                          Linux Technology Center
University of Limerick                         IBM Dublin Software Lab

^ permalink raw reply	[flat|nested] 65+ messages in thread

* [this_cpu_xx V4 14/20] this_cpu ops: Remove pageset_notifier
  2009-10-01 21:25 [this_cpu_xx V4 00/20] Introduce per cpu atomic operations and avoid per cpu address arithmetic cl
                   ` (12 preceding siblings ...)
  2009-10-01 21:25 ` [this_cpu_xx V4 13/20] this_cpu_ops: page allocator conversion cl
@ 2009-10-01 21:25 ` cl
  2009-10-01 21:25 ` [this_cpu_xx V4 15/20] Use this_cpu operations in slub cl
                   ` (6 subsequent siblings)
  20 siblings, 0 replies; 65+ messages in thread
From: cl @ 2009-10-01 21:25 UTC (permalink / raw)
  To: akpm; +Cc: linux-kernel, Tejun Heo, mingo, rusty, Pekka Enberg

[-- Attachment #1: this_cpu_remove_pageset_notifier --]
[-- Type: text/plain, Size: 2044 bytes --]

Remove the pageset notifier since it only marks that a processor
exists on a specific node. Move that code into the vmstat notifier.

Signed-off-by: Christoph Lameter <cl@linux-foundation.org>

---
 mm/page_alloc.c |   27 ---------------------------
 mm/vmstat.c     |    1 +
 2 files changed, 1 insertion(+), 27 deletions(-)

Index: linux-2.6/mm/vmstat.c
===================================================================
--- linux-2.6.orig/mm/vmstat.c	2009-09-29 09:02:29.000000000 -0500
+++ linux-2.6/mm/vmstat.c	2009-09-29 09:04:18.000000000 -0500
@@ -906,6 +906,7 @@ static int __cpuinit vmstat_cpuup_callba
 	case CPU_ONLINE:
 	case CPU_ONLINE_FROZEN:
 		start_cpu_timer(cpu);
+		node_set_state(cpu_to_node(cpu), N_CPU);
 		break;
 	case CPU_DOWN_PREPARE:
 	case CPU_DOWN_PREPARE_FROZEN:
Index: linux-2.6/mm/page_alloc.c
===================================================================
--- linux-2.6.orig/mm/page_alloc.c	2009-09-29 09:04:16.000000000 -0500
+++ linux-2.6/mm/page_alloc.c	2009-09-29 09:04:18.000000000 -0500
@@ -3097,26 +3097,6 @@ static void setup_pagelist_highmark(stru
  */
 static DEFINE_PER_CPU(struct per_cpu_pageset, boot_pageset);
 
-static int __cpuinit pageset_cpuup_callback(struct notifier_block *nfb,
-		unsigned long action,
-		void *hcpu)
-{
-	int cpu = (long)hcpu;
-
-	switch (action) {
-	case CPU_UP_PREPARE:
-	case CPU_UP_PREPARE_FROZEN:
-		node_set_state(cpu_to_node(cpu), N_CPU);
-		break;
-	default:
-		break;
-	}
-	return NOTIFY_OK;
-}
-
-static struct notifier_block __cpuinitdata pageset_notifier =
-	{ &pageset_cpuup_callback, NULL, 0 };
-
 /*
  * Allocate per cpu pagesets and initialize them.
  * Before this call only boot pagesets were available.
@@ -3141,13 +3121,6 @@ void __init setup_per_cpu_pageset(void)
 						percpu_pagelist_fraction));
 		}
 	}
-
-	/*
-	 * The boot cpu is always the first active.
-	 * The boot node has a processor
-	 */
-	node_set_state(cpu_to_node(smp_processor_id()), N_CPU);
-	register_cpu_notifier(&pageset_notifier);
 }
 
 static noinline __init_refok

-- 

^ permalink raw reply	[flat|nested] 65+ messages in thread

* [this_cpu_xx V4 15/20] Use this_cpu operations in slub
  2009-10-01 21:25 [this_cpu_xx V4 00/20] Introduce per cpu atomic operations and avoid per cpu address arithmetic cl
                   ` (13 preceding siblings ...)
  2009-10-01 21:25 ` [this_cpu_xx V4 14/20] this_cpu ops: Remove pageset_notifier cl
@ 2009-10-01 21:25 ` cl
  2009-10-01 21:25 ` [this_cpu_xx V4 16/20] SLUB: Get rid of dynamic DMA kmalloc cache allocation cl
                   ` (5 subsequent siblings)
  20 siblings, 0 replies; 65+ messages in thread
From: cl @ 2009-10-01 21:25 UTC (permalink / raw)
  To: akpm; +Cc: linux-kernel, Pekka Enberg, Tejun Heo, mingo, rusty

[-- Attachment #1: this_cpu_slub_conversion --]
[-- Type: text/plain, Size: 12709 bytes --]

Using per cpu allocations removes the needs for the per cpu arrays in the
kmem_cache struct. These could get quite big if we have to support systems
with thousands of cpus. The use of this_cpu_xx operations results in:

1. The size of kmem_cache for SMP configuration shrinks since we will only
   need 1 pointer instead of NR_CPUS. The same pointer can be used by all
   processors. Reduces cache footprint of the allocator.

2. We can dynamically size kmem_cache according to the actual nodes in the
   system meaning less memory overhead for configurations that may potentially
   support up to 1k NUMA nodes / 4k cpus.

3. We can remove the diddle widdle with allocating and releasing of
   kmem_cache_cpu structures when bringing up and shutting down cpus. The cpu
   alloc logic will do it all for us. Removes some portions of the cpu hotplug
   functionality.

4. Fastpath performance increases since per cpu pointer lookups and
   address calculations are avoided.

V2->V3:
- Leave Linus' code ornament alone.

Cc: Pekka Enberg <penberg@cs.helsinki.fi>
Signed-off-by: Christoph Lameter <cl@linux-foundation.org>

---
 include/linux/slub_def.h |    6 -
 mm/slub.c                |  207 ++++++++++-------------------------------------
 2 files changed, 49 insertions(+), 164 deletions(-)

Index: linux-2.6/include/linux/slub_def.h
===================================================================
--- linux-2.6.orig/include/linux/slub_def.h	2009-09-17 17:51:51.000000000 -0500
+++ linux-2.6/include/linux/slub_def.h	2009-09-29 09:02:05.000000000 -0500
@@ -69,6 +69,7 @@ struct kmem_cache_order_objects {
  * Slab cache management.
  */
 struct kmem_cache {
+	struct kmem_cache_cpu *cpu_slab;
 	/* Used for retriving partial slabs etc */
 	unsigned long flags;
 	int size;		/* The size of an object including meta data */
@@ -104,11 +105,6 @@ struct kmem_cache {
 	int remote_node_defrag_ratio;
 	struct kmem_cache_node *node[MAX_NUMNODES];
 #endif
-#ifdef CONFIG_SMP
-	struct kmem_cache_cpu *cpu_slab[NR_CPUS];
-#else
-	struct kmem_cache_cpu cpu_slab;
-#endif
 };
 
 /*
Index: linux-2.6/mm/slub.c
===================================================================
--- linux-2.6.orig/mm/slub.c	2009-09-28 10:08:10.000000000 -0500
+++ linux-2.6/mm/slub.c	2009-09-29 09:02:05.000000000 -0500
@@ -242,15 +242,6 @@ static inline struct kmem_cache_node *ge
 #endif
 }
 
-static inline struct kmem_cache_cpu *get_cpu_slab(struct kmem_cache *s, int cpu)
-{
-#ifdef CONFIG_SMP
-	return s->cpu_slab[cpu];
-#else
-	return &s->cpu_slab;
-#endif
-}
-
 /* Verify that a pointer has an address that is valid within a slab page */
 static inline int check_valid_pointer(struct kmem_cache *s,
 				struct page *page, const void *object)
@@ -1124,7 +1115,7 @@ static struct page *allocate_slab(struct
 		if (!page)
 			return NULL;
 
-		stat(get_cpu_slab(s, raw_smp_processor_id()), ORDER_FALLBACK);
+		stat(this_cpu_ptr(s->cpu_slab), ORDER_FALLBACK);
 	}
 
 	if (kmemcheck_enabled
@@ -1422,7 +1413,7 @@ static struct page *get_partial(struct k
 static void unfreeze_slab(struct kmem_cache *s, struct page *page, int tail)
 {
 	struct kmem_cache_node *n = get_node(s, page_to_nid(page));
-	struct kmem_cache_cpu *c = get_cpu_slab(s, smp_processor_id());
+	struct kmem_cache_cpu *c = this_cpu_ptr(s->cpu_slab);
 
 	__ClearPageSlubFrozen(page);
 	if (page->inuse) {
@@ -1454,7 +1445,7 @@ static void unfreeze_slab(struct kmem_ca
 			slab_unlock(page);
 		} else {
 			slab_unlock(page);
-			stat(get_cpu_slab(s, raw_smp_processor_id()), FREE_SLAB);
+			stat(__this_cpu_ptr(s->cpu_slab), FREE_SLAB);
 			discard_slab(s, page);
 		}
 	}
@@ -1507,7 +1498,7 @@ static inline void flush_slab(struct kme
  */
 static inline void __flush_cpu_slab(struct kmem_cache *s, int cpu)
 {
-	struct kmem_cache_cpu *c = get_cpu_slab(s, cpu);
+	struct kmem_cache_cpu *c = per_cpu_ptr(s->cpu_slab, cpu);
 
 	if (likely(c && c->page))
 		flush_slab(s, c);
@@ -1673,7 +1661,7 @@ new_slab:
 		local_irq_disable();
 
 	if (new) {
-		c = get_cpu_slab(s, smp_processor_id());
+		c = __this_cpu_ptr(s->cpu_slab);
 		stat(c, ALLOC_SLAB);
 		if (c->page)
 			flush_slab(s, c);
@@ -1711,7 +1699,6 @@ static __always_inline void *slab_alloc(
 	void **object;
 	struct kmem_cache_cpu *c;
 	unsigned long flags;
-	unsigned int objsize;
 
 	gfpflags &= gfp_allowed_mask;
 
@@ -1722,24 +1709,23 @@ static __always_inline void *slab_alloc(
 		return NULL;
 
 	local_irq_save(flags);
-	c = get_cpu_slab(s, smp_processor_id());
-	objsize = c->objsize;
-	if (unlikely(!c->freelist || !node_match(c, node)))
+	c = __this_cpu_ptr(s->cpu_slab);
+	object = c->freelist;
+	if (unlikely(!object || !node_match(c, node)))
 
 		object = __slab_alloc(s, gfpflags, node, addr, c);
 
 	else {
-		object = c->freelist;
 		c->freelist = object[c->offset];
 		stat(c, ALLOC_FASTPATH);
 	}
 	local_irq_restore(flags);
 
 	if (unlikely((gfpflags & __GFP_ZERO) && object))
-		memset(object, 0, objsize);
+		memset(object, 0, s->objsize);
 
 	kmemcheck_slab_alloc(s, gfpflags, object, c->objsize);
-	kmemleak_alloc_recursive(object, objsize, 1, s->flags, gfpflags);
+	kmemleak_alloc_recursive(object, c->objsize, 1, s->flags, gfpflags);
 
 	return object;
 }
@@ -1800,7 +1786,7 @@ static void __slab_free(struct kmem_cach
 	void **object = (void *)x;
 	struct kmem_cache_cpu *c;
 
-	c = get_cpu_slab(s, raw_smp_processor_id());
+	c = __this_cpu_ptr(s->cpu_slab);
 	stat(c, FREE_SLOWPATH);
 	slab_lock(page);
 
@@ -1872,7 +1858,7 @@ static __always_inline void slab_free(st
 
 	kmemleak_free_recursive(x, s->flags);
 	local_irq_save(flags);
-	c = get_cpu_slab(s, smp_processor_id());
+	c = __this_cpu_ptr(s->cpu_slab);
 	kmemcheck_slab_free(s, object, c->objsize);
 	debug_check_no_locks_freed(object, c->objsize);
 	if (!(s->flags & SLAB_DEBUG_OBJECTS))
@@ -2095,130 +2081,28 @@ init_kmem_cache_node(struct kmem_cache_n
 #endif
 }
 
-#ifdef CONFIG_SMP
-/*
- * Per cpu array for per cpu structures.
- *
- * The per cpu array places all kmem_cache_cpu structures from one processor
- * close together meaning that it becomes possible that multiple per cpu
- * structures are contained in one cacheline. This may be particularly
- * beneficial for the kmalloc caches.
- *
- * A desktop system typically has around 60-80 slabs. With 100 here we are
- * likely able to get per cpu structures for all caches from the array defined
- * here. We must be able to cover all kmalloc caches during bootstrap.
- *
- * If the per cpu array is exhausted then fall back to kmalloc
- * of individual cachelines. No sharing is possible then.
- */
-#define NR_KMEM_CACHE_CPU 100
-
-static DEFINE_PER_CPU(struct kmem_cache_cpu [NR_KMEM_CACHE_CPU],
-		      kmem_cache_cpu);
-
-static DEFINE_PER_CPU(struct kmem_cache_cpu *, kmem_cache_cpu_free);
-static DECLARE_BITMAP(kmem_cach_cpu_free_init_once, CONFIG_NR_CPUS);
-
-static struct kmem_cache_cpu *alloc_kmem_cache_cpu(struct kmem_cache *s,
-							int cpu, gfp_t flags)
-{
-	struct kmem_cache_cpu *c = per_cpu(kmem_cache_cpu_free, cpu);
-
-	if (c)
-		per_cpu(kmem_cache_cpu_free, cpu) =
-				(void *)c->freelist;
-	else {
-		/* Table overflow: So allocate ourselves */
-		c = kmalloc_node(
-			ALIGN(sizeof(struct kmem_cache_cpu), cache_line_size()),
-			flags, cpu_to_node(cpu));
-		if (!c)
-			return NULL;
-	}
-
-	init_kmem_cache_cpu(s, c);
-	return c;
-}
-
-static void free_kmem_cache_cpu(struct kmem_cache_cpu *c, int cpu)
-{
-	if (c < per_cpu(kmem_cache_cpu, cpu) ||
-			c >= per_cpu(kmem_cache_cpu, cpu) + NR_KMEM_CACHE_CPU) {
-		kfree(c);
-		return;
-	}
-	c->freelist = (void *)per_cpu(kmem_cache_cpu_free, cpu);
-	per_cpu(kmem_cache_cpu_free, cpu) = c;
-}
-
-static void free_kmem_cache_cpus(struct kmem_cache *s)
-{
-	int cpu;
+static DEFINE_PER_CPU(struct kmem_cache_cpu, kmalloc_percpu[SLUB_PAGE_SHIFT]);
 
-	for_each_online_cpu(cpu) {
-		struct kmem_cache_cpu *c = get_cpu_slab(s, cpu);
-
-		if (c) {
-			s->cpu_slab[cpu] = NULL;
-			free_kmem_cache_cpu(c, cpu);
-		}
-	}
-}
-
-static int alloc_kmem_cache_cpus(struct kmem_cache *s, gfp_t flags)
-{
-	int cpu;
-
-	for_each_online_cpu(cpu) {
-		struct kmem_cache_cpu *c = get_cpu_slab(s, cpu);
-
-		if (c)
-			continue;
-
-		c = alloc_kmem_cache_cpu(s, cpu, flags);
-		if (!c) {
-			free_kmem_cache_cpus(s);
-			return 0;
-		}
-		s->cpu_slab[cpu] = c;
-	}
-	return 1;
-}
-
-/*
- * Initialize the per cpu array.
- */
-static void init_alloc_cpu_cpu(int cpu)
-{
-	int i;
-
-	if (cpumask_test_cpu(cpu, to_cpumask(kmem_cach_cpu_free_init_once)))
-		return;
-
-	for (i = NR_KMEM_CACHE_CPU - 1; i >= 0; i--)
-		free_kmem_cache_cpu(&per_cpu(kmem_cache_cpu, cpu)[i], cpu);
-
-	cpumask_set_cpu(cpu, to_cpumask(kmem_cach_cpu_free_init_once));
-}
-
-static void __init init_alloc_cpu(void)
+static inline int alloc_kmem_cache_cpus(struct kmem_cache *s, gfp_t flags)
 {
 	int cpu;
 
-	for_each_online_cpu(cpu)
-		init_alloc_cpu_cpu(cpu);
-  }
+	if (s < kmalloc_caches + SLUB_PAGE_SHIFT && s >= kmalloc_caches)
+		/*
+		 * Boot time creation of the kmalloc array. Use static per cpu data
+		 * since the per cpu allocator is not available yet.
+		 */
+		s->cpu_slab = per_cpu_var(kmalloc_percpu) + (s - kmalloc_caches);
+	else
+		s->cpu_slab =  alloc_percpu(struct kmem_cache_cpu);
 
-#else
-static inline void free_kmem_cache_cpus(struct kmem_cache *s) {}
-static inline void init_alloc_cpu(void) {}
+	if (!s->cpu_slab)
+		return 0;
 
-static inline int alloc_kmem_cache_cpus(struct kmem_cache *s, gfp_t flags)
-{
-	init_kmem_cache_cpu(s, &s->cpu_slab);
+	for_each_possible_cpu(cpu)
+		init_kmem_cache_cpu(s, per_cpu_ptr(s->cpu_slab, cpu));
 	return 1;
 }
-#endif
 
 #ifdef CONFIG_NUMA
 /*
@@ -2609,9 +2493,8 @@ static inline int kmem_cache_close(struc
 	int node;
 
 	flush_all(s);
-
+	free_percpu(s->cpu_slab);
 	/* Attempt to free all objects */
-	free_kmem_cache_cpus(s);
 	for_each_node_state(node, N_NORMAL_MEMORY) {
 		struct kmem_cache_node *n = get_node(s, node);
 
@@ -2760,7 +2643,19 @@ static noinline struct kmem_cache *dma_k
 	realsize = kmalloc_caches[index].objsize;
 	text = kasprintf(flags & ~SLUB_DMA, "kmalloc_dma-%d",
 			 (unsigned int)realsize);
-	s = kmalloc(kmem_size, flags & ~SLUB_DMA);
+
+	if (flags & __GFP_WAIT)
+		s = kmalloc(kmem_size, flags & ~SLUB_DMA);
+	else {
+		int i;
+
+		s = NULL;
+		for (i = 0; i < SLUB_PAGE_SHIFT; i++)
+			if (kmalloc_caches[i].size) {
+				s = kmalloc_caches + i;
+				break;
+			}
+	}
 
 	/*
 	 * Must defer sysfs creation to a workqueue because we don't know
@@ -3176,8 +3071,6 @@ void __init kmem_cache_init(void)
 	int i;
 	int caches = 0;
 
-	init_alloc_cpu();
-
 #ifdef CONFIG_NUMA
 	/*
 	 * Must first have the slab cache available for the allocations of the
@@ -3261,8 +3154,10 @@ void __init kmem_cache_init(void)
 
 #ifdef CONFIG_SMP
 	register_cpu_notifier(&slab_notifier);
-	kmem_size = offsetof(struct kmem_cache, cpu_slab) +
-				nr_cpu_ids * sizeof(struct kmem_cache_cpu *);
+#endif
+#ifdef CONFIG_NUMA
+	kmem_size = offsetof(struct kmem_cache, node) +
+				nr_node_ids * sizeof(struct kmem_cache_node *);
 #else
 	kmem_size = sizeof(struct kmem_cache);
 #endif
@@ -3365,7 +3260,7 @@ struct kmem_cache *kmem_cache_create(con
 		 * per cpu structures
 		 */
 		for_each_online_cpu(cpu)
-			get_cpu_slab(s, cpu)->objsize = s->objsize;
+			per_cpu_ptr(s->cpu_slab, cpu)->objsize = s->objsize;
 
 		s->inuse = max_t(int, s->inuse, ALIGN(size, sizeof(void *)));
 		up_write(&slub_lock);
@@ -3422,11 +3317,9 @@ static int __cpuinit slab_cpuup_callback
 	switch (action) {
 	case CPU_UP_PREPARE:
 	case CPU_UP_PREPARE_FROZEN:
-		init_alloc_cpu_cpu(cpu);
 		down_read(&slub_lock);
 		list_for_each_entry(s, &slab_caches, list)
-			s->cpu_slab[cpu] = alloc_kmem_cache_cpu(s, cpu,
-							GFP_KERNEL);
+			init_kmem_cache_cpu(s, per_cpu_ptr(s->cpu_slab, cpu));
 		up_read(&slub_lock);
 		break;
 
@@ -3436,13 +3329,9 @@ static int __cpuinit slab_cpuup_callback
 	case CPU_DEAD_FROZEN:
 		down_read(&slub_lock);
 		list_for_each_entry(s, &slab_caches, list) {
-			struct kmem_cache_cpu *c = get_cpu_slab(s, cpu);
-
 			local_irq_save(flags);
 			__flush_cpu_slab(s, cpu);
 			local_irq_restore(flags);
-			free_kmem_cache_cpu(c, cpu);
-			s->cpu_slab[cpu] = NULL;
 		}
 		up_read(&slub_lock);
 		break;
@@ -3928,7 +3817,7 @@ static ssize_t show_slab_objects(struct 
 		int cpu;
 
 		for_each_possible_cpu(cpu) {
-			struct kmem_cache_cpu *c = get_cpu_slab(s, cpu);
+			struct kmem_cache_cpu *c = per_cpu_ptr(s->cpu_slab, cpu);
 
 			if (!c || c->node < 0)
 				continue;
@@ -4353,7 +4242,7 @@ static int show_stat(struct kmem_cache *
 		return -ENOMEM;
 
 	for_each_online_cpu(cpu) {
-		unsigned x = get_cpu_slab(s, cpu)->stat[si];
+		unsigned x = per_cpu_ptr(s->cpu_slab, cpu)->stat[si];
 
 		data[cpu] = x;
 		sum += x;

-- 

^ permalink raw reply	[flat|nested] 65+ messages in thread

* [this_cpu_xx V4 16/20] SLUB: Get rid of dynamic DMA kmalloc cache allocation
  2009-10-01 21:25 [this_cpu_xx V4 00/20] Introduce per cpu atomic operations and avoid per cpu address arithmetic cl
                   ` (14 preceding siblings ...)
  2009-10-01 21:25 ` [this_cpu_xx V4 15/20] Use this_cpu operations in slub cl
@ 2009-10-01 21:25 ` cl
  2009-10-01 21:25 ` [this_cpu_xx V4 17/20] this_cpu: Remove slub kmem_cache fields cl
                   ` (4 subsequent siblings)
  20 siblings, 0 replies; 65+ messages in thread
From: cl @ 2009-10-01 21:25 UTC (permalink / raw)
  To: akpm; +Cc: linux-kernel, Tejun Heo, mingo, rusty, Pekka Enberg

[-- Attachment #1: this_cpu_slub_static_dma_kmalloc --]
[-- Type: text/plain, Size: 3687 bytes --]

Dynamic DMA kmalloc cache allocation is troublesome since the
new percpu allocator does not support allocations in atomic contexts.
Reserve some statically allocated kmalloc_cpu structures instead.

Signed-off-by: Christoph Lameter <cl@linux-foundation.org>

---
 include/linux/slub_def.h |   19 +++++++++++--------
 mm/slub.c                |   24 ++++++++++--------------
 2 files changed, 21 insertions(+), 22 deletions(-)

Index: linux-2.6/include/linux/slub_def.h
===================================================================
--- linux-2.6.orig/include/linux/slub_def.h	2009-09-29 11:42:06.000000000 -0500
+++ linux-2.6/include/linux/slub_def.h	2009-09-29 11:43:18.000000000 -0500
@@ -131,11 +131,21 @@ struct kmem_cache {
 
 #define SLUB_PAGE_SHIFT (PAGE_SHIFT + 2)
 
+#ifdef CONFIG_ZONE_DMA
+#define SLUB_DMA __GFP_DMA
+/* Reserve extra caches for potential DMA use */
+#define KMALLOC_CACHES (2 * SLUB_PAGE_SHIFT - 6)
+#else
+/* Disable DMA functionality */
+#define SLUB_DMA (__force gfp_t)0
+#define KMALLOC_CACHES SLUB_PAGE_SHIFT
+#endif
+
 /*
  * We keep the general caches in an array of slab caches that are used for
  * 2^x bytes of allocations.
  */
-extern struct kmem_cache kmalloc_caches[SLUB_PAGE_SHIFT];
+extern struct kmem_cache kmalloc_caches[KMALLOC_CACHES];
 
 /*
  * Sorry that the following has to be that ugly but some versions of GCC
@@ -203,13 +213,6 @@ static __always_inline struct kmem_cache
 	return &kmalloc_caches[index];
 }
 
-#ifdef CONFIG_ZONE_DMA
-#define SLUB_DMA __GFP_DMA
-#else
-/* Disable DMA functionality */
-#define SLUB_DMA (__force gfp_t)0
-#endif
-
 void *kmem_cache_alloc(struct kmem_cache *, gfp_t);
 void *__kmalloc(size_t size, gfp_t flags);
 
Index: linux-2.6/mm/slub.c
===================================================================
--- linux-2.6.orig/mm/slub.c	2009-09-29 11:42:06.000000000 -0500
+++ linux-2.6/mm/slub.c	2009-09-29 11:43:18.000000000 -0500
@@ -2090,7 +2090,7 @@ static inline int alloc_kmem_cache_cpus(
 {
 	int cpu;
 
-	if (s < kmalloc_caches + SLUB_PAGE_SHIFT && s >= kmalloc_caches)
+	if (s < kmalloc_caches + KMALLOC_CACHES && s >= kmalloc_caches)
 		/*
 		 * Boot time creation of the kmalloc array. Use static per cpu data
 		 * since the per cpu allocator is not available yet.
@@ -2537,7 +2537,7 @@ EXPORT_SYMBOL(kmem_cache_destroy);
  *		Kmalloc subsystem
  *******************************************************************/
 
-struct kmem_cache kmalloc_caches[SLUB_PAGE_SHIFT] __cacheline_aligned;
+struct kmem_cache kmalloc_caches[KMALLOC_CACHES] __cacheline_aligned;
 EXPORT_SYMBOL(kmalloc_caches);
 
 static int __init setup_slub_min_order(char *str)
@@ -2627,6 +2627,7 @@ static noinline struct kmem_cache *dma_k
 	char *text;
 	size_t realsize;
 	unsigned long slabflags;
+	int i;
 
 	s = kmalloc_caches_dma[index];
 	if (s)
@@ -2647,18 +2648,13 @@ static noinline struct kmem_cache *dma_k
 	text = kasprintf(flags & ~SLUB_DMA, "kmalloc_dma-%d",
 			 (unsigned int)realsize);
 
-	if (flags & __GFP_WAIT)
-		s = kmalloc(kmem_size, flags & ~SLUB_DMA);
-	else {
-		int i;
+	s = NULL;
+	for (i = 0; i < KMALLOC_CACHES; i++)
+		if (kmalloc_caches[i].size)
+			break;
 
-		s = NULL;
-		for (i = 0; i < SLUB_PAGE_SHIFT; i++)
-			if (kmalloc_caches[i].size) {
-				s = kmalloc_caches + i;
-				break;
-			}
-	}
+	BUG_ON(i >= KMALLOC_CACHES);
+	s = kmalloc_caches + i;
 
 	/*
 	 * Must defer sysfs creation to a workqueue because we don't know
@@ -2672,7 +2668,7 @@ static noinline struct kmem_cache *dma_k
 
 	if (!s || !text || !kmem_cache_open(s, flags, text,
 			realsize, ARCH_KMALLOC_MINALIGN, slabflags, NULL)) {
-		kfree(s);
+		s->size = 0;
 		kfree(text);
 		goto unlock_out;
 	}

-- 

^ permalink raw reply	[flat|nested] 65+ messages in thread

* [this_cpu_xx V4 17/20] this_cpu: Remove slub kmem_cache fields
  2009-10-01 21:25 [this_cpu_xx V4 00/20] Introduce per cpu atomic operations and avoid per cpu address arithmetic cl
                   ` (15 preceding siblings ...)
  2009-10-01 21:25 ` [this_cpu_xx V4 16/20] SLUB: Get rid of dynamic DMA kmalloc cache allocation cl
@ 2009-10-01 21:25 ` cl
  2009-10-01 21:25 ` [this_cpu_xx V4 18/20] Make slub statistics use this_cpu_inc cl
                   ` (3 subsequent siblings)
  20 siblings, 0 replies; 65+ messages in thread
From: cl @ 2009-10-01 21:25 UTC (permalink / raw)
  To: akpm; +Cc: linux-kernel, Pekka Enberg, Tejun Heo, mingo, rusty

[-- Attachment #1: this_cpu_slub_remove_fields --]
[-- Type: text/plain, Size: 7809 bytes --]

Remove the fields in struct kmem_cache_cpu that were used to cache data from
struct kmem_cache when they were in different cachelines. The cacheline that
holds the per cpu array pointer now also holds these values. We can cut down
the struct kmem_cache_cpu size to almost half.

The get_freepointer() and set_freepointer() functions that used to be only
intended for the slow path now are also useful for the hot path since access
to the size field does not require accessing an additional cacheline anymore.
This results in consistent use of functions for setting the freepointer of
objects throughout SLUB.

Also we initialize all possible kmem_cache_cpu structures when a slab is
created. No need to initialize them when a processor or node comes online.

Cc: Pekka Enberg <penberg@cs.helsinki.fi>
Signed-off-by: Christoph Lameter <cl@linux-foundation.org>
---
 include/linux/slub_def.h |    2 -
 mm/slub.c                |   81 +++++++++++++----------------------------------
 2 files changed, 24 insertions(+), 59 deletions(-)

Index: linux-2.6/include/linux/slub_def.h
===================================================================
--- linux-2.6.orig/include/linux/slub_def.h	2009-09-29 11:44:03.000000000 -0500
+++ linux-2.6/include/linux/slub_def.h	2009-09-29 11:44:35.000000000 -0500
@@ -38,8 +38,6 @@ struct kmem_cache_cpu {
 	void **freelist;	/* Pointer to first free per cpu object */
 	struct page *page;	/* The slab from which we are allocating */
 	int node;		/* The node of the page (or -1 for debug) */
-	unsigned int offset;	/* Freepointer offset (in word units) */
-	unsigned int objsize;	/* Size of an object (from kmem_cache) */
 #ifdef CONFIG_SLUB_STATS
 	unsigned stat[NR_SLUB_STAT_ITEMS];
 #endif
Index: linux-2.6/mm/slub.c
===================================================================
--- linux-2.6.orig/mm/slub.c	2009-09-29 11:44:03.000000000 -0500
+++ linux-2.6/mm/slub.c	2009-09-29 11:44:35.000000000 -0500
@@ -260,13 +260,6 @@ static inline int check_valid_pointer(st
 	return 1;
 }
 
-/*
- * Slow version of get and set free pointer.
- *
- * This version requires touching the cache lines of kmem_cache which
- * we avoid to do in the fast alloc free paths. There we obtain the offset
- * from the page struct.
- */
 static inline void *get_freepointer(struct kmem_cache *s, void *object)
 {
 	return *(void **)(object + s->offset);
@@ -1473,10 +1466,10 @@ static void deactivate_slab(struct kmem_
 
 		/* Retrieve object from cpu_freelist */
 		object = c->freelist;
-		c->freelist = c->freelist[c->offset];
+		c->freelist = get_freepointer(s, c->freelist);
 
 		/* And put onto the regular freelist */
-		object[c->offset] = page->freelist;
+		set_freepointer(s, object, page->freelist);
 		page->freelist = object;
 		page->inuse--;
 	}
@@ -1635,7 +1628,7 @@ load_freelist:
 	if (unlikely(SLABDEBUG && PageSlubDebug(c->page)))
 		goto debug;
 
-	c->freelist = object[c->offset];
+	c->freelist = get_freepointer(s, object);
 	c->page->inuse = c->page->objects;
 	c->page->freelist = NULL;
 	c->node = page_to_nid(c->page);
@@ -1681,7 +1674,7 @@ debug:
 		goto another_slab;
 
 	c->page->inuse++;
-	c->page->freelist = object[c->offset];
+	c->page->freelist = get_freepointer(s, object);
 	c->node = -1;
 	goto unlock_out;
 }
@@ -1719,7 +1712,7 @@ static __always_inline void *slab_alloc(
 		object = __slab_alloc(s, gfpflags, node, addr, c);
 
 	else {
-		c->freelist = object[c->offset];
+		c->freelist = get_freepointer(s, object);
 		stat(c, ALLOC_FASTPATH);
 	}
 	local_irq_restore(flags);
@@ -1727,8 +1720,8 @@ static __always_inline void *slab_alloc(
 	if (unlikely((gfpflags & __GFP_ZERO) && object))
 		memset(object, 0, s->objsize);
 
-	kmemcheck_slab_alloc(s, gfpflags, object, c->objsize);
-	kmemleak_alloc_recursive(object, c->objsize, 1, s->flags, gfpflags);
+	kmemcheck_slab_alloc(s, gfpflags, object, s->objsize);
+	kmemleak_alloc_recursive(object, s->objsize, 1, s->flags, gfpflags);
 
 	return object;
 }
@@ -1783,7 +1776,7 @@ EXPORT_SYMBOL(kmem_cache_alloc_node_notr
  * handling required then we can return immediately.
  */
 static void __slab_free(struct kmem_cache *s, struct page *page,
-			void *x, unsigned long addr, unsigned int offset)
+			void *x, unsigned long addr)
 {
 	void *prior;
 	void **object = (void *)x;
@@ -1797,7 +1790,8 @@ static void __slab_free(struct kmem_cach
 		goto debug;
 
 checks_ok:
-	prior = object[offset] = page->freelist;
+	prior = page->freelist;
+	set_freepointer(s, object, prior);
 	page->freelist = object;
 	page->inuse--;
 
@@ -1862,16 +1856,16 @@ static __always_inline void slab_free(st
 	kmemleak_free_recursive(x, s->flags);
 	local_irq_save(flags);
 	c = __this_cpu_ptr(s->cpu_slab);
-	kmemcheck_slab_free(s, object, c->objsize);
-	debug_check_no_locks_freed(object, c->objsize);
+	kmemcheck_slab_free(s, object, s->objsize);
+	debug_check_no_locks_freed(object, s->objsize);
 	if (!(s->flags & SLAB_DEBUG_OBJECTS))
-		debug_check_no_obj_freed(object, c->objsize);
+		debug_check_no_obj_freed(object, s->objsize);
 	if (likely(page == c->page && c->node >= 0)) {
-		object[c->offset] = c->freelist;
+		set_freepointer(s, object, c->freelist);
 		c->freelist = object;
 		stat(c, FREE_FASTPATH);
 	} else
-		__slab_free(s, page, x, addr, c->offset);
+		__slab_free(s, page, x, addr);
 
 	local_irq_restore(flags);
 }
@@ -2058,19 +2052,6 @@ static unsigned long calculate_alignment
 	return ALIGN(align, sizeof(void *));
 }
 
-static void init_kmem_cache_cpu(struct kmem_cache *s,
-			struct kmem_cache_cpu *c)
-{
-	c->page = NULL;
-	c->freelist = NULL;
-	c->node = 0;
-	c->offset = s->offset / sizeof(void *);
-	c->objsize = s->objsize;
-#ifdef CONFIG_SLUB_STATS
-	memset(c->stat, 0, NR_SLUB_STAT_ITEMS * sizeof(unsigned));
-#endif
-}
-
 static void
 init_kmem_cache_node(struct kmem_cache_node *n, struct kmem_cache *s)
 {
@@ -2088,8 +2069,6 @@ static DEFINE_PER_CPU(struct kmem_cache_
 
 static inline int alloc_kmem_cache_cpus(struct kmem_cache *s, gfp_t flags)
 {
-	int cpu;
-
 	if (s < kmalloc_caches + KMALLOC_CACHES && s >= kmalloc_caches)
 		/*
 		 * Boot time creation of the kmalloc array. Use static per cpu data
@@ -2102,8 +2081,6 @@ static inline int alloc_kmem_cache_cpus(
 	if (!s->cpu_slab)
 		return 0;
 
-	for_each_possible_cpu(cpu)
-		init_kmem_cache_cpu(s, per_cpu_ptr(s->cpu_slab, cpu));
 	return 1;
 }
 
@@ -2387,8 +2364,16 @@ static int kmem_cache_open(struct kmem_c
 	if (!init_kmem_cache_nodes(s, gfpflags & ~SLUB_DMA))
 		goto error;
 
-	if (alloc_kmem_cache_cpus(s, gfpflags & ~SLUB_DMA))
+	if (!alloc_kmem_cache_cpus(s, gfpflags & ~SLUB_DMA))
+		return 0;
+
+	/*
+	 * gfp_flags would be flags & ~SLUB_DMA but the per cpu
+	 * allocator does not support it.
+	 */
+	if (s->cpu_slab)
 		return 1;
+
 	free_kmem_cache_nodes(s);
 error:
 	if (flags & SLAB_PANIC)
@@ -3245,22 +3230,12 @@ struct kmem_cache *kmem_cache_create(con
 	down_write(&slub_lock);
 	s = find_mergeable(size, align, flags, name, ctor);
 	if (s) {
-		int cpu;
-
 		s->refcount++;
 		/*
 		 * Adjust the object sizes so that we clear
 		 * the complete object on kzalloc.
 		 */
 		s->objsize = max(s->objsize, (int)size);
-
-		/*
-		 * And then we need to update the object size in the
-		 * per cpu structures
-		 */
-		for_each_online_cpu(cpu)
-			per_cpu_ptr(s->cpu_slab, cpu)->objsize = s->objsize;
-
 		s->inuse = max_t(int, s->inuse, ALIGN(size, sizeof(void *)));
 		up_write(&slub_lock);
 
@@ -3314,14 +3289,6 @@ static int __cpuinit slab_cpuup_callback
 	unsigned long flags;
 
 	switch (action) {
-	case CPU_UP_PREPARE:
-	case CPU_UP_PREPARE_FROZEN:
-		down_read(&slub_lock);
-		list_for_each_entry(s, &slab_caches, list)
-			init_kmem_cache_cpu(s, per_cpu_ptr(s->cpu_slab, cpu));
-		up_read(&slub_lock);
-		break;
-
 	case CPU_UP_CANCELED:
 	case CPU_UP_CANCELED_FROZEN:
 	case CPU_DEAD:

-- 

^ permalink raw reply	[flat|nested] 65+ messages in thread

* [this_cpu_xx V4 18/20] Make slub statistics use this_cpu_inc
  2009-10-01 21:25 [this_cpu_xx V4 00/20] Introduce per cpu atomic operations and avoid per cpu address arithmetic cl
                   ` (16 preceding siblings ...)
  2009-10-01 21:25 ` [this_cpu_xx V4 17/20] this_cpu: Remove slub kmem_cache fields cl
@ 2009-10-01 21:25 ` cl
  2009-10-01 21:25 ` [this_cpu_xx V4 19/20] this_cpu: slub aggressive use of this_cpu operations in the hotpaths cl
                   ` (2 subsequent siblings)
  20 siblings, 0 replies; 65+ messages in thread
From: cl @ 2009-10-01 21:25 UTC (permalink / raw)
  To: akpm; +Cc: linux-kernel, Pekka Enberg, Tejun Heo, mingo, rusty

[-- Attachment #1: this_cpu_slub_cleanup_stat --]
[-- Type: text/plain, Size: 5136 bytes --]

this_cpu_inc() translates into a single instruction on x86 and does not
need any register. So use it in stat(). We also want to avoid the
calculation of the per cpu kmem_cache_cpu structure pointer. So pass
a kmem_cache pointer instead of a kmem_cache_cpu pointer.

Cc: Pekka Enberg <penberg@cs.helsinki.fi>
Signed-off-by: Christoph Lameter <cl@linux-foundation.org?

---
 mm/slub.c |   43 ++++++++++++++++++++-----------------------
 1 file changed, 20 insertions(+), 23 deletions(-)

Index: linux-2.6/mm/slub.c
===================================================================
--- linux-2.6.orig/mm/slub.c	2009-09-29 11:44:35.000000000 -0500
+++ linux-2.6/mm/slub.c	2009-09-29 11:44:49.000000000 -0500
@@ -217,10 +217,10 @@ static inline void sysfs_slab_remove(str
 
 #endif
 
-static inline void stat(struct kmem_cache_cpu *c, enum stat_item si)
+static inline void stat(struct kmem_cache *s, enum stat_item si)
 {
 #ifdef CONFIG_SLUB_STATS
-	c->stat[si]++;
+	__this_cpu_inc(s->cpu_slab->stat[si]);
 #endif
 }
 
@@ -1108,7 +1108,7 @@ static struct page *allocate_slab(struct
 		if (!page)
 			return NULL;
 
-		stat(this_cpu_ptr(s->cpu_slab), ORDER_FALLBACK);
+		stat(s, ORDER_FALLBACK);
 	}
 
 	if (kmemcheck_enabled
@@ -1406,23 +1406,22 @@ static struct page *get_partial(struct k
 static void unfreeze_slab(struct kmem_cache *s, struct page *page, int tail)
 {
 	struct kmem_cache_node *n = get_node(s, page_to_nid(page));
-	struct kmem_cache_cpu *c = this_cpu_ptr(s->cpu_slab);
 
 	__ClearPageSlubFrozen(page);
 	if (page->inuse) {
 
 		if (page->freelist) {
 			add_partial(n, page, tail);
-			stat(c, tail ? DEACTIVATE_TO_TAIL : DEACTIVATE_TO_HEAD);
+			stat(s, tail ? DEACTIVATE_TO_TAIL : DEACTIVATE_TO_HEAD);
 		} else {
-			stat(c, DEACTIVATE_FULL);
+			stat(s, DEACTIVATE_FULL);
 			if (SLABDEBUG && PageSlubDebug(page) &&
 						(s->flags & SLAB_STORE_USER))
 				add_full(n, page);
 		}
 		slab_unlock(page);
 	} else {
-		stat(c, DEACTIVATE_EMPTY);
+		stat(s, DEACTIVATE_EMPTY);
 		if (n->nr_partial < s->min_partial) {
 			/*
 			 * Adding an empty slab to the partial slabs in order
@@ -1438,7 +1437,7 @@ static void unfreeze_slab(struct kmem_ca
 			slab_unlock(page);
 		} else {
 			slab_unlock(page);
-			stat(__this_cpu_ptr(s->cpu_slab), FREE_SLAB);
+			stat(s, FREE_SLAB);
 			discard_slab(s, page);
 		}
 	}
@@ -1453,7 +1452,7 @@ static void deactivate_slab(struct kmem_
 	int tail = 1;
 
 	if (page->freelist)
-		stat(c, DEACTIVATE_REMOTE_FREES);
+		stat(s, DEACTIVATE_REMOTE_FREES);
 	/*
 	 * Merge cpu freelist into slab freelist. Typically we get here
 	 * because both freelists are empty. So this is unlikely
@@ -1479,7 +1478,7 @@ static void deactivate_slab(struct kmem_
 
 static inline void flush_slab(struct kmem_cache *s, struct kmem_cache_cpu *c)
 {
-	stat(c, CPUSLAB_FLUSH);
+	stat(s, CPUSLAB_FLUSH);
 	slab_lock(c->page);
 	deactivate_slab(s, c);
 }
@@ -1619,7 +1618,7 @@ static void *__slab_alloc(struct kmem_ca
 	if (unlikely(!node_match(c, node)))
 		goto another_slab;
 
-	stat(c, ALLOC_REFILL);
+	stat(s, ALLOC_REFILL);
 
 load_freelist:
 	object = c->page->freelist;
@@ -1634,7 +1633,7 @@ load_freelist:
 	c->node = page_to_nid(c->page);
 unlock_out:
 	slab_unlock(c->page);
-	stat(c, ALLOC_SLOWPATH);
+	stat(s, ALLOC_SLOWPATH);
 	return object;
 
 another_slab:
@@ -1644,7 +1643,7 @@ new_slab:
 	new = get_partial(s, gfpflags, node);
 	if (new) {
 		c->page = new;
-		stat(c, ALLOC_FROM_PARTIAL);
+		stat(s, ALLOC_FROM_PARTIAL);
 		goto load_freelist;
 	}
 
@@ -1658,7 +1657,7 @@ new_slab:
 
 	if (new) {
 		c = __this_cpu_ptr(s->cpu_slab);
-		stat(c, ALLOC_SLAB);
+		stat(s, ALLOC_SLAB);
 		if (c->page)
 			flush_slab(s, c);
 		slab_lock(new);
@@ -1713,7 +1712,7 @@ static __always_inline void *slab_alloc(
 
 	else {
 		c->freelist = get_freepointer(s, object);
-		stat(c, ALLOC_FASTPATH);
+		stat(s, ALLOC_FASTPATH);
 	}
 	local_irq_restore(flags);
 
@@ -1780,10 +1779,8 @@ static void __slab_free(struct kmem_cach
 {
 	void *prior;
 	void **object = (void *)x;
-	struct kmem_cache_cpu *c;
 
-	c = __this_cpu_ptr(s->cpu_slab);
-	stat(c, FREE_SLOWPATH);
+	stat(s, FREE_SLOWPATH);
 	slab_lock(page);
 
 	if (unlikely(SLABDEBUG && PageSlubDebug(page)))
@@ -1796,7 +1793,7 @@ checks_ok:
 	page->inuse--;
 
 	if (unlikely(PageSlubFrozen(page))) {
-		stat(c, FREE_FROZEN);
+		stat(s, FREE_FROZEN);
 		goto out_unlock;
 	}
 
@@ -1809,7 +1806,7 @@ checks_ok:
 	 */
 	if (unlikely(!prior)) {
 		add_partial(get_node(s, page_to_nid(page)), page, 1);
-		stat(c, FREE_ADD_PARTIAL);
+		stat(s, FREE_ADD_PARTIAL);
 	}
 
 out_unlock:
@@ -1822,10 +1819,10 @@ slab_empty:
 		 * Slab still on the partial list.
 		 */
 		remove_partial(s, page);
-		stat(c, FREE_REMOVE_PARTIAL);
+		stat(s, FREE_REMOVE_PARTIAL);
 	}
 	slab_unlock(page);
-	stat(c, FREE_SLAB);
+	stat(s, FREE_SLAB);
 	discard_slab(s, page);
 	return;
 
@@ -1863,7 +1860,7 @@ static __always_inline void slab_free(st
 	if (likely(page == c->page && c->node >= 0)) {
 		set_freepointer(s, object, c->freelist);
 		c->freelist = object;
-		stat(c, FREE_FASTPATH);
+		stat(s, FREE_FASTPATH);
 	} else
 		__slab_free(s, page, x, addr);
 

-- 

^ permalink raw reply	[flat|nested] 65+ messages in thread

* [this_cpu_xx V4 19/20] this_cpu: slub aggressive use of this_cpu operations in the hotpaths
  2009-10-01 21:25 [this_cpu_xx V4 00/20] Introduce per cpu atomic operations and avoid per cpu address arithmetic cl
                   ` (17 preceding siblings ...)
  2009-10-01 21:25 ` [this_cpu_xx V4 18/20] Make slub statistics use this_cpu_inc cl
@ 2009-10-01 21:25 ` cl
  2009-10-01 21:25 ` [this_cpu_xx V4 20/20] SLUB: Experimental new fastpath w/o interrupt disable cl
  2009-10-02  9:30 ` [this_cpu_xx V4 00/20] Introduce per cpu atomic operations and avoid per cpu address arithmetic Tejun Heo
  20 siblings, 0 replies; 65+ messages in thread
From: cl @ 2009-10-01 21:25 UTC (permalink / raw)
  To: akpm; +Cc: linux-kernel, Mathieu Desnoyers, Pekka Enberg, Tejun Heo, mingo,
	rusty

[-- Attachment #1: this_cpu_slub_aggressive_cpu_ops --]
[-- Type: text/plain, Size: 5869 bytes --]

Use this_cpu_* operations in the hotpath to avoid calculations of
kmem_cache_cpu pointer addresses.

On x86 there is a tradeof: Multiple uses segment prefixes against an
address calculation and more register pressure. Code size is reduced
therefore it is an advantage.

The use of prefixes is necessary if we want to use
Mathieus' scheme for fastpaths that do not require disabling
interrupts.

Cc: Mathieu Desnoyers <mathieu.desnoyers@polymtl.ca>
Cc: Pekka Enberg <penberg@cs.helsinki.fi>
Signed-off-by: Christoph Lameter <cl@linux-foundation.org>

---
 mm/slub.c |   80 ++++++++++++++++++++++++++++++--------------------------------
 1 file changed, 39 insertions(+), 41 deletions(-)

Index: linux-2.6/mm/slub.c
===================================================================
--- linux-2.6.orig/mm/slub.c	2009-09-30 15:58:20.000000000 -0500
+++ linux-2.6/mm/slub.c	2009-09-30 16:24:45.000000000 -0500
@@ -1512,10 +1512,10 @@ static void flush_all(struct kmem_cache 
  * Check if the objects in a per cpu structure fit numa
  * locality expectations.
  */
-static inline int node_match(struct kmem_cache_cpu *c, int node)
+static inline int node_match(struct kmem_cache *s, int node)
 {
 #ifdef CONFIG_NUMA
-	if (node != -1 && c->node != node)
+	if (node != -1 && __this_cpu_read(s->cpu_slab->node) != node)
 		return 0;
 #endif
 	return 1;
@@ -1603,46 +1603,46 @@ slab_out_of_memory(struct kmem_cache *s,
  * a call to the page allocator and the setup of a new slab.
  */
 static void *__slab_alloc(struct kmem_cache *s, gfp_t gfpflags, int node,
-			  unsigned long addr, struct kmem_cache_cpu *c)
+			  unsigned long addr)
 {
 	void **object;
-	struct page *new;
+	struct page *page = __this_cpu_read(s->cpu_slab->page);
 
 	/* We handle __GFP_ZERO in the caller */
 	gfpflags &= ~__GFP_ZERO;
 
-	if (!c->page)
+	if (!page)
 		goto new_slab;
 
-	slab_lock(c->page);
-	if (unlikely(!node_match(c, node)))
+	slab_lock(page);
+	if (unlikely(!node_match(s, node)))
 		goto another_slab;
 
 	stat(s, ALLOC_REFILL);
 
 load_freelist:
-	object = c->page->freelist;
+	object = page->freelist;
 	if (unlikely(!object))
 		goto another_slab;
-	if (unlikely(SLABDEBUG && PageSlubDebug(c->page)))
+	if (unlikely(SLABDEBUG && PageSlubDebug(page)))
 		goto debug;
 
-	c->freelist = get_freepointer(s, object);
-	c->page->inuse = c->page->objects;
-	c->page->freelist = NULL;
-	c->node = page_to_nid(c->page);
+	__this_cpu_write(s->cpu_slab->freelist, get_freepointer(s, object));
+	page->inuse = page->objects;
+	page->freelist = NULL;
+	__this_cpu_write(s->cpu_slab->node, page_to_nid(page));
 unlock_out:
-	slab_unlock(c->page);
+	slab_unlock(page);
 	stat(s, ALLOC_SLOWPATH);
 	return object;
 
 another_slab:
-	deactivate_slab(s, c);
+	deactivate_slab(s, __this_cpu_ptr(s->cpu_slab));
 
 new_slab:
-	new = get_partial(s, gfpflags, node);
-	if (new) {
-		c->page = new;
+	page = get_partial(s, gfpflags, node);
+	if (page) {
+		__this_cpu_write(s->cpu_slab->page, page);
 		stat(s, ALLOC_FROM_PARTIAL);
 		goto load_freelist;
 	}
@@ -1650,31 +1650,30 @@ new_slab:
 	if (gfpflags & __GFP_WAIT)
 		local_irq_enable();
 
-	new = new_slab(s, gfpflags, node);
+	page = new_slab(s, gfpflags, node);
 
 	if (gfpflags & __GFP_WAIT)
 		local_irq_disable();
 
-	if (new) {
-		c = __this_cpu_ptr(s->cpu_slab);
+	if (page) {
 		stat(s, ALLOC_SLAB);
-		if (c->page)
-			flush_slab(s, c);
-		slab_lock(new);
-		__SetPageSlubFrozen(new);
-		c->page = new;
+		if (__this_cpu_read(s->cpu_slab->page))
+			flush_slab(s, __this_cpu_ptr(s->cpu_slab));
+		slab_lock(page);
+		__SetPageSlubFrozen(page);
+		__this_cpu_write(s->cpu_slab->page, page);
 		goto load_freelist;
 	}
 	if (!(gfpflags & __GFP_NOWARN) && printk_ratelimit())
 		slab_out_of_memory(s, gfpflags, node);
 	return NULL;
 debug:
-	if (!alloc_debug_processing(s, c->page, object, addr))
+	if (!alloc_debug_processing(s, page, object, addr))
 		goto another_slab;
 
-	c->page->inuse++;
-	c->page->freelist = get_freepointer(s, object);
-	c->node = -1;
+	page->inuse++;
+	page->freelist = get_freepointer(s, object);
+	__this_cpu_write(s->cpu_slab->node, -1);
 	goto unlock_out;
 }
 
@@ -1692,7 +1691,6 @@ static __always_inline void *slab_alloc(
 		gfp_t gfpflags, int node, unsigned long addr)
 {
 	void **object;
-	struct kmem_cache_cpu *c;
 	unsigned long flags;
 
 	gfpflags &= gfp_allowed_mask;
@@ -1704,14 +1702,14 @@ static __always_inline void *slab_alloc(
 		return NULL;
 
 	local_irq_save(flags);
-	c = __this_cpu_ptr(s->cpu_slab);
-	object = c->freelist;
-	if (unlikely(!object || !node_match(c, node)))
+	object = __this_cpu_read(s->cpu_slab->freelist);
+	if (unlikely(!object || !node_match(s, node)))
 
-		object = __slab_alloc(s, gfpflags, node, addr, c);
+		object = __slab_alloc(s, gfpflags, node, addr);
 
 	else {
-		c->freelist = get_freepointer(s, object);
+		__this_cpu_write(s->cpu_slab->freelist,
+			get_freepointer(s, object));
 		stat(s, ALLOC_FASTPATH);
 	}
 	local_irq_restore(flags);
@@ -1847,19 +1845,19 @@ static __always_inline void slab_free(st
 			struct page *page, void *x, unsigned long addr)
 {
 	void **object = (void *)x;
-	struct kmem_cache_cpu *c;
 	unsigned long flags;
 
 	kmemleak_free_recursive(x, s->flags);
 	local_irq_save(flags);
-	c = __this_cpu_ptr(s->cpu_slab);
 	kmemcheck_slab_free(s, object, s->objsize);
 	debug_check_no_locks_freed(object, s->objsize);
 	if (!(s->flags & SLAB_DEBUG_OBJECTS))
 		debug_check_no_obj_freed(object, s->objsize);
-	if (likely(page == c->page && c->node >= 0)) {
-		set_freepointer(s, object, c->freelist);
-		c->freelist = object;
+
+	if (likely(page == __this_cpu_read(s->cpu_slab->page) &&
+			__this_cpu_read(s->cpu_slab->node) >= 0)) {
+		set_freepointer(s, object, __this_cpu_read(s->cpu_slab->freelist));
+		__this_cpu_write(s->cpu_slab->freelist, object);
 		stat(s, FREE_FASTPATH);
 	} else
 		__slab_free(s, page, x, addr);

-- 

^ permalink raw reply	[flat|nested] 65+ messages in thread

* [this_cpu_xx V4 20/20] SLUB: Experimental new fastpath w/o interrupt disable
  2009-10-01 21:25 [this_cpu_xx V4 00/20] Introduce per cpu atomic operations and avoid per cpu address arithmetic cl
                   ` (18 preceding siblings ...)
  2009-10-01 21:25 ` [this_cpu_xx V4 19/20] this_cpu: slub aggressive use of this_cpu operations in the hotpaths cl
@ 2009-10-01 21:25 ` cl
  2009-10-02  9:30 ` [this_cpu_xx V4 00/20] Introduce per cpu atomic operations and avoid per cpu address arithmetic Tejun Heo
  20 siblings, 0 replies; 65+ messages in thread
From: cl @ 2009-10-01 21:25 UTC (permalink / raw)
  To: akpm; +Cc: linux-kernel, Mathieu Desnoyers, Pekka Enberg, Tejun Heo, mingo,
	rusty

[-- Attachment #1: this_cpu_slub_irqless --]
[-- Type: text/plain, Size: 7689 bytes --]

This is a bit of a different tack on things than the last version provided
by Mathieu.

Instead of using a cmpxchg we keep a state variable in the per cpu structure
that is incremented when we enter the hot path. We can then detect that
a thread is in the fastpath. For recursive calling scenarios we can fallback
to alternate allocation / free techniques that bypass fastpath caching.

A disadvantage is that we have to disable preempt. But if preemt is disabled
(like on most kernels that I run) then the hotpath becomes very efficient.

WARNING: Very experimental

It would be good to compare against an update of Mathieu's latest which
implemented pointer versioning to avoid even disabling preemption.

Cc: Mathieu Desnoyers <mathieu.desnoyers@polymtl.ca>
Cc: Pekka Enberg <penberg@cs.helsinki.fi>
Signed-off-by: Christoph Lameter <cl@linux-foundation.org>


---
 include/linux/slub_def.h |    1 
 mm/slub.c                |   91 +++++++++++++++++++++++++++++++++++++----------
 2 files changed, 74 insertions(+), 18 deletions(-)

Index: linux-2.6/include/linux/slub_def.h
===================================================================
--- linux-2.6.orig/include/linux/slub_def.h	2009-10-01 15:53:15.000000000 -0500
+++ linux-2.6/include/linux/slub_def.h	2009-10-01 15:53:15.000000000 -0500
@@ -38,6 +38,7 @@ struct kmem_cache_cpu {
 	void **freelist;	/* Pointer to first free per cpu object */
 	struct page *page;	/* The slab from which we are allocating */
 	int node;		/* The node of the page (or -1 for debug) */
+	int active;		/* Active fastpaths */
 #ifdef CONFIG_SLUB_STATS
 	unsigned stat[NR_SLUB_STAT_ITEMS];
 #endif
Index: linux-2.6/mm/slub.c
===================================================================
--- linux-2.6.orig/mm/slub.c	2009-10-01 15:53:15.000000000 -0500
+++ linux-2.6/mm/slub.c	2009-10-01 15:53:15.000000000 -0500
@@ -1606,7 +1606,14 @@ static void *__slab_alloc(struct kmem_ca
 			  unsigned long addr)
 {
 	void **object;
-	struct page *page = __this_cpu_read(s->cpu_slab->page);
+	struct page *page;
+	unsigned long flags;
+	int hotpath;
+
+	local_irq_save(flags);
+	preempt_enable();	/* Get rid of count */
+	hotpath = __this_cpu_read(s->cpu_slab->active) != 0;
+	page = __this_cpu_read(s->cpu_slab->page);
 
 	/* We handle __GFP_ZERO in the caller */
 	gfpflags &= ~__GFP_ZERO;
@@ -1626,13 +1633,21 @@ load_freelist:
 		goto another_slab;
 	if (unlikely(SLABDEBUG && PageSlubDebug(page)))
 		goto debug;
-
-	__this_cpu_write(s->cpu_slab->freelist, get_freepointer(s, object));
-	page->inuse = page->objects;
-	page->freelist = NULL;
-	__this_cpu_write(s->cpu_slab->node, page_to_nid(page));
+	if (unlikely(hotpath)) {
+		/* Object on second free list available and hotpath busy */
+		page->inuse++;
+		page->freelist = get_freepointer(s, object);
+	} else {
+		/* Prepare new list of objects for hotpath */
+		__this_cpu_write(s->cpu_slab->freelist, get_freepointer(s, object));
+		page->inuse = page->objects;
+		page->freelist = NULL;
+		__this_cpu_write(s->cpu_slab->node, page_to_nid(page));
+	}
 unlock_out:
+	__this_cpu_dec(s->cpu_slab->active);
 	slab_unlock(page);
+	local_irq_restore(flags);
 	stat(s, ALLOC_SLOWPATH);
 	return object;
 
@@ -1642,8 +1657,12 @@ another_slab:
 new_slab:
 	page = get_partial(s, gfpflags, node);
 	if (page) {
-		__this_cpu_write(s->cpu_slab->page, page);
 		stat(s, ALLOC_FROM_PARTIAL);
+
+		if (hotpath)
+			goto hot_lock;
+
+		__this_cpu_write(s->cpu_slab->page, page);
 		goto load_freelist;
 	}
 
@@ -1657,6 +1676,10 @@ new_slab:
 
 	if (page) {
 		stat(s, ALLOC_SLAB);
+
+		if (hotpath)
+			 goto hot_no_lock;
+
 		if (__this_cpu_read(s->cpu_slab->page))
 			flush_slab(s, __this_cpu_ptr(s->cpu_slab));
 		slab_lock(page);
@@ -1664,6 +1687,10 @@ new_slab:
 		__this_cpu_write(s->cpu_slab->page, page);
 		goto load_freelist;
 	}
+
+	__this_cpu_dec(s->cpu_slab->active);
+	local_irq_restore(flags);
+
 	if (!(gfpflags & __GFP_NOWARN) && printk_ratelimit())
 		slab_out_of_memory(s, gfpflags, node);
 	return NULL;
@@ -1675,6 +1702,19 @@ debug:
 	page->freelist = get_freepointer(s, object);
 	__this_cpu_write(s->cpu_slab->node, -1);
 	goto unlock_out;
+
+	/*
+	 * Hotpath is busy and we need to avoid touching
+	 * hotpath variables
+	 */
+hot_no_lock:
+	slab_lock(page);
+hot_lock:
+	__ClearPageSlubFrozen(page);
+	if (get_freepointer(s, page->freelist))
+		/* Cannot put page into the hotpath. Instead back to partial */
+		add_partial(get_node(s, page_to_nid(page)), page, 0);
+	goto load_freelist;
 }
 
 /*
@@ -1691,7 +1731,6 @@ static __always_inline void *slab_alloc(
 		gfp_t gfpflags, int node, unsigned long addr)
 {
 	void **object;
-	unsigned long flags;
 
 	gfpflags &= gfp_allowed_mask;
 
@@ -1701,19 +1740,21 @@ static __always_inline void *slab_alloc(
 	if (should_failslab(s->objsize, gfpflags))
 		return NULL;
 
-	local_irq_save(flags);
+	preempt_disable();
+	irqsafe_cpu_inc(s->cpu_slab->active);
 	object = __this_cpu_read(s->cpu_slab->freelist);
-	if (unlikely(!object || !node_match(s, node)))
+	if (unlikely(!object || !node_match(s, node) ||
+			__this_cpu_read(s->cpu_slab->active)))
 
 		object = __slab_alloc(s, gfpflags, node, addr);
 
 	else {
 		__this_cpu_write(s->cpu_slab->freelist,
 			get_freepointer(s, object));
+		irqsafe_cpu_dec(s->cpu_slab->active);
+		preempt_enable();
 		stat(s, ALLOC_FASTPATH);
 	}
-	local_irq_restore(flags);
-
 	if (unlikely((gfpflags & __GFP_ZERO) && object))
 		memset(object, 0, s->objsize);
 
@@ -1777,6 +1818,11 @@ static void __slab_free(struct kmem_cach
 {
 	void *prior;
 	void **object = (void *)x;
+	unsigned long flags;
+
+	local_irq_save(flags);
+	preempt_enable();	/* Fix up count */
+	__this_cpu_dec(s->cpu_slab->active);
 
 	stat(s, FREE_SLOWPATH);
 	slab_lock(page);
@@ -1809,6 +1855,7 @@ checks_ok:
 
 out_unlock:
 	slab_unlock(page);
+	local_irq_restore(flags);
 	return;
 
 slab_empty:
@@ -1820,6 +1867,7 @@ slab_empty:
 		stat(s, FREE_REMOVE_PARTIAL);
 	}
 	slab_unlock(page);
+	local_irq_restore(flags);
 	stat(s, FREE_SLAB);
 	discard_slab(s, page);
 	return;
@@ -1845,24 +1893,26 @@ static __always_inline void slab_free(st
 			struct page *page, void *x, unsigned long addr)
 {
 	void **object = (void *)x;
-	unsigned long flags;
 
 	kmemleak_free_recursive(x, s->flags);
-	local_irq_save(flags);
 	kmemcheck_slab_free(s, object, s->objsize);
 	debug_check_no_locks_freed(object, s->objsize);
 	if (!(s->flags & SLAB_DEBUG_OBJECTS))
 		debug_check_no_obj_freed(object, s->objsize);
 
+	preempt_disable();
+	irqsafe_cpu_inc(s->cpu_slab->active);
 	if (likely(page == __this_cpu_read(s->cpu_slab->page) &&
-			__this_cpu_read(s->cpu_slab->node) >= 0)) {
-		set_freepointer(s, object, __this_cpu_read(s->cpu_slab->freelist));
+			__this_cpu_read(s->cpu_slab->node) >= 0) &&
+			!__this_cpu_read(s->cpu_slab->active)) {
+		set_freepointer(s, object,
+			__this_cpu_read(s->cpu_slab->freelist));
 		__this_cpu_write(s->cpu_slab->freelist, object);
+		irqsafe_cpu_dec(s->cpu_slab->active);
+		preempt_enable();
 		stat(s, FREE_FASTPATH);
 	} else
 		__slab_free(s, page, x, addr);
-
-	local_irq_restore(flags);
 }
 
 void kmem_cache_free(struct kmem_cache *s, void *x)
@@ -2064,6 +2114,8 @@ static DEFINE_PER_CPU(struct kmem_cache_
 
 static inline int alloc_kmem_cache_cpus(struct kmem_cache *s, gfp_t flags)
 {
+	int cpu;
+
 	if (s < kmalloc_caches + KMALLOC_CACHES && s >= kmalloc_caches)
 		/*
 		 * Boot time creation of the kmalloc array. Use static per cpu data
@@ -2073,6 +2125,9 @@ static inline int alloc_kmem_cache_cpus(
 	else
 		s->cpu_slab =  alloc_percpu(struct kmem_cache_cpu);
 
+	for_each_possible_cpu(cpu)
+		per_cpu_ptr(s->cpu_slab, cpu)->active = -1;
+
 	if (!s->cpu_slab)
 		return 0;
 

-- 

^ permalink raw reply	[flat|nested] 65+ messages in thread

* Re: [this_cpu_xx V4 00/20] Introduce per cpu atomic operations and avoid per cpu address arithmetic
  2009-10-01 21:25 [this_cpu_xx V4 00/20] Introduce per cpu atomic operations and avoid per cpu address arithmetic cl
                   ` (19 preceding siblings ...)
  2009-10-01 21:25 ` [this_cpu_xx V4 20/20] SLUB: Experimental new fastpath w/o interrupt disable cl
@ 2009-10-02  9:30 ` Tejun Heo
  2009-10-02  9:54   ` Ingo Molnar
  2009-10-02 17:10   ` Christoph Lameter
  20 siblings, 2 replies; 65+ messages in thread
From: Tejun Heo @ 2009-10-02  9:30 UTC (permalink / raw)
  To: cl; +Cc: akpm, linux-kernel, mingo, rusty, Pekka Enberg

Hello,

cl@linux-foundation.org wrote:
> V3->V4:
> - Fix various macro definitions.
> - Provider experimental percpu based fastpath that does not disable
>   interrupts for SLUB.

The series looks very good to me.  percpu#for-next now has ia64 bits
included and the legacy allocator is gone there so it can carry this
series.  Sans the last one, they seem they can be stable and
incremental from now on, right?  Shall I include this series into the
percpu tree?

Thanks.

-- 
tejun

^ permalink raw reply	[flat|nested] 65+ messages in thread

* Re: [this_cpu_xx V4 00/20] Introduce per cpu atomic operations and avoid per cpu address arithmetic
  2009-10-02  9:30 ` [this_cpu_xx V4 00/20] Introduce per cpu atomic operations and avoid per cpu address arithmetic Tejun Heo
@ 2009-10-02  9:54   ` Ingo Molnar
  2009-10-02 17:15     ` Christoph Lameter
  2009-10-02 17:10   ` Christoph Lameter
  1 sibling, 1 reply; 65+ messages in thread
From: Ingo Molnar @ 2009-10-02  9:54 UTC (permalink / raw)
  To: Tejun Heo; +Cc: cl, akpm, linux-kernel, rusty, Pekka Enberg, Linus Torvalds

* Tejun Heo <tj@kernel.org> wrote:

> Hello,
> 
> cl@linux-foundation.org wrote:
> > V3->V4:
> > - Fix various macro definitions.
> > - Provider experimental percpu based fastpath that does not disable
> >   interrupts for SLUB.
> 
> The series looks very good to me. [...]

Seconded, very nice series!

One final step/cleanup seems to be missing from it: it should replace 
current uses of percpu_op() [percpu_read(), etc.] in the x86 tree and 
elsewhere with the new this_cpu_*() primitives. this_cpu_*() is using 
per_cpu_from_op/per_cpu_to_op directly, we dont need those percpu_op() 
variants anymore.

There should also be a kernel image size comparison done for that step, 
to make sure all the new primitives are optimized to the max on the 
instruction level.

> [...] percpu#for-next now has ia64 bits included and the legacy 
> allocator is gone there so it can carry this series.  Sans the last 
> one, they seem they can be stable and incremental from now on, right?  
> Shall I include this series into the percpu tree?

I'd definitely recommend doing that - it should be tested early and wide 
for v2.6.33, and together with other percpu bits.

	Ingo

^ permalink raw reply	[flat|nested] 65+ messages in thread

* Re: [this_cpu_xx V4 00/20] Introduce per cpu atomic operations and avoid per cpu address arithmetic
  2009-10-02  9:54   ` Ingo Molnar
@ 2009-10-02 17:15     ` Christoph Lameter
  2009-10-02 17:32       ` Ingo Molnar
  0 siblings, 1 reply; 65+ messages in thread
From: Christoph Lameter @ 2009-10-02 17:15 UTC (permalink / raw)
  To: Ingo Molnar
  Cc: Tejun Heo, akpm, linux-kernel, rusty, Pekka Enberg,
	Linus Torvalds

On Fri, 2 Oct 2009, Ingo Molnar wrote:

> One final step/cleanup seems to be missing from it: it should replace
> current uses of percpu_op() [percpu_read(), etc.] in the x86 tree and
> elsewhere with the new this_cpu_*() primitives. this_cpu_*() is using
> per_cpu_from_op/per_cpu_to_op directly, we dont need those percpu_op()
> variants anymore.

Well after things settle with this_cpu_xx we can drop those.

> There should also be a kernel image size comparison done for that step,
> to make sure all the new primitives are optimized to the max on the
> instruction level.

Right. There will be a time period in which other arches will need to add
support for this_cpu_xx first.



^ permalink raw reply	[flat|nested] 65+ messages in thread

* Re: [this_cpu_xx V4 00/20] Introduce per cpu atomic operations and avoid per cpu address arithmetic
  2009-10-02 17:15     ` Christoph Lameter
@ 2009-10-02 17:32       ` Ingo Molnar
  2009-10-02 17:49         ` Christoph Lameter
  0 siblings, 1 reply; 65+ messages in thread
From: Ingo Molnar @ 2009-10-02 17:32 UTC (permalink / raw)
  To: Christoph Lameter
  Cc: Tejun Heo, akpm, linux-kernel, rusty, Pekka Enberg,
	Linus Torvalds


* Christoph Lameter <cl@linux-foundation.org> wrote:

> On Fri, 2 Oct 2009, Ingo Molnar wrote:
> 
> > One final step/cleanup seems to be missing from it: it should 
> > replace current uses of percpu_op() [percpu_read(), etc.] in the x86 
> > tree and elsewhere with the new this_cpu_*() primitives. 
> > this_cpu_*() is using per_cpu_from_op/per_cpu_to_op directly, we 
> > dont need those percpu_op() variants anymore.
> 
> Well after things settle with this_cpu_xx we can drop those.
> 
> > There should also be a kernel image size comparison done for that 
> > step, to make sure all the new primitives are optimized to the max 
> > on the instruction level.
> 
> Right. There will be a time period in which other arches will need to 
> add support for this_cpu_xx first.

Size comparison should be only on architectures that support it (i.e. 
x86 right now). The generic fallbacks might be bloaty, no argument about 
that. ( => the more reason for any architecture to add optimizations for 
this_cpu_*() APIs. )

	Ingo

^ permalink raw reply	[flat|nested] 65+ messages in thread

* Re: [this_cpu_xx V4 00/20] Introduce per cpu atomic operations and avoid per cpu address arithmetic
  2009-10-02 17:32       ` Ingo Molnar
@ 2009-10-02 17:49         ` Christoph Lameter
  0 siblings, 0 replies; 65+ messages in thread
From: Christoph Lameter @ 2009-10-02 17:49 UTC (permalink / raw)
  To: Ingo Molnar
  Cc: Tejun Heo, akpm, linux-kernel, rusty, Pekka Enberg,
	Linus Torvalds

On Fri, 2 Oct 2009, Ingo Molnar wrote:

> > Right. There will be a time period in which other arches will need to
> > add support for this_cpu_xx first.
>
> Size comparison should be only on architectures that support it (i.e.
> x86 right now). The generic fallbacks might be bloaty, no argument about
> that. ( => the more reason for any architecture to add optimizations for
> this_cpu_*() APIs. )

The fallbacks basically generate the same code (at least for the core
code) that was there before. F.e.

Before:

#define SNMP_INC_STATS(mib, field)     \
       do { \
               per_cpu_ptr(mib[!in_softirq()], get_cpu())->mibs[field]++; \
               put_cpu(); \
       } while (0)

After

#define SNMP_INC_STATS_USER(mib, field)        \
                       this_cpu_inc(mib[1]->mibs[field])

For the x86 case this means that we can use a simple atomic increment
with a segment prefix to do all the work.

The fallback case for arches not providing per cpu atomics is:

	preempt_disable();
	*__this_cpu_ptr(&mib[1]->mibs[field]) += 1;
	preempt_enable();

If the arch can optimize __this_cpu_ptr (and provides __my_cpu_offset)
because it has the per cpu offset of the local cpu in some priviledged
location then this is still going to be a win since we avoid
smp_processor_id() entirely and we also avoid the array lookup.

If the arch has no such mechanism then we fall back for this_cpu_ptr too:

#ifndef __my_cpu_offset
#define __my_cpu_offset per_cpu_offset(raw_smp_processor_id())
#endif

And then the result in terms of overhead is the same as before the
per_cpu_xx patches since get_cpu() does both a preempt_disable as well as
a smp_processor_id() call.

^ permalink raw reply	[flat|nested] 65+ messages in thread

* Re: [this_cpu_xx V4 00/20] Introduce per cpu atomic operations and avoid per cpu address arithmetic
  2009-10-02  9:30 ` [this_cpu_xx V4 00/20] Introduce per cpu atomic operations and avoid per cpu address arithmetic Tejun Heo
  2009-10-02  9:54   ` Ingo Molnar
@ 2009-10-02 17:10   ` Christoph Lameter
  1 sibling, 0 replies; 65+ messages in thread
From: Christoph Lameter @ 2009-10-02 17:10 UTC (permalink / raw)
  To: Tejun Heo; +Cc: akpm, linux-kernel, mingo, rusty, Pekka Enberg

On Fri, 2 Oct 2009, Tejun Heo wrote:

> cl@linux-foundation.org wrote:
> > V3->V4:
> > - Fix various macro definitions.
> > - Provider experimental percpu based fastpath that does not disable
> >   interrupts for SLUB.
>
> The series looks very good to me.  percpu#for-next now has ia64 bits
> included and the legacy allocator is gone there so it can carry this
> series.  Sans the last one, they seem they can be stable and
> incremental from now on, right?  Shall I include this series into the
> percpu tree?

You can include all but the last patch that is experimental.


^ permalink raw reply	[flat|nested] 65+ messages in thread

end of thread, other threads:[~2009-10-13 11:52 UTC | newest]

Thread overview: 65+ messages (download: mbox.gz follow: Atom feed
-- links below jump to the message on this page --
2009-10-01 21:25 [this_cpu_xx V4 00/20] Introduce per cpu atomic operations and avoid per cpu address arithmetic cl
2009-10-01 21:25 ` [this_cpu_xx V4 01/20] Introduce this_cpu_ptr() and generic this_cpu_* operations cl
2009-10-02  9:16   ` Tejun Heo
2009-10-02  9:34   ` Ingo Molnar
2009-10-02 17:11     ` Christoph Lameter
2009-10-06 10:04       ` Rusty Russell
2009-10-06 23:39         ` Christoph Lameter
2009-10-06 23:55         ` Tejun Heo
2009-10-08 17:57         ` [Patchs vs. percpu-next] Use this_cpu_xx to dynamically allocate counters Christoph Lameter
2009-10-13 11:51           ` Rusty Russell
2009-10-08 18:06         ` Christoph Lameter
2009-10-01 21:25 ` [this_cpu_xx V4 02/20] this_cpu: X86 optimized this_cpu operations cl
2009-10-02  9:18   ` Tejun Heo
2009-10-02  9:59   ` Ingo Molnar
2009-10-03 19:33     ` Pekka Enberg
2009-10-04 16:47       ` Ingo Molnar
2009-10-04 16:51         ` Pekka Enberg
2009-10-01 21:25 ` [this_cpu_xx V4 03/20] Use this_cpu operations for SNMP statistics cl
2009-10-01 21:25 ` [this_cpu_xx V4 04/20] Use this_cpu operations for NFS statistics cl
2009-10-01 21:25 ` [this_cpu_xx V4 05/20] use this_cpu ops for network statistics cl
2009-10-01 21:25 ` [this_cpu_xx V4 06/20] this_cpu_ptr: Straight transformations cl
2009-10-01 21:25 ` [this_cpu_xx V4 07/20] this_cpu_ptr: Eliminate get/put_cpu cl
2009-10-01 21:25 ` [this_cpu_xx V4 08/20] this_cpu_ptr: xfs_icsb_modify_counters does not need "cpu" variable cl
2009-10-01 21:25 ` [this_cpu_xx V4 09/20] Use this_cpu_ptr in crypto subsystem cl
2009-10-01 21:25 ` [this_cpu_xx V4 10/20] Use this_cpu ops for VM statistics cl
2009-10-01 21:25 ` [this_cpu_xx V4 11/20] RCU: Use this_cpu operations cl
2009-10-03 10:52   ` Tejun Heo
2009-10-01 21:25 ` [this_cpu_xx V4 12/20] Move early initialization of pagesets out of zone_wait_table_init() cl
2009-10-02 14:16   ` Mel Gorman
2009-10-02 17:30     ` Christoph Lameter
2009-10-05  9:35       ` Mel Gorman
2009-10-03 10:29   ` Tejun Heo
2009-10-05 14:39     ` Christoph Lameter
2009-10-05 15:01       ` Tejun Heo
2009-10-05 15:06         ` Christoph Lameter
2009-10-05 15:21           ` Tejun Heo
2009-10-05 15:28             ` Christoph Lameter
2009-10-05 15:41               ` Tejun Heo
2009-10-05 15:39                 ` Christoph Lameter
2009-10-01 21:25 ` [this_cpu_xx V4 13/20] this_cpu_ops: page allocator conversion cl
2009-10-02 15:14   ` Mel Gorman
2009-10-02 17:39     ` Christoph Lameter
2009-10-05  9:45       ` Mel Gorman
2009-10-05 14:43         ` Christoph Lameter
2009-10-05 14:55           ` Christoph Lameter
2009-10-06  9:45             ` Mel Gorman
2009-10-06 16:34               ` Christoph Lameter
2009-10-06 17:03                 ` Mel Gorman
2009-10-06 17:51                   ` Christoph Lameter
2009-10-06 18:36                     ` Mel Gorman
2009-10-06 19:06                       ` Christoph Lameter
2009-10-07 10:42                         ` Mel Gorman
2009-10-01 21:25 ` [this_cpu_xx V4 14/20] this_cpu ops: Remove pageset_notifier cl
2009-10-01 21:25 ` [this_cpu_xx V4 15/20] Use this_cpu operations in slub cl
2009-10-01 21:25 ` [this_cpu_xx V4 16/20] SLUB: Get rid of dynamic DMA kmalloc cache allocation cl
2009-10-01 21:25 ` [this_cpu_xx V4 17/20] this_cpu: Remove slub kmem_cache fields cl
2009-10-01 21:25 ` [this_cpu_xx V4 18/20] Make slub statistics use this_cpu_inc cl
2009-10-01 21:25 ` [this_cpu_xx V4 19/20] this_cpu: slub aggressive use of this_cpu operations in the hotpaths cl
2009-10-01 21:25 ` [this_cpu_xx V4 20/20] SLUB: Experimental new fastpath w/o interrupt disable cl
2009-10-02  9:30 ` [this_cpu_xx V4 00/20] Introduce per cpu atomic operations and avoid per cpu address arithmetic Tejun Heo
2009-10-02  9:54   ` Ingo Molnar
2009-10-02 17:15     ` Christoph Lameter
2009-10-02 17:32       ` Ingo Molnar
2009-10-02 17:49         ` Christoph Lameter
2009-10-02 17:10   ` Christoph Lameter

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).