* [patch 0/7] [RFC] SLUB: Improve allocpercpu to reduce per cpu access overhead
@ 2007-11-01 0:02 Christoph Lameter
2007-11-01 0:02 ` [patch 1/7] allocpercpu: Make it a true per cpu allocator by allocating from a per cpu array Christoph Lameter
From: Christoph Lameter @ 2007-11-01 0:02 UTC (permalink / raw)
To: akpm; +Cc: linux-arch, linux-kernel, Mathieu Desnoyers, Pekka Enberg
This patchset increases the speed of the SLUB fastpath by
improving the per cpu allocator and making it usable for SLUB.
Currently allocpercpu manages arrays of pointers to per cpu objects.
This means that it has to allocate the arrays and then populate them
as needed with objects. Although these objects are called per cpu
objects they cannot be addressed like regular per cpu objects, i.e.
by adding the per cpu offset of the respective cpu.
The patches here change that. We create a small memory pool in the
percpu area and allocate from there when alloc_percpu() is called.
As a result we no longer need a per cpu pointer array for each
object. This reduces memory usage and also the cache footprint
of allocpercpu users. In addition, the per cpu objects of a single
processor are packed tightly next to each other, decreasing the
cache footprint even further and making it possible to access
multiple objects in the same cacheline.
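In effect the lookup for a given cpu changes from dereferencing a per-object
pointer array to plain offset arithmetic. A minimal sketch of the percpu_ptr()
change made in patch 1/7 (ptr is the cookie returned by alloc_percpu()):

        void *p;

        /* before: each object carries its own NR_CPUS sized pointer array */
        p = ((struct percpu_data *)__percpu_disguise(ptr))->ptrs[cpu];

        /* after: one pool per cpu, reached via the shared per cpu offsets */
        p = (void *)__percpu_disguise(ptr) + per_cpu_offset(cpu);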
SLUB already implements the same mechanism internally. After fixing up
the allocpercpu code we throw the SLUB-private method out and use the
new allocpercpu handling. Then we optimize allocpercpu addressing
by adding a new function
this_cpu_ptr()
that determines the per cpu pointer for the current processor in a
more efficient way on many platforms.
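As a sketch of what this buys in the SLUB fastpath (the actual conversion is
done in patch 7/7):

        struct kmem_cache_cpu *c;

        /* before: index the per cpu pointer through the cpu id */
        c = get_cpu_slab(s, smp_processor_id());

        /* after: let the architecture reach its own per cpu area directly */
        c = this_cpu_ptr(s->cpu_slab);

Architectures that keep the local per cpu offset in a register or reach it
with a segment prefix can resolve this_cpu_ptr() without the indexed lookup.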
This increases the speed of SLUB (and likely other kernel subsystems
that benefit from the allocpercpu enhancements):
 Size    SLAB    SLUB   SLUB+  SLUB-o  SLUB-a   (cycles)
    8      96      86      45      44      38   3 *
   16      84      92      49      48      43   2 *
   32      84     106      61      59      53   +++
   64     102     129      82      88      75   ++
  128     147     226     188     181     176   -
  256     200     248     207     285     204   =
  512     300     301     260     209     250   +
 1024     416     440     398     264     391   ++
 2048     720     542     530     390     511   +++
 4096    1254     342     342     336     376   3 *

alloc/free test (cycles):
   SLAB    SLUB   SLUB+  SLUB-o  SLUB-a
137-146     151   68-72   68-74   56-58   3 *
Note: The per cpu optimizations are only halfway there because of the screwed
up way that x86_64 handles its per cpu area, which causes additional cycles to
be spent retrieving a pointer from memory and adding it to the address.
The i386 code is much less cycle intensive since it can reach per cpu
data using a segment prefix. If we can get that to work on x86_64
then we may be able to get the cycle count for the fastpath down to 20-30
cycles.
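For reference, the per architecture offset lookup is introduced by patch 4/7.
The two x86 definitions from that patch illustrate the difference:

        /* i386: the offset is read with a segment prefixed access */
        #define this_cpu_offset()       x86_read_percpu(this_cpu_off)

        /* x86_64: the offset is first loaded from the pda in memory */
        #define this_cpu_offset()       read_pda(data_offset)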
--
* [patch 1/7] allocpercpu: Make it a true per cpu allocator by allocating from a per cpu array
2007-11-01 0:02 [patch 0/7] [RFC] SLUB: Improve allocpercpu to reduce per cpu access overhead Christoph Lameter
@ 2007-11-01 0:02 ` Christoph Lameter
2007-11-01 7:24 ` Eric Dumazet
2007-11-01 0:02 ` [patch 2/7] allocpercpu: Remove functions that are rarely used Christoph Lameter
From: Christoph Lameter @ 2007-11-01 0:02 UTC (permalink / raw)
To: akpm; +Cc: linux-arch, linux-kernel, Mathieu Desnoyers, Pekka Enberg
[-- Attachment #1: newallocpercpu --]
[-- Type: text/plain, Size: 10328 bytes --]
Currently each call to alloc_percpu allocates an array of pointers to objects.
For each operation on a percpu structure we need to follow a pointer from that
map. A processor usually uses only the entry for its own processor id in that
array; the rest of the bytes in the cacheline are not needed. This repeats
itself for each and every per cpu array in use.
Moreover the result of alloc_percpu is not a variable that can be handled like
a regular per cpu variable.
The approach here changes the way allocpercpu is done. Objects are placed
in preallocated per cpu areas that are indexed via the existing per cpu array
of pointers. So we have a single array of pointers to per cpu areas that
is used by all per cpu operations. The data for each processor is placed
tightly next to each other so that the likelihood of a single cache line
covering data for multiple needs is increased. The cache footprint of the
allocpercpu operations shrinks dramatically. Some processors have the ability
to map the per cpu area of the current processor in a special way so that
variables in that area can be reached very efficiently. It is rather typical
that a processor only uses its own per processor area. On many architectures
the indexing via the per cpu array can then be completely bypassed.
The size of the per cpu alloc area is defined to be 32k per processor for now.
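For illustration, with the constants chosen in the patch below (UNIT_SIZE ==
sizeof(unsigned long long) == 8 bytes) the 32k pool provides 4096 allocation
units per cpu. A hypothetical 20 byte allocation takes DIV_ROUND_UP(20, 8) = 3
units, recorded in cpu_alloc_map as USED, USED, END.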
Another advantage of this approach is that the onlining and offlining
of the per cpu items is handled in a global way. When a cpu is onlined all
objects become present without callbacks. Similarly, when a cpu is offlined
all its per cpu objects vanish without the need for callbacks. Callbacks
may still be needed to do preparation and cleanup of the data areas,
but the freeing and allocation of the per cpu areas no longer needs to
be done by the subsystems.
Signed-off-by: Christoph Lameter <clameter@sgi.com>
---
include/linux/percpu.h | 14 +---
include/linux/vmstat.h | 2
mm/allocpercpu.c | 163 +++++++++++++++++++++++++++++++++++++++++++------
mm/vmstat.c | 1
4 files changed, 152 insertions(+), 28 deletions(-)
Index: linux-2.6/mm/allocpercpu.c
===================================================================
--- linux-2.6.orig/mm/allocpercpu.c 2007-10-31 16:39:13.584621383 -0700
+++ linux-2.6/mm/allocpercpu.c 2007-10-31 16:39:15.924121250 -0700
@@ -2,10 +2,140 @@
* linux/mm/allocpercpu.c
*
* Separated from slab.c August 11, 2006 Christoph Lameter <clameter@sgi.com>
+ *
+ * (C) 2007 SGI, Christoph Lameter <clameter@sgi.com>
+ * Basic implementation with allocation and free from a dedicated per
+ * cpu area.
+ *
+ * The per cpu allocator allows allocation of memory from a statically
+ * allocated per cpu array and consists of cells of UNIT_SIZE. A byte array
+ * is used to describe the state of each of the available units that can be
+ * allocated via cpu_alloc() and freed via cpu_free(). The possible states are:
+ *
+ * FREE = The per cpu unit is not allocated
+ * USED = The per cpu unit is allocated and more units follow.
+ * END = The last per cpu unit used for an allocation (needed to
+ * establish the size of the allocation on free)
+ *
+ * The per cpu allocator is typically used to allocate small sized object from 8 to 32
+ * bytes and it is rarely used. Allocation is looking for the first available object
+ * in the cpu_alloc_map. If the allocator would be used frequently with varying sizes
+ * of objects then we may end up with fragmentation.
*/
#include <linux/mm.h>
#include <linux/module.h>
+/*
+ * Maximum allowed per cpu data per cpu
+ */
+#define PER_CPU_ALLOC_SIZE 32768
+
+#define UNIT_SIZE sizeof(unsigned long long)
+#define UNITS_PER_CPU (PER_CPU_ALLOC_SIZE / UNIT_SIZE)
+
+enum unit_type { FREE, END, USED };
+
+static u8 cpu_alloc_map[UNITS_PER_CPU] = { 1, };
+static DEFINE_SPINLOCK(cpu_alloc_map_lock);
+static DEFINE_PER_CPU(int, cpu_area)[UNITS_PER_CPU];
+
+#define CPU_DATA_OFFSET ((unsigned long)&per_cpu__cpu_area)
+
+/*
+ * How many units are needed for an object of a given size
+ */
+static int size_to_units(unsigned long size)
+{
+ return DIV_ROUND_UP(size, UNIT_SIZE);
+}
+
+/*
+ * Mark an object as used in the cpu_alloc_map
+ *
+ * Must hold cpu_alloc_map_lock
+ */
+static void set_map(int start, int length)
+{
+ cpu_alloc_map[start + length - 1] = END;
+ if (length > 1)
+ memset(cpu_alloc_map + start, USED, length - 1);
+}
+
+/*
+ * Mark an area as freed.
+ *
+ * Must hold cpu_alloc_map_lock
+ *
+ * Return the number of units taken up by the object freed.
+ */
+static int clear_map(int start)
+{
+ int units = 0;
+
+ while (cpu_alloc_map[start + units] == USED) {
+ cpu_alloc_map[start + units] = FREE;
+ units++;
+ }
+ BUG_ON(cpu_alloc_map[start] != END);
+ cpu_alloc_map[start] = FREE;
+ return units + 1;
+}
+
+/*
+ * Allocate an object of a certain size
+ *
+ * Returns a per cpu pointer that must not be directly used.
+ */
+static void *cpu_alloc(unsigned long size)
+{
+ unsigned long start = 0;
+ int units = size_to_units(size);
+ unsigned end;
+
+ spin_lock(&cpu_alloc_map_lock);
+ do {
+ while (start < UNITS_PER_CPU &&
+ cpu_alloc_map[start] != FREE)
+ start++;
+ if (start == UNITS_PER_CPU)
+ return NULL;
+
+ end = start + 1;
+ while (end < UNITS_PER_CPU && end - start < units &&
+ cpu_alloc_map[end] == FREE)
+ end++;
+ if (end - start == units)
+ break;
+ start = end;
+ } while (1);
+
+ set_map(start, units);
+ __count_vm_events(ALLOC_PERCPU, units * UNIT_SIZE);
+ spin_unlock(&cpu_alloc_map_lock);
+ return (void *)(start * UNIT_SIZE + CPU_DATA_OFFSET);
+}
+
+/*
+ * Free an object. The pointer must be a per cpu pointer allocated
+ * via cpu_alloc.
+ */
+static inline void cpu_free(void *pcpu)
+{
+ unsigned long start = (unsigned long)pcpu;
+ int index;
+ int units;
+
+ BUG_ON(start < CPU_DATA_OFFSET);
+ index = (start - CPU_DATA_OFFSET) / UNIT_SIZE;
+ BUG_ON(cpu_alloc_map[index] == FREE ||
+ index >= UNITS_PER_CPU);
+
+ spin_lock(&cpu_alloc_map_lock);
+ units = clear_map(index);
+ __count_vm_events(ALLOC_PERCPU, -units * UNIT_SIZE);
+ spin_unlock(&cpu_alloc_map_lock);
+}
+
/**
* percpu_depopulate - depopulate per-cpu data for given cpu
* @__pdata: per-cpu data to depopulate
@@ -16,10 +146,10 @@
*/
void percpu_depopulate(void *__pdata, int cpu)
{
- struct percpu_data *pdata = __percpu_disguise(__pdata);
-
- kfree(pdata->ptrs[cpu]);
- pdata->ptrs[cpu] = NULL;
+ /*
+ * Nothing to do here. Removal can only be effected for all
+ * per cpu areas of a cpu at once.
+ */
}
EXPORT_SYMBOL_GPL(percpu_depopulate);
@@ -30,9 +160,9 @@ EXPORT_SYMBOL_GPL(percpu_depopulate);
*/
void __percpu_depopulate_mask(void *__pdata, cpumask_t *mask)
{
- int cpu;
- for_each_cpu_mask(cpu, *mask)
- percpu_depopulate(__pdata, cpu);
+ /*
+ * Nothing to do
+ */
}
EXPORT_SYMBOL_GPL(__percpu_depopulate_mask);
@@ -49,15 +179,11 @@ EXPORT_SYMBOL_GPL(__percpu_depopulate_ma
*/
void *percpu_populate(void *__pdata, size_t size, gfp_t gfp, int cpu)
{
- struct percpu_data *pdata = __percpu_disguise(__pdata);
- int node = cpu_to_node(cpu);
+ int pdata = (unsigned long)__percpu_disguise(__pdata);
+ void *p = (void *)per_cpu_offset(cpu) + pdata;
- BUG_ON(pdata->ptrs[cpu]);
- if (node_online(node))
- pdata->ptrs[cpu] = kmalloc_node(size, gfp|__GFP_ZERO, node);
- else
- pdata->ptrs[cpu] = kzalloc(size, gfp);
- return pdata->ptrs[cpu];
+ memset(p, 0, size);
+ return p;
}
EXPORT_SYMBOL_GPL(percpu_populate);
@@ -98,14 +224,13 @@ EXPORT_SYMBOL_GPL(__percpu_populate_mask
*/
void *__percpu_alloc_mask(size_t size, gfp_t gfp, cpumask_t *mask)
{
- void *pdata = kzalloc(sizeof(struct percpu_data), gfp);
+ void *pdata = cpu_alloc(size);
void *__pdata = __percpu_disguise(pdata);
if (unlikely(!pdata))
return NULL;
if (likely(!__percpu_populate_mask(__pdata, size, gfp, mask)))
return __pdata;
- kfree(pdata);
return NULL;
}
EXPORT_SYMBOL_GPL(__percpu_alloc_mask);
@@ -121,7 +246,7 @@ void percpu_free(void *__pdata)
{
if (unlikely(!__pdata))
return;
- __percpu_depopulate_mask(__pdata, &cpu_possible_map);
- kfree(__percpu_disguise(__pdata));
+ cpu_free(__percpu_disguise(__pdata));
}
EXPORT_SYMBOL_GPL(percpu_free);
+
Index: linux-2.6/include/linux/percpu.h
===================================================================
--- linux-2.6.orig/include/linux/percpu.h 2007-10-31 16:39:13.596621181 -0700
+++ linux-2.6/include/linux/percpu.h 2007-10-31 16:40:04.831371052 -0700
@@ -33,20 +33,18 @@
#ifdef CONFIG_SMP
-struct percpu_data {
- void *ptrs[NR_CPUS];
-};
+#define __percpu_disguise(pdata) ((void *)~(unsigned long)(pdata))
-#define __percpu_disguise(pdata) (struct percpu_data *)~(unsigned long)(pdata)
/*
* Use this to get to a cpu's version of the per-cpu object dynamically
* allocated. Non-atomic access to the current CPU's version should
* probably be combined with get_cpu()/put_cpu().
*/
-#define percpu_ptr(ptr, cpu) \
-({ \
- struct percpu_data *__p = __percpu_disguise(ptr); \
- (__typeof__(ptr))__p->ptrs[(cpu)]; \
+#define percpu_ptr(ptr, cpu) \
+({ \
+ void *p = __percpu_disguise(ptr); \
+ unsigned long q = per_cpu_offset(cpu); \
+ (__typeof__(ptr))(p + q); \
})
extern void *percpu_populate(void *__pdata, size_t size, gfp_t gfp, int cpu);
Index: linux-2.6/include/linux/vmstat.h
===================================================================
--- linux-2.6.orig/include/linux/vmstat.h 2007-10-31 16:39:13.604621189 -0700
+++ linux-2.6/include/linux/vmstat.h 2007-10-31 16:39:15.924121250 -0700
@@ -36,7 +36,7 @@ enum vm_event_item { PGPGIN, PGPGOUT, PS
FOR_ALL_ZONES(PGSCAN_KSWAPD),
FOR_ALL_ZONES(PGSCAN_DIRECT),
PGINODESTEAL, SLABS_SCANNED, KSWAPD_STEAL, KSWAPD_INODESTEAL,
- PAGEOUTRUN, ALLOCSTALL, PGROTATED,
+ PAGEOUTRUN, ALLOCSTALL, PGROTATED, ALLOC_PERCPU,
NR_VM_EVENT_ITEMS
};
Index: linux-2.6/mm/vmstat.c
===================================================================
--- linux-2.6.orig/mm/vmstat.c 2007-10-31 16:39:13.592621141 -0700
+++ linux-2.6/mm/vmstat.c 2007-10-31 16:39:15.924121250 -0700
@@ -642,6 +642,7 @@ static const char * const vmstat_text[]
"allocstall",
"pgrotated",
+ "alloc_percpu_bytes",
#endif
};
--
* [patch 2/7] allocpercpu: Remove functions that are rarely used.
2007-11-01 0:02 [patch 0/7] [RFC] SLUB: Improve allocpercpu to reduce per cpu access overhead Christoph Lameter
2007-11-01 0:02 ` [patch 1/7] allocpercpu: Make it a true per cpu allocator by allocating from a per cpu array Christoph Lameter
@ 2007-11-01 0:02 ` Christoph Lameter
2007-11-01 0:02 ` [patch 3/7] Allocpercpu: Do __percpu_disguise() only if CONFIG_DEBUG_VM is set Christoph Lameter
From: Christoph Lameter @ 2007-11-01 0:02 UTC (permalink / raw)
To: akpm; +Cc: linux-arch, linux-kernel, Mathieu Desnoyers, Pekka Enberg
[-- Attachment #1: rm_old_function --]
[-- Type: text/plain, Size: 8966 bytes --]
Population and depopulation are no longer needed since newly created
per cpu areas already contain everything that is needed. Teardown of per cpu
areas removes the objects that are no longer needed.
This basically reverts the API to the way it was before the population and
depopulation support went in. There is only a single user of these functions
in the kernel, net/iucv/iucv.c, which is S/390 specific.
Remove the now useless population and depopulation calls there. That driver
contains the single occurrence of a per cpu allocation that uses GFP flags.
The allocation from the DMA zone is required in order to have memory below 2G.
But it seems that the per cpu areas are also under 2G, so we are fine there.
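After this patch the caller-visible interface reduces to the following (a
sketch; struct foo and its count field are placeholders, not part of the patch):

        struct foo { unsigned long count; };
        struct foo *p;
        int cpu;

        p = alloc_percpu(struct foo);           /* zeroed on every possible cpu */
        for_each_online_cpu(cpu)
                per_cpu_ptr(p, cpu)->count++;   /* no populate step needed */
        free_percpu(p);

There are no GFP flags anymore and no populate/depopulate calls are needed in
cpu hotplug notifiers.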
Signed-off-by: Christoph Lameter <clameter@sgi.com>
---
include/linux/percpu.h | 42 +------------------
mm/allocpercpu.c | 104 +++++--------------------------------------------
net/iucv/iucv.c | 31 +++-----------
3 files changed, 21 insertions(+), 156 deletions(-)
Index: linux-2.6/include/linux/percpu.h
===================================================================
--- linux-2.6.orig/include/linux/percpu.h 2007-10-31 16:40:04.831371052 -0700
+++ linux-2.6/include/linux/percpu.h 2007-10-31 16:40:14.892121256 -0700
@@ -47,41 +47,16 @@
(__typeof__(ptr))(p + q); \
})
-extern void *percpu_populate(void *__pdata, size_t size, gfp_t gfp, int cpu);
-extern void percpu_depopulate(void *__pdata, int cpu);
-extern int __percpu_populate_mask(void *__pdata, size_t size, gfp_t gfp,
- cpumask_t *mask);
-extern void __percpu_depopulate_mask(void *__pdata, cpumask_t *mask);
-extern void *__percpu_alloc_mask(size_t size, gfp_t gfp, cpumask_t *mask);
+extern void *__alloc_percpu(size_t size);
extern void percpu_free(void *__pdata);
#else /* CONFIG_SMP */
#define percpu_ptr(ptr, cpu) ({ (void)(cpu); (ptr); })
-static inline void percpu_depopulate(void *__pdata, int cpu)
+static __always_inline void *__alloc_percpu(size_t size)
{
-}
-
-static inline void __percpu_depopulate_mask(void *__pdata, cpumask_t *mask)
-{
-}
-
-static inline void *percpu_populate(void *__pdata, size_t size, gfp_t gfp,
- int cpu)
-{
- return percpu_ptr(__pdata, cpu);
-}
-
-static inline int __percpu_populate_mask(void *__pdata, size_t size, gfp_t gfp,
- cpumask_t *mask)
-{
- return 0;
-}
-
-static __always_inline void *__percpu_alloc_mask(size_t size, gfp_t gfp, cpumask_t *mask)
-{
- return kzalloc(size, gfp);
+ return kzalloc(size, GFP_KERNEL);
}
static inline void percpu_free(void *__pdata)
@@ -91,19 +66,8 @@ static inline void percpu_free(void *__p
#endif /* CONFIG_SMP */
-#define percpu_populate_mask(__pdata, size, gfp, mask) \
- __percpu_populate_mask((__pdata), (size), (gfp), &(mask))
-#define percpu_depopulate_mask(__pdata, mask) \
- __percpu_depopulate_mask((__pdata), &(mask))
-#define percpu_alloc_mask(size, gfp, mask) \
- __percpu_alloc_mask((size), (gfp), &(mask))
-
-#define percpu_alloc(size, gfp) percpu_alloc_mask((size), (gfp), cpu_online_map)
-
/* (legacy) interface for use without CPU hotplug handling */
-#define __alloc_percpu(size) percpu_alloc_mask((size), GFP_KERNEL, \
- cpu_possible_map)
#define alloc_percpu(type) (type *)__alloc_percpu(sizeof(type))
#define free_percpu(ptr) percpu_free((ptr))
#define per_cpu_ptr(ptr, cpu) percpu_ptr((ptr), (cpu))
Index: linux-2.6/mm/allocpercpu.c
===================================================================
--- linux-2.6.orig/mm/allocpercpu.c 2007-10-31 16:39:15.924121250 -0700
+++ linux-2.6/mm/allocpercpu.c 2007-10-31 16:40:14.892121256 -0700
@@ -136,111 +136,29 @@ static inline void cpu_free(void *pcpu)
spin_unlock(&cpu_alloc_map_lock);
}
-/**
- * percpu_depopulate - depopulate per-cpu data for given cpu
- * @__pdata: per-cpu data to depopulate
- * @cpu: depopulate per-cpu data for this cpu
- *
- * Depopulating per-cpu data for a cpu going offline would be a typical
- * use case. You need to register a cpu hotplug handler for that purpose.
- */
-void percpu_depopulate(void *__pdata, int cpu)
-{
- /*
- * Nothing to do here. Removal can only be effected for all
- * per cpu areas of a cpu at once.
- */
-}
-EXPORT_SYMBOL_GPL(percpu_depopulate);
-
-/**
- * percpu_depopulate_mask - depopulate per-cpu data for some cpu's
- * @__pdata: per-cpu data to depopulate
- * @mask: depopulate per-cpu data for cpu's selected through mask bits
- */
-void __percpu_depopulate_mask(void *__pdata, cpumask_t *mask)
-{
- /*
- * Nothing to do
- */
-}
-EXPORT_SYMBOL_GPL(__percpu_depopulate_mask);
-
-/**
- * percpu_populate - populate per-cpu data for given cpu
- * @__pdata: per-cpu data to populate further
- * @size: size of per-cpu object
- * @gfp: may sleep or not etc.
- * @cpu: populate per-data for this cpu
- *
- * Populating per-cpu data for a cpu coming online would be a typical
- * use case. You need to register a cpu hotplug handler for that purpose.
- * Per-cpu object is populated with zeroed buffer.
+/*
+ * Allocate a per cpu array and zero all the per cpu objects.
+ * This is the externally visible function.
*/
-void *percpu_populate(void *__pdata, size_t size, gfp_t gfp, int cpu)
-{
- int pdata = (unsigned long)__percpu_disguise(__pdata);
- void *p = (void *)per_cpu_offset(cpu) + pdata;
-
- memset(p, 0, size);
- return p;
-}
-EXPORT_SYMBOL_GPL(percpu_populate);
-
-/**
- * percpu_populate_mask - populate per-cpu data for more cpu's
- * @__pdata: per-cpu data to populate further
- * @size: size of per-cpu object
- * @gfp: may sleep or not etc.
- * @mask: populate per-cpu data for cpu's selected through mask bits
- *
- * Per-cpu objects are populated with zeroed buffers.
- */
-int __percpu_populate_mask(void *__pdata, size_t size, gfp_t gfp,
- cpumask_t *mask)
-{
- cpumask_t populated = CPU_MASK_NONE;
- int cpu;
-
- for_each_cpu_mask(cpu, *mask)
- if (unlikely(!percpu_populate(__pdata, size, gfp, cpu))) {
- __percpu_depopulate_mask(__pdata, &populated);
- return -ENOMEM;
- } else
- cpu_set(cpu, populated);
- return 0;
-}
-EXPORT_SYMBOL_GPL(__percpu_populate_mask);
-
-/**
- * percpu_alloc_mask - initial setup of per-cpu data
- * @size: size of per-cpu object
- * @gfp: may sleep or not etc.
- * @mask: populate per-data for cpu's selected through mask bits
- *
- * Populating per-cpu data for all online cpu's would be a typical use case,
- * which is simplified by the percpu_alloc() wrapper.
- * Per-cpu objects are populated with zeroed buffers.
- */
-void *__percpu_alloc_mask(size_t size, gfp_t gfp, cpumask_t *mask)
+void *__alloc_percpu(size_t size)
{
void *pdata = cpu_alloc(size);
void *__pdata = __percpu_disguise(pdata);
+ int cpu;
if (unlikely(!pdata))
return NULL;
- if (likely(!__percpu_populate_mask(__pdata, size, gfp, mask)))
- return __pdata;
- return NULL;
+
+ for_each_possible_cpu(cpu)
+ memset(per_cpu_ptr(__pdata, cpu) , 0, size);
+
+ return __pdata;
}
-EXPORT_SYMBOL_GPL(__percpu_alloc_mask);
+EXPORT_SYMBOL_GPL(__alloc_percpu);
/**
* percpu_free - final cleanup of per-cpu data
* @__pdata: object to clean up
- *
- * We simply clean up any per-cpu object left. No need for the client to
- * track and specify through a bis mask which per-cpu objects are to free.
*/
void percpu_free(void *__pdata)
{
Index: linux-2.6/net/iucv/iucv.c
===================================================================
--- linux-2.6.orig/net/iucv/iucv.c 2007-10-31 16:39:13.001121287 -0700
+++ linux-2.6/net/iucv/iucv.c 2007-10-31 16:40:14.892121256 -0700
@@ -556,25 +556,6 @@ static int __cpuinit iucv_cpu_notify(str
long cpu = (long) hcpu;
switch (action) {
- case CPU_UP_PREPARE:
- case CPU_UP_PREPARE_FROZEN:
- if (!percpu_populate(iucv_irq_data,
- sizeof(struct iucv_irq_data),
- GFP_KERNEL|GFP_DMA, cpu))
- return NOTIFY_BAD;
- if (!percpu_populate(iucv_param, sizeof(union iucv_param),
- GFP_KERNEL|GFP_DMA, cpu)) {
- percpu_depopulate(iucv_irq_data, cpu);
- return NOTIFY_BAD;
- }
- break;
- case CPU_UP_CANCELED:
- case CPU_UP_CANCELED_FROZEN:
- case CPU_DEAD:
- case CPU_DEAD_FROZEN:
- percpu_depopulate(iucv_param, cpu);
- percpu_depopulate(iucv_irq_data, cpu);
- break;
case CPU_ONLINE:
case CPU_ONLINE_FROZEN:
case CPU_DOWN_FAILED:
@@ -1617,16 +1598,18 @@ static int __init iucv_init(void)
rc = PTR_ERR(iucv_root);
goto out_bus;
}
- /* Note: GFP_DMA used to get memory below 2G */
- iucv_irq_data = percpu_alloc(sizeof(struct iucv_irq_data),
- GFP_KERNEL|GFP_DMA);
+ /*
+ * Note: GFP_DMA used to get memory below 2G.
+ *
+ * The percpu data is below 2G right ? So this should work too -cl?
+ */
+ iucv_irq_data = percpu_alloc(struct iucv_irq_data);
if (!iucv_irq_data) {
rc = -ENOMEM;
goto out_root;
}
/* Allocate parameter blocks. */
- iucv_param = percpu_alloc(sizeof(union iucv_param),
- GFP_KERNEL|GFP_DMA);
+ iucv_param = percpu_alloc(union iucv_param);
if (!iucv_param) {
rc = -ENOMEM;
goto out_extint;
--
* [patch 3/7] Allocpercpu: Do __percpu_disguise() only if CONFIG_DEBUG_VM is set
2007-11-01 0:02 [patch 0/7] [RFC] SLUB: Improve allocpercpu to reduce per cpu access overhead Christoph Lameter
2007-11-01 0:02 ` [patch 1/7] allocpercpu: Make it a true per cpu allocator by allocating from a per cpu array Christoph Lameter
2007-11-01 0:02 ` [patch 2/7] allocpercpu: Remove functions that are rarely used Christoph Lameter
@ 2007-11-01 0:02 ` Christoph Lameter
2007-11-01 7:25 ` Eric Dumazet
2007-11-01 0:02 ` [patch 4/7] Percpu: Add support for this_cpu_offset() to be able to create this_cpu_ptr() Christoph Lameter
From: Christoph Lameter @ 2007-11-01 0:02 UTC (permalink / raw)
To: akpm; +Cc: linux-arch, linux-kernel, Mathieu Desnoyers, Pekka Enberg
[-- Attachment #1: opt_disguise --]
[-- Type: text/plain, Size: 762 bytes --]
Disguising costs a few cycles in the hot paths, so switch it off if
we are not debugging.
Signed-off-by: Christoph Lameter <clameter@sgi.com>
---
include/linux/percpu.h | 4 ++++
1 file changed, 4 insertions(+)
Index: linux-2.6/include/linux/percpu.h
===================================================================
--- linux-2.6.orig/include/linux/percpu.h 2007-10-31 16:40:14.892121256 -0700
+++ linux-2.6/include/linux/percpu.h 2007-10-31 16:41:00.907621059 -0700
@@ -33,7 +33,11 @@
#ifdef CONFIG_SMP
+#ifdef CONFIG_DEBUG_VM
#define __percpu_disguise(pdata) ((void *)~(unsigned long)(pdata))
+#else
+#define __percpu_disguide(pdata) ((void *)(pdata))
+#endif
/*
* Use this to get to a cpu's version of the per-cpu object dynamically
--
* [patch 4/7] Percpu: Add support for this_cpu_offset() to be able to create this_cpu_ptr()
2007-11-01 0:02 [patch 0/7] [RFC] SLUB: Improve allocpercpu to reduce per cpu access overhead Christoph Lameter
2007-11-01 0:02 ` [patch 3/7] Allocpercpu: Do __percpu_disguise() only if CONFIG_DEBUG_VM is set Christoph Lameter
@ 2007-11-01 0:02 ` Christoph Lameter
2007-11-01 0:02 ` [patch 5/7] SLUB: Use allocpercpu to allocate per cpu data instead of running our own per cpu allocator Christoph Lameter
From: Christoph Lameter @ 2007-11-01 0:02 UTC (permalink / raw)
To: akpm; +Cc: linux-arch, linux-kernel, Mathieu Desnoyers, Pekka Enberg
[-- Attachment #1: this_cpu --]
[-- Type: text/plain, Size: 6679 bytes --]
Support for this_cpu_ptr() is important for those arches that allow a faster
way to get to the per cpu area of the local processor.
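A usage sketch anticipating patch 7/7 (s->cpu_slab stands for any pointer
obtained from alloc_percpu(); the caller must keep the cpu pinned while the
pointer is in use):

        struct kmem_cache_cpu *c;

        preempt_disable();
        c = this_cpu_ptr(s->cpu_slab);  /* == s->cpu_slab + this_cpu_offset() */
        /* ... operate on this cpu's instance ... */
        preempt_enable();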
Signed-off-by: Christoph Lameter <clameter@sgi.com>
---
include/asm-generic/percpu.h | 4 ++++
include/asm-ia64/percpu.h | 3 +++
include/asm-powerpc/percpu.h | 3 +++
include/asm-s390/percpu.h | 4 ++++
include/asm-sparc64/percpu.h | 2 ++
include/asm-x86/percpu_32.h | 2 ++
include/asm-x86/percpu_64.h | 4 ++++
include/linux/percpu.h | 7 +++++++
8 files changed, 29 insertions(+)
Index: linux-2.6/include/linux/percpu.h
===================================================================
--- linux-2.6.orig/include/linux/percpu.h 2007-10-31 16:41:00.907621059 -0700
+++ linux-2.6/include/linux/percpu.h 2007-10-31 16:42:45.748121446 -0700
@@ -51,6 +51,13 @@
(__typeof__(ptr))(p + q); \
})
+#define this_cpu_ptr(ptr) \
+({ \
+ void *p = ptr; \
+ (__typeof__(ptr))(p + this_cpu_offset()); \
+})
+
+
extern void *__alloc_percpu(size_t size);
extern void percpu_free(void *__pdata);
Index: linux-2.6/include/asm-generic/percpu.h
===================================================================
--- linux-2.6.orig/include/asm-generic/percpu.h 2007-10-31 16:36:43.452121172 -0700
+++ linux-2.6/include/asm-generic/percpu.h 2007-10-31 16:42:45.748121446 -0700
@@ -26,6 +26,8 @@ extern unsigned long __per_cpu_offset[NR
#define __get_cpu_var(var) per_cpu(var, smp_processor_id())
#define __raw_get_cpu_var(var) per_cpu(var, raw_smp_processor_id())
+#define this_cpu_offset() __per_cpu_offset(raw_smp_processor_id())
+
/* A macro to avoid #include hell... */
#define percpu_modcopy(pcpudst, src, size) \
do { \
@@ -53,4 +55,6 @@ do { \
#define EXPORT_PER_CPU_SYMBOL(var) EXPORT_SYMBOL(per_cpu__##var)
#define EXPORT_PER_CPU_SYMBOL_GPL(var) EXPORT_SYMBOL_GPL(per_cpu__##var)
+#define this_cpu_offset() 0
+
#endif /* _ASM_GENERIC_PERCPU_H_ */
Index: linux-2.6/include/asm-ia64/percpu.h
===================================================================
--- linux-2.6.orig/include/asm-ia64/percpu.h 2007-10-31 16:36:43.460121335 -0700
+++ linux-2.6/include/asm-ia64/percpu.h 2007-10-31 16:42:45.748121446 -0700
@@ -51,6 +51,8 @@ extern unsigned long __per_cpu_offset[NR
/* Equal to __per_cpu_offset[smp_processor_id()], but faster to access: */
DECLARE_PER_CPU(unsigned long, local_per_cpu_offset);
+#define this_cpu_offset() __ia64_per_cpu_var(local_per_cpu_offset)
+
#define per_cpu(var, cpu) (*RELOC_HIDE(&per_cpu__##var, __per_cpu_offset[cpu]))
#define __get_cpu_var(var) (*RELOC_HIDE(&per_cpu__##var, __ia64_per_cpu_var(local_per_cpu_offset)))
#define __raw_get_cpu_var(var) (*RELOC_HIDE(&per_cpu__##var, __ia64_per_cpu_var(local_per_cpu_offset)))
@@ -65,6 +67,7 @@ extern void *per_cpu_init(void);
#define __get_cpu_var(var) per_cpu__##var
#define __raw_get_cpu_var(var) per_cpu__##var
#define per_cpu_init() (__phys_per_cpu_start)
+#define this_cpu_offset() 0
#endif /* SMP */
Index: linux-2.6/include/asm-powerpc/percpu.h
===================================================================
--- linux-2.6.orig/include/asm-powerpc/percpu.h 2007-10-31 16:36:43.464121161 -0700
+++ linux-2.6/include/asm-powerpc/percpu.h 2007-10-31 16:42:45.748121446 -0700
@@ -16,6 +16,8 @@
#define __my_cpu_offset() get_paca()->data_offset
#define per_cpu_offset(x) (__per_cpu_offset(x))
+#define this_cpu_offset() __my_cpu_offset()
+
/* Separate out the type, so (int[3], foo) works. */
#define DEFINE_PER_CPU(type, name) \
__attribute__((__section__(".data.percpu"))) __typeof__(type) per_cpu__##name
@@ -51,6 +53,7 @@ extern void setup_per_cpu_areas(void);
#define per_cpu(var, cpu) (*((void)(cpu), &per_cpu__##var))
#define __get_cpu_var(var) per_cpu__##var
#define __raw_get_cpu_var(var) per_cpu__##var
+#define this_cpu_offset() 0
#endif /* SMP */
Index: linux-2.6/include/asm-s390/percpu.h
===================================================================
--- linux-2.6.orig/include/asm-s390/percpu.h 2007-10-31 16:36:43.472121072 -0700
+++ linux-2.6/include/asm-s390/percpu.h 2007-10-31 16:42:45.779370925 -0700
@@ -51,6 +51,8 @@ extern unsigned long __per_cpu_offset[NR
#define per_cpu(var,cpu) __reloc_hide(var,__per_cpu_offset[cpu])
#define per_cpu_offset(x) (__per_cpu_offset[x])
+#define this_cpu_offset() S390_lowcore.percpu_offset
+
/* A macro to avoid #include hell... */
#define percpu_modcopy(pcpudst, src, size) \
do { \
@@ -71,6 +73,8 @@ do { \
#define __raw_get_cpu_var(var) __reloc_hide(var,0)
#define per_cpu(var,cpu) __reloc_hide(var,0)
+#define this_cpu_offset() 0
+
#endif /* SMP */
#define DECLARE_PER_CPU(type, name) extern __typeof__(type) per_cpu__##name
Index: linux-2.6/include/asm-sparc64/percpu.h
===================================================================
--- linux-2.6.orig/include/asm-sparc64/percpu.h 2007-10-31 16:36:43.480121400 -0700
+++ linux-2.6/include/asm-sparc64/percpu.h 2007-10-31 16:42:45.779370925 -0700
@@ -5,6 +5,8 @@
register unsigned long __local_per_cpu_offset asm("g5");
+#define this_cpu_offset() __local_per_cpu_offset
+
#ifdef CONFIG_SMP
#define setup_per_cpu_areas() do { } while (0)
Index: linux-2.6/include/asm-x86/percpu_32.h
===================================================================
--- linux-2.6.orig/include/asm-x86/percpu_32.h 2007-10-31 16:36:43.484121314 -0700
+++ linux-2.6/include/asm-x86/percpu_32.h 2007-10-31 16:42:45.779370925 -0700
@@ -72,6 +72,8 @@ DECLARE_PER_CPU(unsigned long, this_cpu_
RELOC_HIDE(&per_cpu__##var, x86_read_percpu(this_cpu_off)); \
}))
+#define this_cpu_offset() x86_read_percpu(this_cpu_off)
+
#define __get_cpu_var(var) __raw_get_cpu_var(var)
/* A macro to avoid #include hell... */
Index: linux-2.6/include/asm-x86/percpu_64.h
===================================================================
--- linux-2.6.orig/include/asm-x86/percpu_64.h 2007-10-31 16:36:43.492121152 -0700
+++ linux-2.6/include/asm-x86/percpu_64.h 2007-10-31 16:42:45.779370925 -0700
@@ -14,6 +14,8 @@
#define __per_cpu_offset(cpu) (cpu_pda(cpu)->data_offset)
#define __my_cpu_offset() read_pda(data_offset)
+#define this_cpu_offset() read_pda(data_offset)
+
#define per_cpu_offset(x) (__per_cpu_offset(x))
/* Separate out the type, so (int[3], foo) works. */
@@ -58,6 +60,8 @@ extern void setup_per_cpu_areas(void);
#define __get_cpu_var(var) per_cpu__##var
#define __raw_get_cpu_var(var) per_cpu__##var
+#define this_cpu_offset() 0
+
#endif /* SMP */
#define DECLARE_PER_CPU(type, name) extern __typeof__(type) per_cpu__##name
--
* [patch 5/7] SLUB: Use allocpercpu to allocate per cpu data instead of running our own per cpu allocator
2007-11-01 0:02 [patch 0/7] [RFC] SLUB: Improve allocpercpu to reduce per cpu access overhead Christoph Lameter
2007-11-01 0:02 ` [patch 4/7] Percpu: Add support for this_cpu_offset() to be able to create this_cpu_ptr() Christoph Lameter
@ 2007-11-01 0:02 ` Christoph Lameter
2007-11-01 0:02 ` [patch 6/7] SLUB: No need to cache kmem_cache data in kmem_cache_cpu anymore Christoph Lameter
From: Christoph Lameter @ 2007-11-01 0:02 UTC (permalink / raw)
To: akpm; +Cc: linux-arch, linux-kernel, Mathieu Desnoyers, Pekka Enberg
[-- Attachment #1: slub_alloc --]
[-- Type: text/plain, Size: 7240 bytes --]
Using allocpercpu removes the need for the per cpu pointer arrays in the kmem_cache struct.
These arrays could get quite big if we have to support systems with up to thousands of cpus.
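As a rough illustration of the savings: with NR_CPUS = 4096 and 8 byte
pointers the embedded array costs 32k per kmem_cache, while a single shared
cpu_slab pointer costs 8 bytes (illustrative figures; the actual NR_CPUS
depends on the configuration).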
The use of alloc_percpu means that:
1. The size of kmem_cache for SMP configuration shrinks since we will only
need 1 pointer instead of NR_CPUS. The same pointer can be used by all
processors. Reduces cache footprint of the allocator.
2. We can dynamically size kmem_cache according to the actual nodes in the
system meaning less memory overhead for configurations that may potentially
support up to 1k NUMA nodes.
3. We can remove the fiddling with allocating and releasing kmem_cache_cpu
structures when bringing up and shutting down cpus. The allocpercpu
logic will do it all for us.
4. Fastpath performance increases by another 20% vs. the earlier improvements.
Instead of a fastpath taking 40-50 cycles we are now in the 30-40 cycle range.
Signed-off-by: Christoph Lameter <clameter@sgi.com>
---
include/linux/slub_def.h | 11 ++--
mm/slub.c | 125 ++++-------------------------------------------
2 files changed, 18 insertions(+), 118 deletions(-)
Index: linux-2.6/include/linux/slub_def.h
===================================================================
--- linux-2.6.orig/include/linux/slub_def.h 2007-10-30 16:34:41.000000000 -0700
+++ linux-2.6/include/linux/slub_def.h 2007-10-31 09:23:26.000000000 -0700
@@ -34,6 +34,12 @@ struct kmem_cache_node {
* Slab cache management.
*/
struct kmem_cache {
+#ifdef CONFIG_SMP
+ /* Per cpu pointer usable for any cpu */
+ struct kmem_cache_cpu *cpu_slab;
+#else
+ struct kmem_cache_cpu cpu_slab;
+#endif
/* Used for retriving partial slabs etc */
unsigned long flags;
int size; /* The size of an object including meta data */
@@ -63,11 +69,6 @@ struct kmem_cache {
int defrag_ratio;
struct kmem_cache_node *node[MAX_NUMNODES];
#endif
-#ifdef CONFIG_SMP
- struct kmem_cache_cpu *cpu_slab[NR_CPUS];
-#else
- struct kmem_cache_cpu cpu_slab;
-#endif
};
/*
Index: linux-2.6/mm/slub.c
===================================================================
--- linux-2.6.orig/mm/slub.c 2007-10-30 22:52:24.000000000 -0700
+++ linux-2.6/mm/slub.c 2007-10-31 09:45:59.000000000 -0700
@@ -242,7 +242,7 @@ static inline struct kmem_cache_node *ge
static inline struct kmem_cache_cpu *get_cpu_slab(struct kmem_cache *s, int cpu)
{
#ifdef CONFIG_SMP
- return s->cpu_slab[cpu];
+ return percpu_ptr(s->cpu_slab, cpu);
#else
return &s->cpu_slab;
#endif
@@ -2032,119 +2032,25 @@ static void init_kmem_cache_node(struct
}
#ifdef CONFIG_SMP
-/*
- * Per cpu array for per cpu structures.
- *
- * The per cpu array places all kmem_cache_cpu structures from one processor
- * close together meaning that it becomes possible that multiple per cpu
- * structures are contained in one cacheline. This may be particularly
- * beneficial for the kmalloc caches.
- *
- * A desktop system typically has around 60-80 slabs. With 100 here we are
- * likely able to get per cpu structures for all caches from the array defined
- * here. We must be able to cover all kmalloc caches during bootstrap.
- *
- * If the per cpu array is exhausted then fall back to kmalloc
- * of individual cachelines. No sharing is possible then.
- */
-#define NR_KMEM_CACHE_CPU 100
-
-static DEFINE_PER_CPU(struct kmem_cache_cpu,
- kmem_cache_cpu)[NR_KMEM_CACHE_CPU];
-
-static DEFINE_PER_CPU(struct kmem_cache_cpu *, kmem_cache_cpu_free);
-static cpumask_t kmem_cach_cpu_free_init_once = CPU_MASK_NONE;
-
-static struct kmem_cache_cpu *alloc_kmem_cache_cpu(struct kmem_cache *s,
- int cpu, gfp_t flags)
-{
- struct kmem_cache_cpu *c = per_cpu(kmem_cache_cpu_free, cpu);
-
- if (c)
- per_cpu(kmem_cache_cpu_free, cpu) =
- (void *)c->freelist;
- else {
- /* Table overflow: So allocate ourselves */
- c = kmalloc_node(
- ALIGN(sizeof(struct kmem_cache_cpu), cache_line_size()),
- flags, cpu_to_node(cpu));
- if (!c)
- return NULL;
- }
-
- init_kmem_cache_cpu(s, c);
- return c;
-}
-
-static void free_kmem_cache_cpu(struct kmem_cache_cpu *c, int cpu)
-{
- if (c < per_cpu(kmem_cache_cpu, cpu) ||
- c > per_cpu(kmem_cache_cpu, cpu) + NR_KMEM_CACHE_CPU) {
- kfree(c);
- return;
- }
- c->freelist = (void *)per_cpu(kmem_cache_cpu_free, cpu);
- per_cpu(kmem_cache_cpu_free, cpu) = c;
-}
-
static void free_kmem_cache_cpus(struct kmem_cache *s)
{
- int cpu;
-
- for_each_online_cpu(cpu) {
- struct kmem_cache_cpu *c = get_cpu_slab(s, cpu);
-
- if (c) {
- s->cpu_slab[cpu] = NULL;
- free_kmem_cache_cpu(c, cpu);
- }
- }
+ percpu_free(s->cpu_slab);
}
static int alloc_kmem_cache_cpus(struct kmem_cache *s, gfp_t flags)
{
int cpu;
- for_each_online_cpu(cpu) {
- struct kmem_cache_cpu *c = get_cpu_slab(s, cpu);
+ s->cpu_slab = alloc_percpu(struct kmem_cache_cpu);
- if (c)
- continue;
+ if (!s->cpu_slab)
+ return 0;
- c = alloc_kmem_cache_cpu(s, cpu, flags);
- if (!c) {
- free_kmem_cache_cpus(s);
- return 0;
- }
- s->cpu_slab[cpu] = c;
- }
+ for_each_online_cpu(cpu)
+ init_kmem_cache_cpu(s, get_cpu_slab(s, cpu));
return 1;
}
-/*
- * Initialize the per cpu array.
- */
-static void init_alloc_cpu_cpu(int cpu)
-{
- int i;
-
- if (cpu_isset(cpu, kmem_cach_cpu_free_init_once))
- return;
-
- for (i = NR_KMEM_CACHE_CPU - 1; i >= 0; i--)
- free_kmem_cache_cpu(&per_cpu(kmem_cache_cpu, cpu)[i], cpu);
-
- cpu_set(cpu, kmem_cach_cpu_free_init_once);
-}
-
-static void __init init_alloc_cpu(void)
-{
- int cpu;
-
- for_each_online_cpu(cpu)
- init_alloc_cpu_cpu(cpu);
- }
-
#else
static inline void free_kmem_cache_cpus(struct kmem_cache *s) {}
static inline void init_alloc_cpu(void) {}
@@ -2974,8 +2880,6 @@ void __init kmem_cache_init(void)
int i;
int caches = 0;
- init_alloc_cpu();
-
#ifdef CONFIG_NUMA
/*
* Must first have the slab cache available for the allocations of the
@@ -3035,11 +2939,12 @@ void __init kmem_cache_init(void)
for (i = KMALLOC_SHIFT_LOW; i < PAGE_SHIFT; i++)
kmalloc_caches[i]. name =
kasprintf(GFP_KERNEL, "kmalloc-%d", 1 << i);
-
#ifdef CONFIG_SMP
register_cpu_notifier(&slab_notifier);
- kmem_size = offsetof(struct kmem_cache, cpu_slab) +
- nr_cpu_ids * sizeof(struct kmem_cache_cpu *);
+#endif
+#ifdef CONFIG_NUMA
+ kmem_size = offsetof(struct kmem_cache, node) +
+ nr_node_ids * sizeof(struct kmem_cache_node *);
#else
kmem_size = sizeof(struct kmem_cache);
#endif
@@ -3181,11 +3086,9 @@ static int __cpuinit slab_cpuup_callback
switch (action) {
case CPU_UP_PREPARE:
case CPU_UP_PREPARE_FROZEN:
- init_alloc_cpu_cpu(cpu);
down_read(&slub_lock);
list_for_each_entry(s, &slab_caches, list)
- s->cpu_slab[cpu] = alloc_kmem_cache_cpu(s, cpu,
- GFP_KERNEL);
+ init_kmem_cache_cpu(s, get_cpu_slab(s, cpu));
up_read(&slub_lock);
break;
@@ -3195,13 +3098,9 @@ static int __cpuinit slab_cpuup_callback
case CPU_DEAD_FROZEN:
down_read(&slub_lock);
list_for_each_entry(s, &slab_caches, list) {
- struct kmem_cache_cpu *c = get_cpu_slab(s, cpu);
-
local_irq_save(flags);
__flush_cpu_slab(s, cpu);
local_irq_restore(flags);
- free_kmem_cache_cpu(c, cpu);
- s->cpu_slab[cpu] = NULL;
}
up_read(&slub_lock);
break;
--
* [patch 6/7] SLUB: No need to cache kmem_cache data in kmem_cache_cpu anymore
2007-11-01 0:02 [patch 0/7] [RFC] SLUB: Improve allocpercpu to reduce per cpu access overhead Christoph Lameter
2007-11-01 0:02 ` [patch 5/7] SLUB: Use allocpercpu to allocate per cpu data instead of running our own per cpu allocator Christoph Lameter
@ 2007-11-01 0:02 ` Christoph Lameter
2007-11-01 0:02 ` [patch 7/7] SLUB: Optimize per cpu access on the local cpu using this_cpu_ptr() Christoph Lameter
From: Christoph Lameter @ 2007-11-01 0:02 UTC (permalink / raw)
To: akpm; +Cc: linux-arch, linux-kernel, Mathieu Desnoyers, Pekka Enberg
[-- Attachment #1: slub_reduce --]
[-- Type: text/plain, Size: 5877 bytes --]
Remove the fields in kmem_cache_cpu that were used to cache data from
kmem_cache back when the two were in different cachelines. The cacheline that
holds the per cpu pointer now also holds these values. We can cut the size of
kmem_cache_cpu down to almost half.
The get_freepointer() and set_freepointer() functions, which used to be
intended only for the slow path, are now also useful in the hot path since
access to the field no longer requires touching an additional cacheline.
This results in the freepointer being handled consistently for objects
throughout SLUB.
Signed-off-by: Christoph Lameter <clameter@sgi.com>
---
include/linux/slub_def.h | 3 --
mm/slub.c | 50 +++++++++++++++--------------------------------
2 files changed, 16 insertions(+), 37 deletions(-)
Index: linux-2.6/include/linux/slub_def.h
===================================================================
--- linux-2.6.orig/include/linux/slub_def.h 2007-10-31 12:57:00.131421982 -0700
+++ linux-2.6/include/linux/slub_def.h 2007-10-31 12:57:43.446922264 -0700
@@ -15,9 +15,6 @@ struct kmem_cache_cpu {
void **freelist;
struct page *page;
int node;
- unsigned int offset;
- unsigned int objsize;
- unsigned int objects;
};
struct kmem_cache_node {
Index: linux-2.6/mm/slub.c
===================================================================
--- linux-2.6.orig/mm/slub.c 2007-10-31 12:57:00.131421982 -0700
+++ linux-2.6/mm/slub.c 2007-10-31 12:57:43.458921888 -0700
@@ -282,13 +282,6 @@ static inline int check_valid_pointer(st
return 1;
}
-/*
- * Slow version of get and set free pointer.
- *
- * This version requires touching the cache lines of kmem_cache which
- * we avoid to do in the fast alloc free paths. There we obtain the offset
- * from the page struct.
- */
static inline void *get_freepointer(struct kmem_cache *s, void *object)
{
return *(void **)(object + s->offset);
@@ -1446,10 +1439,10 @@ static void deactivate_slab(struct kmem_
/* Retrieve object from cpu_freelist */
object = c->freelist;
- c->freelist = c->freelist[c->offset];
+ c->freelist = get_freepointer(s, c->freelist);
/* And put onto the regular freelist */
- object[c->offset] = page->freelist;
+ set_freepointer(s, object, page->freelist);
page->freelist = object;
page->inuse--;
}
@@ -1606,8 +1599,8 @@ load_freelist:
goto debug;
object = c->page->freelist;
- c->freelist = object[c->offset];
- c->page->inuse = c->objects;
+ c->freelist = get_freepointer(s, object);
+ c->page->inuse = s->objects;
c->page->freelist = c->page->end;
c->node = page_to_nid(c->page);
unlock_out:
@@ -1635,7 +1628,7 @@ debug:
goto another_slab;
c->page->inuse++;
- c->page->freelist = object[c->offset];
+ c->page->freelist = get_freepointer(s, object);
c->node = -1;
goto unlock_out;
}
@@ -1668,8 +1661,8 @@ static void __always_inline *slab_alloc(
}
break;
}
- } while (cmpxchg_local(&c->freelist, object, object[c->offset])
- != object);
+ } while (cmpxchg_local(&c->freelist, object,
+ get_freepointer(s, object)) != object);
put_cpu();
#else
unsigned long flags;
@@ -1685,13 +1678,13 @@ static void __always_inline *slab_alloc(
}
} else {
object = c->freelist;
- c->freelist = object[c->offset];
+ c->freelist = get_freepointer(s, object);
}
local_irq_restore(flags);
#endif
if (unlikely((gfpflags & __GFP_ZERO)))
- memset(object, 0, c->objsize);
+ memset(object, 0, s->objsize);
out:
return object;
}
@@ -1719,7 +1712,7 @@ EXPORT_SYMBOL(kmem_cache_alloc_node);
* handling required then we can return immediately.
*/
static void __slab_free(struct kmem_cache *s, struct page *page,
- void *x, void *addr, unsigned int offset)
+ void *x, void *addr)
{
void *prior;
void **object = (void *)x;
@@ -1735,7 +1728,8 @@ static void __slab_free(struct kmem_cach
if (unlikely(state & SLABDEBUG))
goto debug;
checks_ok:
- prior = object[offset] = page->freelist;
+ prior = page->freelist;
+ set_freepointer(s, object, prior);
page->freelist = object;
page->inuse--;
@@ -1817,10 +1811,10 @@ static void __always_inline slab_free(st
* since the freelist pointers are unique per slab.
*/
if (unlikely(page != c->page || c->node < 0)) {
- __slab_free(s, page, x, addr, c->offset);
+ __slab_free(s, page, x, addr);
break;
}
- object[c->offset] = freelist;
+ set_freepointer(s, object, freelist);
} while (cmpxchg_local(&c->freelist, freelist, object) != freelist);
put_cpu();
#else
@@ -1830,10 +1824,10 @@ static void __always_inline slab_free(st
debug_check_no_locks_freed(object, s->objsize);
c = get_cpu_slab(s, smp_processor_id());
if (likely(page == c->page && c->node >= 0)) {
- object[c->offset] = c->freelist;
+ set_freepointer(s, object, c->freelist);
c->freelist = object;
} else
- __slab_free(s, page, x, addr, c->offset);
+ __slab_free(s, page, x, addr);
local_irq_restore(flags);
#endif
@@ -2015,9 +2009,6 @@ static void init_kmem_cache_cpu(struct k
c->page = NULL;
c->freelist = (void *)PAGE_MAPPING_ANON;
c->node = 0;
- c->offset = s->offset / sizeof(void *);
- c->objsize = s->objsize;
- c->objects = s->objects;
}
static void init_kmem_cache_node(struct kmem_cache_node *n)
@@ -3027,21 +3018,12 @@ struct kmem_cache *kmem_cache_create(con
down_write(&slub_lock);
s = find_mergeable(size, align, flags, name, ctor);
if (s) {
- int cpu;
-
s->refcount++;
/*
* Adjust the object sizes so that we clear
* the complete object on kzalloc.
*/
s->objsize = max(s->objsize, (int)size);
-
- /*
- * And then we need to update the object size in the
- * per cpu structures
- */
- for_each_online_cpu(cpu)
- get_cpu_slab(s, cpu)->objsize = s->objsize;
s->inuse = max_t(int, s->inuse, ALIGN(size, sizeof(void *)));
up_write(&slub_lock);
if (sysfs_slab_alias(s, name))
--
* [patch 7/7] SLUB: Optimize per cpu access on the local cpu using this_cpu_ptr()
2007-11-01 0:02 [patch 0/7] [RFC] SLUB: Improve allocpercpu to reduce per cpu access overhead Christoph Lameter
2007-11-01 0:02 ` [patch 6/7] SLUB: No need to cache kmem_cache data in kmem_cache_cpu anymore Christoph Lameter
@ 2007-11-01 0:02 ` Christoph Lameter
2007-11-01 0:24 ` [patch 0/7] [RFC] SLUB: Improve allocpercpu to reduce per cpu access overhead David Miller
2007-11-01 7:17 ` Eric Dumazet
From: Christoph Lameter @ 2007-11-01 0:02 UTC (permalink / raw)
To: akpm; +Cc: linux-arch, linux-kernel, Mathieu Desnoyers, Pekka Enberg
[-- Attachment #1: slub_this --]
[-- Type: text/plain, Size: 2686 bytes --]
Use this_cpu_ptr to optimize access to the per cpu area in the fastpaths.
Signed-off-by: Christoph Lameter <clameter@sgi.com>
---
mm/slub.c | 27 +++++++++++++++++++--------
1 file changed, 19 insertions(+), 8 deletions(-)
Index: linux-2.6/mm/slub.c
===================================================================
--- linux-2.6.orig/mm/slub.c 2007-10-31 14:00:28.635673087 -0700
+++ linux-2.6/mm/slub.c 2007-10-31 14:01:29.803422492 -0700
@@ -248,6 +248,15 @@ static inline struct kmem_cache_cpu *get
#endif
}
+static inline struct kmem_cache_cpu *this_cpu_slab(struct kmem_cache *s)
+{
+#ifdef CONFIG_SMP
+ return this_cpu_ptr(s->cpu_slab);
+#else
+ return &s->cpu_slab;
+#endif
+}
+
/*
* The end pointer in a slab is special. It points to the first object in the
* slab but has bit 0 set to mark it.
@@ -1521,7 +1530,7 @@ static noinline unsigned long get_new_sl
if (!page)
return 0;
- *pc = c = get_cpu_slab(s, smp_processor_id());
+ *pc = c = this_cpu_slab(s);
if (c->page) {
/*
* Someone else populated the cpu_slab while we
@@ -1650,25 +1659,26 @@ static void __always_inline *slab_alloc(
struct kmem_cache_cpu *c;
#ifdef CONFIG_FAST_CMPXCHG_LOCAL
- c = get_cpu_slab(s, get_cpu());
+ preempt_disable();
+ c = this_cpu_slab(s);
do {
object = c->freelist;
if (unlikely(is_end(object) || !node_match(c, node))) {
object = __slab_alloc(s, gfpflags, node, addr, c);
if (unlikely(!object)) {
- put_cpu();
+ preempt_enable();
goto out;
}
break;
}
} while (cmpxchg_local(&c->freelist, object,
get_freepointer(s, object)) != object);
- put_cpu();
+ preempt_enable();
#else
unsigned long flags;
local_irq_save(flags);
- c = get_cpu_slab(s, smp_processor_id());
+ c = this_cpu_slab(s);
if (unlikely((is_end(c->freelist)) || !node_match(c, node))) {
object = __slab_alloc(s, gfpflags, node, addr, c);
@@ -1794,7 +1804,8 @@ static void __always_inline slab_free(st
#ifdef CONFIG_FAST_CMPXCHG_LOCAL
void **freelist;
- c = get_cpu_slab(s, get_cpu());
+ preempt_disable();
+ c = this_cpu_slab(s);
debug_check_no_locks_freed(object, s->objsize);
do {
freelist = c->freelist;
@@ -1816,13 +1827,13 @@ static void __always_inline slab_free(st
}
set_freepointer(s, object, freelist);
} while (cmpxchg_local(&c->freelist, freelist, object) != freelist);
- put_cpu();
+ preempt_enable();
#else
unsigned long flags;
local_irq_save(flags);
debug_check_no_locks_freed(object, s->objsize);
- c = get_cpu_slab(s, smp_processor_id());
+ c = this_cpu_slab(s);
if (likely(page == c->page && c->node >= 0)) {
set_freepointer(s, object, c->freelist);
c->freelist = object;
--
* Re: [patch 0/7] [RFC] SLUB: Improve allocpercpu to reduce per cpu access overhead
2007-11-01 0:02 [patch 0/7] [RFC] SLUB: Improve allocpercpu to reduce per cpu access overhead Christoph Lameter
2007-11-01 0:02 ` [patch 7/7] SLUB: Optimize per cpu access on the local cpu using this_cpu_ptr() Christoph Lameter
@ 2007-11-01 0:24 ` David Miller
2007-11-01 0:26 ` Christoph Lameter
2007-11-01 7:17 ` Eric Dumazet
From: David Miller @ 2007-11-01 0:24 UTC (permalink / raw)
To: clameter; +Cc: akpm, linux-arch, linux-kernel, mathieu.desnoyers, penberg
Are these patches against -mm or mainline?
I get a lot of rejects starting with patch 6 against
mainline and I really wanted to test them out on sparc64.
Thanks.
* Re: [patch 0/7] [RFC] SLUB: Improve allocpercpu to reduce per cpu access overhead
2007-11-01 0:24 ` [patch 0/7] [RFC] SLUB: Improve allocpercpu to reduce per cpu access overhead David Miller
@ 2007-11-01 0:26 ` Christoph Lameter
2007-11-01 0:27 ` David Miller
From: Christoph Lameter @ 2007-11-01 0:26 UTC (permalink / raw)
To: David Miller; +Cc: akpm, linux-arch, linux-kernel, mathieu.desnoyers, penberg
On Wed, 31 Oct 2007, David Miller wrote:
>
> Are these patches against -mm or mainline?
>
> I get a lot of rejects starting with patch 6 against
> mainline and I really wanted to test them out on sparc64.
Hmmm... They are against the current slab performance head (which is in mm
but it has not been released yet ;-).
Do
git pull git://git.kernel.org/pub/scm/linux/kernel/git/christoph/slab.git
performance
and then you should be able to apply these patches.
* Re: [patch 0/7] [RFC] SLUB: Improve allocpercpu to reduce per cpu access overhead
2007-11-01 0:26 ` Christoph Lameter
@ 2007-11-01 0:27 ` David Miller
2007-11-01 0:31 ` Christoph Lameter
From: David Miller @ 2007-11-01 0:27 UTC (permalink / raw)
To: clameter; +Cc: akpm, linux-arch, linux-kernel, mathieu.desnoyers, penberg
From: Christoph Lameter <clameter@sgi.com>
Date: Wed, 31 Oct 2007 17:26:16 -0700 (PDT)
> Do
>
> git pull git://git.kernel.org/pub/scm/linux/kernel/git/christoph/slab.git
> performance
>
> and then you should be able to apply these patches.
Thanks a lot Christoph.
* Re: [patch 0/7] [RFC] SLUB: Improve allocpercpu to reduce per cpu access overhead
2007-11-01 0:27 ` David Miller
@ 2007-11-01 0:31 ` Christoph Lameter
2007-11-01 0:51 ` David Miller
From: Christoph Lameter @ 2007-11-01 0:31 UTC (permalink / raw)
To: David Miller; +Cc: akpm, linux-arch, linux-kernel, mathieu.desnoyers, penberg
On Wed, 31 Oct 2007, David Miller wrote:
> > git pull git://git.kernel.org/pub/scm/linux/kernel/git/christoph/slab.git
> > performance
> >
> > and then you should be able to apply these patches.
>
> Thanks a lot Chrisoph.
Others may have the same issue.
git pull git://git.kernel.org/pub/scm/linux/kernel/git/christoph/slab.git allocpercpu
should get you the whole thing.
* Re: [patch 0/7] [RFC] SLUB: Improve allocpercpu to reduce per cpu access overhead
2007-11-01 0:31 ` Christoph Lameter
@ 2007-11-01 0:51 ` David Miller
2007-11-01 0:53 ` Christoph Lameter
From: David Miller @ 2007-11-01 0:51 UTC (permalink / raw)
To: clameter; +Cc: akpm, linux-arch, linux-kernel, mathieu.desnoyers, penberg
From: Christoph Lameter <clameter@sgi.com>
Date: Wed, 31 Oct 2007 17:31:12 -0700 (PDT)
> Others may have the same issue.
>
> git pull git://git.kernel.org/pub/scm/linux/kernel/git/christoph/slab.git allocpercpu
>
> should get you the whole thing.
This patch fixes build failures with DEBUG_VM disabled.
diff --git a/include/linux/percpu.h b/include/linux/percpu.h
index 4b167c0..d414703 100644
--- a/include/linux/percpu.h
+++ b/include/linux/percpu.h
@@ -36,7 +36,7 @@
#ifdef CONFIG_DEBUG_VM
#define __percpu_disguise(pdata) ((void *)~(unsigned long)(pdata))
#else
-#define __percpu_disguide(pdata) ((void *)(pdata))
+#define __percpu_disguise(pdata) ((void *)(pdata))
#endif
/*
* Re: [patch 0/7] [RFC] SLUB: Improve allocpercpu to reduce per cpu access overhead
2007-11-01 0:51 ` David Miller
@ 2007-11-01 0:53 ` Christoph Lameter
2007-11-01 1:00 ` David Miller
From: Christoph Lameter @ 2007-11-01 0:53 UTC (permalink / raw)
To: David Miller; +Cc: akpm, linux-arch, linux-kernel, mathieu.desnoyers, penberg
> This patch fixes build failures with DEBUG_VM disabled.
Well, there is more there. Last minute mods, sigh. With DEBUG_VM you likely
need this patch.
---
include/linux/percpu.h | 4 ++--
1 file changed, 2 insertions(+), 2 deletions(-)
Index: linux-2.6/include/linux/percpu.h
===================================================================
--- linux-2.6.orig/include/linux/percpu.h 2007-10-31 17:48:38.020499686 -0700
+++ linux-2.6/include/linux/percpu.h 2007-10-31 17:51:01.423372247 -0700
@@ -36,7 +36,7 @@
#ifdef CONFIG_DEBUG_VM
#define __percpu_disguise(pdata) ((void *)~(unsigned long)(pdata))
#else
-#define __percpu_disguide(pdata) ((void *)(pdata))
+#define __percpu_disguise(pdata) ((void *)(pdata))
#endif
/*
@@ -53,7 +53,7 @@
#define this_cpu_ptr(ptr) \
({ \
- void *p = ptr; \
+ void *p = __percpu_disguise(ptr); \
(__typeof__(ptr))(p + this_cpu_offset()); \
})
* Re: [patch 0/7] [RFC] SLUB: Improve allocpercpu to reduce per cpu access overhead
2007-11-01 0:53 ` Christoph Lameter
@ 2007-11-01 1:00 ` David Miller
2007-11-01 1:01 ` Christoph Lameter
From: David Miller @ 2007-11-01 1:00 UTC (permalink / raw)
To: clameter; +Cc: akpm, linux-arch, linux-kernel, mathieu.desnoyers, penberg
From: Christoph Lameter <clameter@sgi.com>
Date: Wed, 31 Oct 2007 17:53:23 -0700 (PDT)
> > This patch fixes build failures with DEBUG_VM disabled.
>
> Well there is more there. Last minute mods sigh. With DEBUG_VM you likely
> need this patch.
Without DEBUG_VM I get a loop of crashes shortly after SSHD
is started, I'll try to track it down.
* Re: [patch 0/7] [RFC] SLUB: Improve allocpercpu to reduce per cpu access overhead
2007-11-01 1:00 ` David Miller
@ 2007-11-01 1:01 ` Christoph Lameter
2007-11-01 1:09 ` David Miller
From: Christoph Lameter @ 2007-11-01 1:01 UTC (permalink / raw)
To: David Miller; +Cc: akpm, linux-arch, linux-kernel, mathieu.desnoyers, penberg
On Wed, 31 Oct 2007, David Miller wrote:
> Without DEBUG_VM I get a loop of crashes shortly after SSHD
> is started, I'll try to track it down.
Check how much per cpu memory is in use by
cat /proc/vmstat
Currently we have a 32k limit there.
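(The usage shows up there as the counter added in patch 1/7, a line of the
form "alloc_percpu_bytes <n>", which tracks the net number of bytes handed
out by cpu_alloc().)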
* Re: [patch 0/7] [RFC] SLUB: Improve allocpercpu to reduce per cpu access overhead
2007-11-01 1:01 ` Christoph Lameter
@ 2007-11-01 1:09 ` David Miller
2007-11-01 1:12 ` Christoph Lameter
0 siblings, 1 reply; 62+ messages in thread
From: David Miller @ 2007-11-01 1:09 UTC (permalink / raw)
To: clameter; +Cc: akpm, linux-arch, linux-kernel, mathieu.desnoyers, penberg
From: Christoph Lameter <clameter@sgi.com>
Date: Wed, 31 Oct 2007 18:01:34 -0700 (PDT)
> On Wed, 31 Oct 2007, David Miller wrote:
>
> > Without DEBUG_VM I get a loop of crashes shortly after SSHD
> > is started, I'll try to track it down.
>
> Check how much per cpu memory is in use by
>
> cat /proc/vmstat
>
> currently we have a 32k limit there.
It crashes when SSHD starts, the serial console GETTY hasn't
started up yet so I can't even log in to run those commands
Christoph.
All I can do now is bisect and then try to figure out what about the
guilty change might cause the problem.
This is on a 64-cpu sparc64 box, and fast cmpxchg local is not set, so
maybe it's one of the locking changes.
^ permalink raw reply [flat|nested] 62+ messages in thread
* Re: [patch 0/7] [RFC] SLUB: Improve allocpercpu to reduce per cpu access overhead
2007-11-01 1:09 ` David Miller
@ 2007-11-01 1:12 ` Christoph Lameter
2007-11-01 1:13 ` David Miller
0 siblings, 1 reply; 62+ messages in thread
From: Christoph Lameter @ 2007-11-01 1:12 UTC (permalink / raw)
To: David Miller; +Cc: akpm, linux-arch, linux-kernel, mathieu.desnoyers, penberg
On Wed, 31 Oct 2007, David Miller wrote:
> It crashes when SSHD starts, the serial console GETTY hasn't
> started up yet so I can't even log in to run those commands
> Christoph.
Hmmm... Bad.
> All I can do now is bisect and then try to figure out what about the
> guilty change might cause the problem.
Reverting the 7th patch should avoid using the sparc register that caches
the per cpu area offset? (I thought so, does it?)
^ permalink raw reply [flat|nested] 62+ messages in thread
* Re: [patch 0/7] [RFC] SLUB: Improve allocpercpu to reduce per cpu access overhead
2007-11-01 1:12 ` Christoph Lameter
@ 2007-11-01 1:13 ` David Miller
2007-11-01 1:21 ` Christoph Lameter
2007-11-01 4:16 ` Christoph Lameter
0 siblings, 2 replies; 62+ messages in thread
From: David Miller @ 2007-11-01 1:13 UTC (permalink / raw)
To: clameter; +Cc: akpm, linux-arch, linux-kernel, mathieu.desnoyers, penberg
From: Christoph Lameter <clameter@sgi.com>
Date: Wed, 31 Oct 2007 18:12:11 -0700 (PDT)
> On Wed, 31 Oct 2007, David Miller wrote:
>
> > All I can do now is bisect and then try to figure out what about the
> > guilty change might cause the problem.
>
> Reverting the 7th patch should avoid using the sparc register that caches
> the per cpu area offset? (I thought so, does it?)
Yes, that's right, %g5 holds the local cpu's per-cpu offset.
^ permalink raw reply [flat|nested] 62+ messages in thread
* Re: [patch 0/7] [RFC] SLUB: Improve allocpercpu to reduce per cpu access overhead
2007-11-01 1:13 ` David Miller
@ 2007-11-01 1:21 ` Christoph Lameter
2007-11-01 5:27 ` David Miller
2007-11-01 4:16 ` Christoph Lameter
1 sibling, 1 reply; 62+ messages in thread
From: Christoph Lameter @ 2007-11-01 1:21 UTC (permalink / raw)
To: David Miller; +Cc: akpm, linux-arch, linux-kernel, mathieu.desnoyers, penberg
On Wed, 31 Oct 2007, David Miller wrote:
> From: Christoph Lameter <clameter@sgi.com>
> Date: Wed, 31 Oct 2007 18:12:11 -0700 (PDT)
>
> > On Wed, 31 Oct 2007, David Miller wrote:
> >
> > > All I can do now is bisect and then try to figure out what about the
> > > guilty change might cause the problem.
> >
> > Reverting the 7th patch should avoid using the sparc register that caches
> > the per cpu area offset? (I thought so, does it?)
>
> Yes, that's right, %g5 holds the local cpu's per-cpu offset.
And if I add the address of a percpu variable then I get to the variable
for this cpu right?
^ permalink raw reply [flat|nested] 62+ messages in thread
* Re: [patch 0/7] [RFC] SLUB: Improve allocpercpu to reduce per cpu access overhead
2007-11-01 1:13 ` David Miller
2007-11-01 1:21 ` Christoph Lameter
@ 2007-11-01 4:16 ` Christoph Lameter
2007-11-01 5:38 ` David Miller
2007-11-01 7:01 ` David Miller
1 sibling, 2 replies; 62+ messages in thread
From: Christoph Lameter @ 2007-11-01 4:16 UTC (permalink / raw)
To: David Miller; +Cc: akpm, linux-arch, linux-kernel, mathieu.desnoyers, penberg
Hmmmm... Got this to run on an ia64 big iron. One problem is the sizing of
the pool. Somehow this needs to be dynamic.
Apply this fix on top of the others.
---
include/asm-ia64/page.h | 2 +-
include/asm-ia64/percpu.h | 9 ++++++---
mm/allocpercpu.c | 12 ++++++++++--
3 files changed, 17 insertions(+), 6 deletions(-)
Index: linux-2.6/mm/allocpercpu.c
===================================================================
--- linux-2.6.orig/mm/allocpercpu.c 2007-10-31 20:53:16.565486654 -0700
+++ linux-2.6/mm/allocpercpu.c 2007-10-31 21:00:27.553486484 -0700
@@ -28,7 +28,12 @@
/*
* Maximum allowed per cpu data per cpu
*/
+#ifdef CONFIG_NUMA
+#define PER_CPU_ALLOC_SIZE (32768 + MAX_NUMNODES * 512)
+#else
#define PER_CPU_ALLOC_SIZE 32768
+#endif
+
#define UNIT_SIZE sizeof(unsigned long long)
#define UNITS_PER_CPU (PER_CPU_ALLOC_SIZE / UNIT_SIZE)
@@ -37,7 +42,7 @@ enum unit_type { FREE, END, USED };
static u8 cpu_alloc_map[UNITS_PER_CPU] = { 1, };
static DEFINE_SPINLOCK(cpu_alloc_map_lock);
-static DEFINE_PER_CPU(int, cpu_area)[UNITS_PER_CPU];
+static DEFINE_PER_CPU(unsigned long long, cpu_area)[UNITS_PER_CPU];
#define CPU_DATA_OFFSET ((unsigned long)&per_cpu__cpu_area)
@@ -97,8 +102,11 @@ static void *cpu_alloc(unsigned long siz
while (start < UNITS_PER_CPU &&
cpu_alloc_map[start] != FREE)
start++;
- if (start == UNITS_PER_CPU)
+ if (start == UNITS_PER_CPU) {
+ spin_unlock(&cpu_alloc_map_lock);
+ printk(KERN_CRIT "Dynamic per cpu memory exhausted\n");
return NULL;
+ }
end = start + 1;
while (end < UNITS_PER_CPU && end - start < units &&
Index: linux-2.6/include/asm-ia64/page.h
===================================================================
--- linux-2.6.orig/include/asm-ia64/page.h 2007-10-31 20:53:16.573486483 -0700
+++ linux-2.6/include/asm-ia64/page.h 2007-10-31 20:56:19.372870091 -0700
@@ -44,7 +44,7 @@
#define PAGE_MASK (~(PAGE_SIZE - 1))
#define PAGE_ALIGN(addr) (((addr) + PAGE_SIZE - 1) & PAGE_MASK)
-#define PERCPU_PAGE_SHIFT 16 /* log2() of max. size of per-CPU area */
+#define PERCPU_PAGE_SHIFT 20 /* log2() of max. size of per-CPU area */
#define PERCPU_PAGE_SIZE (__IA64_UL_CONST(1) << PERCPU_PAGE_SHIFT)
Index: linux-2.6/include/asm-ia64/percpu.h
===================================================================
--- linux-2.6.orig/include/asm-ia64/percpu.h 2007-10-31 20:53:30.424553062 -0700
+++ linux-2.6/include/asm-ia64/percpu.h 2007-10-31 20:53:36.248486656 -0700
@@ -40,6 +40,12 @@
#endif
/*
+ * This will make per cpu access to the local area use the virtually mapped
+ * areas.
+ */
+#define this_cpu_offset() 0
+
+/*
* Pretty much a literal copy of asm-generic/percpu.h, except that percpu_modcopy() is an
* external routine, to avoid include-hell.
*/
@@ -51,8 +57,6 @@ extern unsigned long __per_cpu_offset[NR
/* Equal to __per_cpu_offset[smp_processor_id()], but faster to access: */
DECLARE_PER_CPU(unsigned long, local_per_cpu_offset);
-#define this_cpu_offset() __ia64_per_cpu_var(local_per_cpu_offset)
-
#define per_cpu(var, cpu) (*RELOC_HIDE(&per_cpu__##var, __per_cpu_offset[cpu]))
#define __get_cpu_var(var) (*RELOC_HIDE(&per_cpu__##var, __ia64_per_cpu_var(local_per_cpu_offset)))
#define __raw_get_cpu_var(var) (*RELOC_HIDE(&per_cpu__##var, __ia64_per_cpu_var(local_per_cpu_offset)))
@@ -67,7 +71,6 @@ extern void *per_cpu_init(void);
#define __get_cpu_var(var) per_cpu__##var
#define __raw_get_cpu_var(var) per_cpu__##var
#define per_cpu_init() (__phys_per_cpu_start)
-#define this_cpu_offset() 0
#endif /* SMP */
^ permalink raw reply [flat|nested] 62+ messages in thread
* Re: [patch 0/7] [RFC] SLUB: Improve allocpercpu to reduce per cpu access overhead
2007-11-01 1:21 ` Christoph Lameter
@ 2007-11-01 5:27 ` David Miller
0 siblings, 0 replies; 62+ messages in thread
From: David Miller @ 2007-11-01 5:27 UTC (permalink / raw)
To: clameter; +Cc: akpm, linux-arch, linux-kernel, mathieu.desnoyers, penberg
From: Christoph Lameter <clameter@sgi.com>
Date: Wed, 31 Oct 2007 18:21:02 -0700 (PDT)
> On Wed, 31 Oct 2007, David Miller wrote:
>
> > From: Christoph Lameter <clameter@sgi.com>
> > Date: Wed, 31 Oct 2007 18:12:11 -0700 (PDT)
> >
> > > On Wed, 31 Oct 2007, David Miller wrote:
> > >
> > > > All I can do now is bisect and then try to figure out what about the
> > > > guilty change might cause the problem.
> > >
> > > Reverting the 7th patch should avoid using the sparc register that caches
> > > the per cpu area offset? (I thought so, does it?)
> >
> > Yes, that's right, %g5 holds the local cpu's per-cpu offset.
>
> And if I add the address of a percpu variable then I get to the variable
> for this cpu right?
Right.
I bisected the crash down to:
[PATCH] newallocpercpu
^ permalink raw reply [flat|nested] 62+ messages in thread
* Re: [patch 0/7] [RFC] SLUB: Improve allocpercpu to reduce per cpu access overhead
2007-11-01 4:16 ` Christoph Lameter
@ 2007-11-01 5:38 ` David Miller
2007-11-01 7:01 ` David Miller
1 sibling, 0 replies; 62+ messages in thread
From: David Miller @ 2007-11-01 5:38 UTC (permalink / raw)
To: clameter; +Cc: akpm, linux-arch, linux-kernel, mathieu.desnoyers, penberg
From: Christoph Lameter <clameter@sgi.com>
Date: Wed, 31 Oct 2007 21:16:59 -0700 (PDT)
> /*
> * Maximum allowed per cpu data per cpu
> */
> +#ifdef CONFIG_NUMA
> +#define PER_CPU_ALLOC_SIZE (32768 + MAX_NUMNODES * 512)
> +#else
> #define PER_CPU_ALLOC_SIZE 32768
> +#endif
> +
Christoph, as Rusty found out years ago when he first wrote this code,
you cannot put hard limits on the alloc_percpu() allocations.
They can be done by anyone, any module, and since there was no limit
before you cannot reasonably add one now.
As just one of many examples, several networking devices use
alloc_percpu() for each instance they bring up. This alone can
request arbitrary amounts of per-cpu data.
Therefore, you'll need to do your optimization without imposing any
size limits.
^ permalink raw reply [flat|nested] 62+ messages in thread
* Re: [patch 0/7] [RFC] SLUB: Improve allocpercpu to reduce per cpu access overhead
2007-11-01 4:16 ` Christoph Lameter
2007-11-01 5:38 ` David Miller
@ 2007-11-01 7:01 ` David Miller
2007-11-01 9:14 ` David Miller
1 sibling, 1 reply; 62+ messages in thread
From: David Miller @ 2007-11-01 7:01 UTC (permalink / raw)
To: clameter; +Cc: akpm, linux-arch, linux-kernel, mathieu.desnoyers, penberg
From: Christoph Lameter <clameter@sgi.com>
Date: Wed, 31 Oct 2007 21:16:59 -0700 (PDT)
> Index: linux-2.6/mm/allocpercpu.c
> ===================================================================
> --- linux-2.6.orig/mm/allocpercpu.c 2007-10-31 20:53:16.565486654 -0700
> +++ linux-2.6/mm/allocpercpu.c 2007-10-31 21:00:27.553486484 -0700
...
> @@ -37,7 +42,7 @@ enum unit_type { FREE, END, USED };
>
> static u8 cpu_alloc_map[UNITS_PER_CPU] = { 1, };
> static DEFINE_SPINLOCK(cpu_alloc_map_lock);
> -static DEFINE_PER_CPU(int, cpu_area)[UNITS_PER_CPU];
> +static DEFINE_PER_CPU(unsigned long long, cpu_area)[UNITS_PER_CPU];
>
> #define CPU_DATA_OFFSET ((unsigned long)&per_cpu__cpu_area)
>
This hunk helped the sparc64 looping OOPS I was getting, but cpus hang
in some other fashion soon afterwards.
I'll try to debug this some more later, I've dumped enough time into
this already :-)
^ permalink raw reply [flat|nested] 62+ messages in thread
* Re: [patch 0/7] [RFC] SLUB: Improve allocpercpu to reduce per cpu access overhead
2007-11-01 0:02 [patch 0/7] [RFC] SLUB: Improve allocpercpu to reduce per cpu access overhead Christoph Lameter
` (7 preceding siblings ...)
2007-11-01 0:24 ` [patch 0/7] [RFC] SLUB: Improve allocpercpu to reduce per cpu access overhead David Miller
@ 2007-11-01 7:17 ` Eric Dumazet
2007-11-01 7:57 ` David Miller
2007-11-01 12:57 ` Christoph Lameter
8 siblings, 2 replies; 62+ messages in thread
From: Eric Dumazet @ 2007-11-01 7:17 UTC (permalink / raw)
To: Christoph Lameter
Cc: akpm, linux-arch, linux-kernel, Mathieu Desnoyers, Pekka Enberg
Christoph Lameter wrote:
> This patch increases the speed of the SLUB fastpath by
> improving the per cpu allocator and makes it usable for SLUB.
>
> Currently allocpercpu manages arrays of pointer to per cpu objects.
> This means that is has to allocate the arrays and then populate them
> as needed with objects. Although these objects are called per cpu
> objects they cannot be handled in the same way as per cpu objects
> by adding the per cpu offset of the respective cpu.
>
> The patch here changes that. We create a small memory pool in the
> percpu area and allocate from there if alloc per cpu is called.
> As a result we do not need the per cpu pointer arrays for each
> object. This reduces memory usage and also the cache foot print
> of allocpercpu users. Also the per cpu objects for a single processor
> are tightly packed next to each other decreasing cache footprint
> even further and making it possible to access multiple objects
> in the same cacheline.
>
> SLUB has the same mechanism implemented. After fixing up the
> alloccpu stuff we throw the SLUB method out and use the new
> allocpercpu handling. Then we optimize allocpercpu addressing
> by adding a new function
>
> this_cpu_ptr()
>
> that allows the determination of the per cpu pointer for the
> current processor in an more efficient way on many platforms.
>
> This increases the speed of SLUB (and likely other kernel subsystems
> that benefit from the allocpercpu enhancements):
>
>
> SLAB SLUB SLUB+ SLUB-o SLUB-a
> 8 96 86 45 44 38 3 *
> 16 84 92 49 48 43 2 *
> 32 84 106 61 59 53 +++
> 64 102 129 82 88 75 ++
> 128 147 226 188 181 176 -
> 256 200 248 207 285 204 =
> 512 300 301 260 209 250 +
> 1024 416 440 398 264 391 ++
> 2048 720 542 530 390 511 +++
> 4096 1254 342 342 336 376 3 *
>
> alloc/free test
> SLAB SLUB SLUB+ SLUB-o SLUB-a
> 137-146 151 68-72 68-74 56-58 3 *
>
> Note: The per cpu optimization are only half way there because of the screwed
> up way that x86_64 handles its cpu area that causes addditional cycles to be
> spend by retrieving a pointer from memory and adding it to the address.
> The i386 code is much less cycle intensive being able to get to per cpu
> data using a segment prefix and if we can get that to work on x86_64
> then we may be able to get the cycle count for the fastpath down to 20-30
> cycles.
>
Really sounds good, Christoph, and not only for SLUB. So I guess the 32k limit
is not enough, because many things would use per_cpu if only per_cpu were
reasonably fast (i.e. not so many dereferences).
I think this question already came up in the past and Linus already answered
it, but I'll ask it again: what about VM games with modern cpus (64-bit
arches)?
Say we reserve on x86_64 a really huge (2^32 bytes) area, and change the VM
layout so that each cpu maps its own per_cpu area into it, so that the local
per_cpu data sits at the same virtual address on each cpu. Then we don't need
a segment prefix nor the addition of a 'per_cpu offset'. No need to write
special asm functions to read/write/increment per_cpu data, and gcc could use
its normal rules for optimizations.
We would only need to add the "per_cpu offset" to get at the data for a given
cpu.
^ permalink raw reply [flat|nested] 62+ messages in thread
* Re: [patch 1/7] allocpercpu: Make it a true per cpu allocator by allocating from a per cpu array
2007-11-01 0:02 ` [patch 1/7] allocpercpu: Make it a true per cpu allocator by allocating from a per cpu array Christoph Lameter
@ 2007-11-01 7:24 ` Eric Dumazet
2007-11-01 12:59 ` Christoph Lameter
0 siblings, 1 reply; 62+ messages in thread
From: Eric Dumazet @ 2007-11-01 7:24 UTC (permalink / raw)
To: Christoph Lameter
Cc: akpm, linux-arch, linux-kernel, Mathieu Desnoyers, Pekka Enberg
Christoph Lameter wrote:
> +
> +enum unit_type { FREE, END, USED };
> +
> +static u8 cpu_alloc_map[UNITS_PER_CPU] = { 1, };
You mean END here instead of 1 :)
> +/*
> + * Allocate an object of a certain size
> + *
> + * Returns a per cpu pointer that must not be directly used.
> + */
> +static void *cpu_alloc(unsigned long size)
> +{
We might need to give an alignment constraint here. Some per_cpu users would
like to get a 64 byte zone, sitting in one cache line and not two :)
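Something along these lines (a rough sketch only, reusing the cpu_alloc_map /
UNIT_SIZE layout from patch 1/7; cpu_alloc_aligned() is a hypothetical name
and the map marking/locking are left out) would be one way to honor such a
constraint:

/*
 * Find a run of free units whose start satisfies the requested
 * alignment.  Marking the units USED/END and taking the map lock
 * would be done exactly as in cpu_alloc().  True cacheline alignment
 * also requires the per cpu cpu_area array itself to be cacheline
 * aligned.
 */
static void *cpu_alloc_aligned(unsigned long size, unsigned long align)
{
	unsigned long units = DIV_ROUND_UP(size, UNIT_SIZE);
	unsigned long step = align > UNIT_SIZE ? align / UNIT_SIZE : 1;
	unsigned long start, i;

	for (start = 0; start + units <= UNITS_PER_CPU; start += step) {
		for (i = 0; i < units; i++)
			if (cpu_alloc_map[start + i] != FREE)
				break;
		if (i == units)
			return (void *)(CPU_DATA_OFFSET + start * UNIT_SIZE);
	}
	return NULL;	/* no suitably aligned run left in the pool */
}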
^ permalink raw reply [flat|nested] 62+ messages in thread
* Re: [patch 3/7] Allocpercpu: Do __percpu_disguise() only if CONFIG_DEBUG_VM is set
2007-11-01 0:02 ` [patch 3/7] Allocpercpu: Do __percpu_disguise() only if CONFIG_DEBUG_VM is set Christoph Lameter
@ 2007-11-01 7:25 ` Eric Dumazet
0 siblings, 0 replies; 62+ messages in thread
From: Eric Dumazet @ 2007-11-01 7:25 UTC (permalink / raw)
To: Christoph Lameter
Cc: akpm, linux-arch, linux-kernel, Mathieu Desnoyers, Pekka Enberg
Christoph Lameter wrote:
> Disguising costs a few cycles in the hot paths. So switch it off if
> we are not debugging.
>
> Signed-off-by: Christoph Lameter <clameter@sgi.com>
>
> ---
> include/linux/percpu.h | 4 ++++
> 1 file changed, 4 insertions(+)
>
> Index: linux-2.6/include/linux/percpu.h
> ===================================================================
> --- linux-2.6.orig/include/linux/percpu.h 2007-10-31 16:40:14.892121256 -0700
> +++ linux-2.6/include/linux/percpu.h 2007-10-31 16:41:00.907621059 -0700
> @@ -33,7 +33,11 @@
>
> #ifdef CONFIG_SMP
>
> +#ifdef CONFIG_DEBUG_VM
> #define __percpu_disguise(pdata) ((void *)~(unsigned long)(pdata))
> +#else
> +#define __percpu_disguide(pdata) ((void *)(pdata))
> +#endif
Yes, good idea, but a little typo here :)
^ permalink raw reply [flat|nested] 62+ messages in thread
* Re: [patch 0/7] [RFC] SLUB: Improve allocpercpu to reduce per cpu access overhead
2007-11-01 7:17 ` Eric Dumazet
@ 2007-11-01 7:57 ` David Miller
2007-11-01 13:01 ` Christoph Lameter
2007-11-01 12:57 ` Christoph Lameter
1 sibling, 1 reply; 62+ messages in thread
From: David Miller @ 2007-11-01 7:57 UTC (permalink / raw)
To: dada1; +Cc: clameter, akpm, linux-arch, linux-kernel, mathieu.desnoyers,
penberg
From: Eric Dumazet <dada1@cosmosbay.com>
Date: Thu, 01 Nov 2007 08:17:58 +0100
> Say we reserve on x86_64 a really huge (2^32 bytes) area, and change
> the VM layout so that each cpu maps its own per_cpu area into it, so
> that the local per_cpu data sits at the same virtual address on
> each cpu.
This is a mechanism used partially on IA64 already.
I think you have to be very careful, and you can only use this per-cpu
fixed virtual address area in extremely limited cases.
The reason is, I think the address matters, consider list heads, for
example.
So you couldn't do:
list_add(&obj->list, &per_cpu_ptr(list_head));
and use that per-cpu fixed virtual address.
IA64 seems to use it universally for every __get_cpu_var()
access, so maybe it works out somehow :-)))
I guess if list modifications by remote cpus are disallowed, it would
work (list traversal works because using the fixed virtual address as
the list head sentinel is OK), but that is an extremely fragile
assumption to base the entire mechanism upon.
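To make the rule concrete, here is a sketch (illustrative only; work_list and
the two helpers are hypothetical names, locking omitted) of what would and
would not be safe under such a fixed-virtual-address scheme:

#include <linux/list.h>
#include <linux/percpu.h>

struct work_item {
	struct list_head list;
};

static DEFINE_PER_CPU(struct list_head, work_list);

/*
 * Under a scheme where the local per cpu area is always mapped at one
 * fixed virtual address, &__get_cpu_var(work_list) resolves to a
 * different physical list depending on which cpu evaluates it.
 */
static void queue_local(struct work_item *w)
{
	/* Fine: we only touch the list owned by the running cpu. */
	list_add(&w->list, &__get_cpu_var(work_list));
}

static void queue_on(struct work_item *w, int cpu)
{
	/*
	 * The fixed alias must not be used here: it would name the
	 * caller's own list, not cpu's.  Only a real, offset-based
	 * address such as per_cpu() provides identifies the target
	 * list unambiguously.
	 */
	list_add(&w->list, &per_cpu(work_list, cpu));
}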
^ permalink raw reply [flat|nested] 62+ messages in thread
* Re: [patch 0/7] [RFC] SLUB: Improve allocpercpu to reduce per cpu access overhead
2007-11-01 7:01 ` David Miller
@ 2007-11-01 9:14 ` David Miller
2007-11-01 13:03 ` Christoph Lameter
0 siblings, 1 reply; 62+ messages in thread
From: David Miller @ 2007-11-01 9:14 UTC (permalink / raw)
To: clameter; +Cc: akpm, linux-arch, linux-kernel, mathieu.desnoyers, penberg
From: David Miller <davem@davemloft.net>
Date: Thu, 01 Nov 2007 00:01:18 -0700 (PDT)
> From: Christoph Lameter <clameter@sgi.com>
> Date: Wed, 31 Oct 2007 21:16:59 -0700 (PDT)
>
> > Index: linux-2.6/mm/allocpercpu.c
> > ===================================================================
> > --- linux-2.6.orig/mm/allocpercpu.c 2007-10-31 20:53:16.565486654 -0700
> > +++ linux-2.6/mm/allocpercpu.c 2007-10-31 21:00:27.553486484 -0700
> ...
> > @@ -37,7 +42,7 @@ enum unit_type { FREE, END, USED };
> >
> > static u8 cpu_alloc_map[UNITS_PER_CPU] = { 1, };
> > static DEFINE_SPINLOCK(cpu_alloc_map_lock);
> > -static DEFINE_PER_CPU(int, cpu_area)[UNITS_PER_CPU];
> > +static DEFINE_PER_CPU(unsigned long long, cpu_area)[UNITS_PER_CPU];
> >
> > #define CPU_DATA_OFFSET ((unsigned long)&per_cpu__cpu_area)
> >
>
> This hunk helped the sparc64 looping OOPS I was getting, but cpus hang
> in some other fashion soon afterwards.
And if I bump PER_CPU_ALLOC_SIZE up to 128K it seems to mostly work.
You'll definitely need to make this work dynamically somehow.
^ permalink raw reply [flat|nested] 62+ messages in thread
* Re: [patch 0/7] [RFC] SLUB: Improve allocpercpu to reduce per cpu access overhead
2007-11-01 7:17 ` Eric Dumazet
2007-11-01 7:57 ` David Miller
@ 2007-11-01 12:57 ` Christoph Lameter
2007-11-01 21:28 ` David Miller
1 sibling, 1 reply; 62+ messages in thread
From: Christoph Lameter @ 2007-11-01 12:57 UTC (permalink / raw)
To: Eric Dumazet
Cc: akpm, linux-arch, linux-kernel, Mathieu Desnoyers, Pekka Enberg
On Thu, 1 Nov 2007, Eric Dumazet wrote:
> I think this question already came up in the past and Linus already answered
> it, but I'll ask it again: what about VM games with modern cpus (64-bit
> arches)?
>
> Say we reserve on x86_64 a really huge (2^32 bytes) area, and change the VM
> layout so that each cpu maps its own per_cpu area into it, so that the local
> per_cpu data sits at the same virtual address on each cpu. Then we don't need
> a segment prefix nor the addition of a 'per_cpu offset'. No need to write
> special asm functions to read/write/increment per_cpu data, and gcc could use
> its normal rules for optimizations.
>
> We would only need to add the "per_cpu offset" to get at the data for a given
> cpu.
That is basically what IA64 is doing, but it is not usable because you would
have addresses that mean different things on different cpus. List heads,
for example, require back pointers. If you put a list head into such a per
cpu area then you may corrupt another cpu's per cpu area.
^ permalink raw reply [flat|nested] 62+ messages in thread
* Re: [patch 1/7] allocpercpu: Make it a true per cpu allocator by allocating from a per cpu array
2007-11-01 7:24 ` Eric Dumazet
@ 2007-11-01 12:59 ` Christoph Lameter
0 siblings, 0 replies; 62+ messages in thread
From: Christoph Lameter @ 2007-11-01 12:59 UTC (permalink / raw)
To: Eric Dumazet
Cc: akpm, linux-arch, linux-kernel, Mathieu Desnoyers, Pekka Enberg
[-- Attachment #1: Type: TEXT/PLAIN, Size: 769 bytes --]
On Thu, 1 Nov 2007, Eric Dumazet wrote:
> Christoph Lameter wrote:
> > +
> > +enum unit_type { FREE, END, USED };
> > +
> > +static u8 cpu_alloc_map[UNITS_PER_CPU] = { 1, };
>
> You mean END here instead of 1 :)
Sigh. A leftover. This can be removed.
> > +/*
> > + * Allocate an object of a certain size
> > + *
> > + * Returns a per cpu pointer that must not be directly used.
> > + */
> > +static void *cpu_alloc(unsigned long size)
> > +{
>
> We might need to give an alignment constraint here. Some per_cpu users would
> like to get a 64 byte zone, sitting in one cache line and not two :)
Well not sure about that. Alignment is mostly useful on SMP with cacheline
contention. This is a per cpu area that should not be contended.
^ permalink raw reply [flat|nested] 62+ messages in thread
* Re: [patch 0/7] [RFC] SLUB: Improve allocpercpu to reduce per cpu access overhead
2007-11-01 7:57 ` David Miller
@ 2007-11-01 13:01 ` Christoph Lameter
2007-11-01 21:25 ` David Miller
0 siblings, 1 reply; 62+ messages in thread
From: Christoph Lameter @ 2007-11-01 13:01 UTC (permalink / raw)
To: David Miller
Cc: dada1, akpm, linux-arch, linux-kernel, mathieu.desnoyers, penberg
On Thu, 1 Nov 2007, David Miller wrote:
> IA64 seems to use it universally for every __get_cpu_var()
> access, so maybe it works out somehow :-)))
IA64 does not do that. It adds the local cpu offset
#define __get_cpu_var(var) (*RELOC_HIDE(&per_cpu__##var,
__ia64_per_cpu_var(local_per_cpu_offset)))
#define __raw_get_cpu_var(var) (*RELOC_HIDE(&per_cpu__##var,
__ia64_per_cpu_var(local_per_cpu_offset)))
^ permalink raw reply [flat|nested] 62+ messages in thread
* Re: [patch 0/7] [RFC] SLUB: Improve allocpercpu to reduce per cpu access overhead
2007-11-01 9:14 ` David Miller
@ 2007-11-01 13:03 ` Christoph Lameter
2007-11-01 21:29 ` David Miller
0 siblings, 1 reply; 62+ messages in thread
From: Christoph Lameter @ 2007-11-01 13:03 UTC (permalink / raw)
To: David Miller; +Cc: akpm, linux-arch, linux-kernel, mathieu.desnoyers, penberg
On Thu, 1 Nov 2007, David Miller wrote:
> > This hunk helped the sparc64 looping OOPS I was getting, but cpus hang
> > in some other fashion soon afterwards.
>
> And if I bump PER_CPU_ALLOC_SIZE up to 128K it seems to mostly work.
Good....
> You'll definitely need to make this work dynamically somehow.
Obviously. Any ideas how?
I can probably calculate the size based on the number of online nodes when
the per cpu areas are setup. But the setup is done before we even parse
command line arguments. That would still mean a fixed size after bootup.
In order to make it truly dynamic we would have to virtually map the area.
vmap? But that reduces performance.
^ permalink raw reply [flat|nested] 62+ messages in thread
* Re: [patch 0/7] [RFC] SLUB: Improve allocpercpu to reduce per cpu access overhead
2007-11-01 13:01 ` Christoph Lameter
@ 2007-11-01 21:25 ` David Miller
0 siblings, 0 replies; 62+ messages in thread
From: David Miller @ 2007-11-01 21:25 UTC (permalink / raw)
To: clameter
Cc: dada1, akpm, linux-arch, linux-kernel, mathieu.desnoyers, penberg
From: Christoph Lameter <clameter@sgi.com>
Date: Thu, 1 Nov 2007 06:01:14 -0700 (PDT)
> On Thu, 1 Nov 2007, David Miller wrote:
>
> > IA64 seems to use it universally for every __get_cpu_var()
> > access, so maybe it works out somehow :-)))
>
> IA64 does not do that. It adds the local cpu offset
>
> #define __get_cpu_var(var) (*RELOC_HIDE(&per_cpu__##var,
> __ia64_per_cpu_var(local_per_cpu_offset)))
> #define __raw_get_cpu_var(var) (*RELOC_HIDE(&per_cpu__##var,
> __ia64_per_cpu_var(local_per_cpu_offset)))
Oh I see, it's the offset itself which is accessed at the fixed
virtual address slot.
^ permalink raw reply [flat|nested] 62+ messages in thread
* Re: [patch 0/7] [RFC] SLUB: Improve allocpercpu to reduce per cpu access overhead
2007-11-01 12:57 ` Christoph Lameter
@ 2007-11-01 21:28 ` David Miller
2007-11-01 22:11 ` Christoph Lameter
0 siblings, 1 reply; 62+ messages in thread
From: David Miller @ 2007-11-01 21:28 UTC (permalink / raw)
To: clameter
Cc: dada1, akpm, linux-arch, linux-kernel, mathieu.desnoyers, penberg
From: Christoph Lameter <clameter@sgi.com>
Date: Thu, 1 Nov 2007 05:57:12 -0700 (PDT)
> That is basically what IA64 is doing, but it is not usable because you would
> have addresses that mean different things on different cpus. List heads,
> for example, require back pointers. If you put a list head into such a per
> cpu area then you may corrupt another cpu's per cpu area.
Indeed, but as I pointed out in another mail it actually works if you
set some rules:
1) List insert and delete is only allowed on local CPU lists.
2) List traversal is allowed on remote CPU lists.
I bet we could get all of the per-cpu users to abide by this
rule if we wanted to.
The remaining issue with accessing per-cpu areas at multiple virtual
addresses is D-cache aliasing.
^ permalink raw reply [flat|nested] 62+ messages in thread
* Re: [patch 0/7] [RFC] SLUB: Improve allocpercpu to reduce per cpu access overhead
2007-11-01 13:03 ` Christoph Lameter
@ 2007-11-01 21:29 ` David Miller
2007-11-01 22:15 ` Christoph Lameter
0 siblings, 1 reply; 62+ messages in thread
From: David Miller @ 2007-11-01 21:29 UTC (permalink / raw)
To: clameter; +Cc: akpm, linux-arch, linux-kernel, mathieu.desnoyers, penberg
From: Christoph Lameter <clameter@sgi.com>
Date: Thu, 1 Nov 2007 06:03:44 -0700 (PDT)
> In order to make it truly dynamic we would have to virtually map the
> area. vmap? But that reduces performance.
But it would still be faster than the double-indirection we do now,
right?
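Roughly, the two access paths being compared look like this (a from-memory
sketch, not code quoted from the tree); a vmap-backed area keeps the
arithmetic of the second form and only changes the mapping behind the final
load:

/* Today: the object pointer lives in a per-object array, so every
 * access pays a dependent load before the real one. */
static inline void *old_style_ptr(void **cpu_ptrs, int cpu)
{
	return cpu_ptrs[cpu];
}

/* Proposed: one base plus the executing cpu's offset.  A vmap'ed
 * backing may add a TLB miss but no extra pointer chase. */
static inline void *new_style_ptr(void *base, unsigned long cpu_offset)
{
	return base + cpu_offset;
}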
^ permalink raw reply [flat|nested] 62+ messages in thread
* Re: [patch 0/7] [RFC] SLUB: Improve allocpercpu to reduce per cpu access overhead
2007-11-01 21:28 ` David Miller
@ 2007-11-01 22:11 ` Christoph Lameter
2007-11-01 22:14 ` David Miller
0 siblings, 1 reply; 62+ messages in thread
From: Christoph Lameter @ 2007-11-01 22:11 UTC (permalink / raw)
To: David Miller
Cc: dada1, akpm, linux-arch, linux-kernel, mathieu.desnoyers, penberg
On Thu, 1 Nov 2007, David Miller wrote:
> The remaining issue with accessing per-cpu areas at multiple virtual
> addresses is D-cache aliasing.
But that is not an issue for physically mapped caches.
^ permalink raw reply [flat|nested] 62+ messages in thread
* Re: [patch 0/7] [RFC] SLUB: Improve allocpercpu to reduce per cpu access overhead
2007-11-01 22:11 ` Christoph Lameter
@ 2007-11-01 22:14 ` David Miller
2007-11-01 22:16 ` Christoph Lameter
0 siblings, 1 reply; 62+ messages in thread
From: David Miller @ 2007-11-01 22:14 UTC (permalink / raw)
To: clameter
Cc: dada1, akpm, linux-arch, linux-kernel, mathieu.desnoyers, penberg
From: Christoph Lameter <clameter@sgi.com>
Date: Thu, 1 Nov 2007 15:11:41 -0700 (PDT)
> On Thu, 1 Nov 2007, David Miller wrote:
>
> > The remaining issue with accessing per-cpu areas at multiple virtual
> > addresses is D-cache aliasing.
>
> But that is not an issue for physically mapped caches.
Right but I'd like to use this on sparc64 which has L1 D-cache
aliasing on some chips :-)
^ permalink raw reply [flat|nested] 62+ messages in thread
* Re: [patch 0/7] [RFC] SLUB: Improve allocpercpu to reduce per cpu access overhead
2007-11-01 21:29 ` David Miller
@ 2007-11-01 22:15 ` Christoph Lameter
2007-11-01 22:38 ` David Miller
0 siblings, 1 reply; 62+ messages in thread
From: Christoph Lameter @ 2007-11-01 22:15 UTC (permalink / raw)
To: David Miller; +Cc: akpm, linux-arch, linux-kernel, mathieu.desnoyers, penberg
On Thu, 1 Nov 2007, David Miller wrote:
> From: Christoph Lameter <clameter@sgi.com>
> Date: Thu, 1 Nov 2007 06:03:44 -0700 (PDT)
>
> > In order to make it truly dynamic we would have to virtually map the
> > area. vmap? But that reduces performance.
>
> But it would still be faster than the double-indirection we do now,
> right?
I think I have an idea how to do this. It's a bit x86_64-specific, but here
it goes.
We define a virtual area of NR_CPUS * 2M areas that are each mapped by a
PMD. That means we have a fixed virtual address for each cpus per cpu
area.
First cpu is at PER_CPU_START
Second cpu is at PER_CPU_START + 2M
So the per cpu area for cpu n is easily calculated using
PER_CPU_START + (cpu << 21)
without any lookups.
On bootup we allocate the 2M pages.
After boot is complete we allow the reduction of the size of the per cpu
areas . Lets say we only need 128k per cpu. Then the remaining pages will
be returned to the page allocator.
We create some sysfs thingy where one can see the current reserves of per
cpu storage. If one wants to reduce memory then one can write something to
that to return the remainder of the memory.
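A sketch of the resulting address computation (names and constants here are
illustrative only, not from a posted patch; PER_CPU_START stands for the base
of the reserved window):

#define PER_CPU_AREA_SHIFT	21		/* one 2MB PMD mapping per cpu */
#define PER_CPU_AREA_SIZE	(1UL << PER_CPU_AREA_SHIFT)

/* Base of cpu n's area: a shift and an add, no table lookup. */
static inline void *per_cpu_area(int cpu)
{
	return (void *)(PER_CPU_START +
			((unsigned long)cpu << PER_CPU_AREA_SHIFT));
}

/* A per cpu object allocated at offset "off" within the area: */
static inline void *per_cpu_object(int cpu, unsigned long off)
{
	return per_cpu_area(cpu) + off;
}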
^ permalink raw reply [flat|nested] 62+ messages in thread
* Re: [patch 0/7] [RFC] SLUB: Improve allocpercpu to reduce per cpu access overhead
2007-11-01 22:14 ` David Miller
@ 2007-11-01 22:16 ` Christoph Lameter
0 siblings, 0 replies; 62+ messages in thread
From: Christoph Lameter @ 2007-11-01 22:16 UTC (permalink / raw)
To: David Miller
Cc: dada1, akpm, linux-arch, linux-kernel, mathieu.desnoyers, penberg
On Thu, 1 Nov 2007, David Miller wrote:
> From: Christoph Lameter <clameter@sgi.com>
> Date: Thu, 1 Nov 2007 15:11:41 -0700 (PDT)
>
> > On Thu, 1 Nov 2007, David Miller wrote:
> >
> > > The remaining issue with accessing per-cpu areas at multiple virtual
> > > addresses is D-cache aliasing.
> >
> > But that is not an issue for physically mapped caches.
>
> Right but I'd like to use this on sparc64 which has L1 D-cache
> aliasing on some chips :-)
Hmmm... re the message I just sent. Then we have to return the memory by
its virtual address, not its physical address, on sparc. May result in
zones with holes though.
^ permalink raw reply [flat|nested] 62+ messages in thread
* Re: [patch 0/7] [RFC] SLUB: Improve allocpercpu to reduce per cpu access overhead
2007-11-01 22:15 ` Christoph Lameter
@ 2007-11-01 22:38 ` David Miller
2007-11-01 22:48 ` Christoph Lameter
0 siblings, 1 reply; 62+ messages in thread
From: David Miller @ 2007-11-01 22:38 UTC (permalink / raw)
To: clameter; +Cc: akpm, linux-arch, linux-kernel, mathieu.desnoyers, penberg
From: Christoph Lameter <clameter@sgi.com>
Date: Thu, 1 Nov 2007 15:15:39 -0700 (PDT)
> After boot is complete we allow the reduction of the size of the per cpu
> areas . Lets say we only need 128k per cpu. Then the remaining pages will
> be returned to the page allocator.
You don't know how much you will need. I exhausted the limit on
sparc64 very late in the boot process when the last few userland
services were starting up.
And if I subsequently bring up 100,000 IP tunnels, it will exhaust the
per-cpu allocation area.
You have to make it fully dynamic, there is no way around it.
^ permalink raw reply [flat|nested] 62+ messages in thread
* Re: [patch 0/7] [RFC] SLUB: Improve allocpercpu to reduce per cpu access overhead
2007-11-01 22:38 ` David Miller
@ 2007-11-01 22:48 ` Christoph Lameter
2007-11-01 22:58 ` David Miller
2007-11-01 23:00 ` Eric Dumazet
0 siblings, 2 replies; 62+ messages in thread
From: Christoph Lameter @ 2007-11-01 22:48 UTC (permalink / raw)
To: David Miller; +Cc: akpm, linux-arch, linux-kernel, mathieu.desnoyers, penberg
On Thu, 1 Nov 2007, David Miller wrote:
> From: Christoph Lameter <clameter@sgi.com>
> Date: Thu, 1 Nov 2007 15:15:39 -0700 (PDT)
>
> > After boot is complete we allow the reduction of the size of the per cpu
> > areas . Lets say we only need 128k per cpu. Then the remaining pages will
> > be returned to the page allocator.
>
> You don't know how much you will need. I exhausted the limit on
> sparc64 very late in the boot process when the last few userland
> services were starting up.
Well you would be able to specify how much will remain. If not it will
just keep the 2M reserve around.
> And if I subsequently bring up 100,000 IP tunnels, it will exhaust the
> per-cpu allocation area.
Each tunnel needs 4 bytes per cpu?
> You have to make it fully dynamic, there is no way around it.
Na. Some reasonable upper limit needs to be set. If we set that to say
32Megabytes and do the virtual mapping then we can just populate the first
2M and only allocate the remainder if we need it. Then we need to rely on
Mel's defrag stuff, though, to defrag memory if we need it.
^ permalink raw reply [flat|nested] 62+ messages in thread
* Re: [patch 0/7] [RFC] SLUB: Improve allocpercpu to reduce per cpu access overhead
2007-11-01 22:48 ` Christoph Lameter
@ 2007-11-01 22:58 ` David Miller
2007-11-02 1:06 ` Christoph Lameter
` (2 more replies)
2007-11-01 23:00 ` Eric Dumazet
1 sibling, 3 replies; 62+ messages in thread
From: David Miller @ 2007-11-01 22:58 UTC (permalink / raw)
To: clameter; +Cc: akpm, linux-arch, linux-kernel, mathieu.desnoyers, penberg
From: Christoph Lameter <clameter@sgi.com>
Date: Thu, 1 Nov 2007 15:48:00 -0700 (PDT)
> On Thu, 1 Nov 2007, David Miller wrote:
>
> > From: Christoph Lameter <clameter@sgi.com>
> > Date: Thu, 1 Nov 2007 15:15:39 -0700 (PDT)
> >
> > > After boot is complete we allow the reduction of the size of the per cpu
> > > areas . Lets say we only need 128k per cpu. Then the remaining pages will
> > > be returned to the page allocator.
> >
> > You don't know how much you will need. I exhausted the limit on
> > sparc64 very late in the boot process when the last few userland
> > services were starting up.
>
> Well you would be able to specify how much will remain. If not it will
> just keep the 2M reserve around.
>
> > And if I subsequently bring up 100,000 IP tunnels, it will exhaust the
> > per-cpu allocation area.
>
> Each tunnel needs 4 bytes per cpu?
Each IP compression tunnel instance does an alloc_percpu().
Since you're the one who wants to change the semantics and guarantees
of this interface, perhaps it might help if you did some greps around
the tree to see how alloc_percpu() is actually used. That's what
I did when I started running into trouble with your patches.
You cannot put limits on the amount of alloc_percpu() memory available
to clients, please let's proceed with that basic understanding in
mind. We're wasting a ton of time discussing this fundamental issue.
^ permalink raw reply [flat|nested] 62+ messages in thread
* Re: [patch 0/7] [RFC] SLUB: Improve allocpercpu to reduce per cpu access overhead
2007-11-01 22:48 ` Christoph Lameter
2007-11-01 22:58 ` David Miller
@ 2007-11-01 23:00 ` Eric Dumazet
2007-11-02 0:58 ` Christoph Lameter
2007-11-02 1:40 ` Christoph Lameter
1 sibling, 2 replies; 62+ messages in thread
From: Eric Dumazet @ 2007-11-01 23:00 UTC (permalink / raw)
To: Christoph Lameter
Cc: David Miller, akpm, linux-arch, linux-kernel, mathieu.desnoyers,
penberg
Christoph Lameter wrote:
> On Thu, 1 Nov 2007, David Miller wrote:
>
>> From: Christoph Lameter <clameter@sgi.com>
>> Date: Thu, 1 Nov 2007 15:15:39 -0700 (PDT)
>>
>>> After boot is complete we allow the reduction of the size of the per cpu
>>> areas . Lets say we only need 128k per cpu. Then the remaining pages will
>>> be returned to the page allocator.
>> You don't know how much you will need. I exhausted the limit on
>> sparc64 very late in the boot process when the last few userland
>> services were starting up.
>
> Well you would be able to specify how much will remain. If not it will
> just keep the 2M reserve around.
>
>> And if I subsequently bring up 100,000 IP tunnels, it will exhaust the
>> per-cpu allocation area.
>
> Each tunnel needs 4 bytes per cpu?
well, if we move last_rx to a percpu var, we need 8 bytes of percpu space per
net_device :)
>
>> You have to make it fully dynamic, there is no way around it.
>
> Na. Some reasonable upper limit needs to be set. If we set that to say
> 32Megabytes and do the virtual mapping then we can just populate the first
> 2M and only allocate the remainder if we need it. Then we need to rely on
> Mel's defrag stuff, though, to defrag memory if we need it.
If a 2MB page is not available, could we revert to using 4KB pages? (like
the vmalloc stuff), paying an extra runtime overhead of course.
^ permalink raw reply [flat|nested] 62+ messages in thread
* Re: [patch 0/7] [RFC] SLUB: Improve allocpercpu to reduce per cpu access overhead
2007-11-01 23:00 ` Eric Dumazet
@ 2007-11-02 0:58 ` Christoph Lameter
2007-11-02 1:40 ` Christoph Lameter
1 sibling, 0 replies; 62+ messages in thread
From: Christoph Lameter @ 2007-11-02 0:58 UTC (permalink / raw)
To: Eric Dumazet
Cc: David Miller, akpm, linux-arch, linux-kernel, mathieu.desnoyers,
penberg
On Fri, 2 Nov 2007, Eric Dumazet wrote:
> > Na. Some reasonable upper limit needs to be set. If we set that to say
> > 32Megabytes and do the virtual mapping then we can just populate the first
> > 2M and only allocate the remainder if we need it. Then we need to rely on
> > Mel's defrag stuff, though, to defrag memory if we need it.
>
> If a 2MB page is not available, could we revert to using 4KB pages? (like
> the vmalloc stuff), paying an extra runtime overhead of course.
Sure. It's going to be like vmemmap. There will be a limit imposed, though,
by the amount of virtual space available. Basically the dynamic per cpu
area can be at maximum
available_virtual_space / NR_CPUS
^ permalink raw reply [flat|nested] 62+ messages in thread
* Re: [patch 0/7] [RFC] SLUB: Improve allocpercpu to reduce per cpu access overhead
2007-11-01 22:58 ` David Miller
@ 2007-11-02 1:06 ` Christoph Lameter
2007-11-02 2:51 ` David Miller
2007-11-02 10:28 ` Peter Zijlstra
2007-11-12 10:52 ` Herbert Xu
2 siblings, 1 reply; 62+ messages in thread
From: Christoph Lameter @ 2007-11-02 1:06 UTC (permalink / raw)
To: David Miller; +Cc: akpm, linux-arch, linux-kernel, mathieu.desnoyers, penberg
On Thu, 1 Nov 2007, David Miller wrote:
> You cannot put limits on the amount of alloc_percpu() memory available
> to clients, please let's proceed with that basic understanding in
> mind. We're wasting a ton of time discussing this fundamental issue.
There is no point in making absolute demands like "no limits". There are
always limits to everything.
A new implementation avoids the need to allocate per cpu arrays and also
avoids the 32 bytes per object times cpus that are mostly wasted for small
allocations today. So it's going to potentially allow more per cpu objects
than are available today.
A reasonable implementation for 64 bit is likely going to depend on
reserving some virtual memory space for the per cpu mappings so that they
can be dynamically grown up to what the reserved virtual space allows.
F.e. If we reserve 256G of virtual space and support a maximum of 16k cpus
then there is a limit on the per cpu space available of 16MB.
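For the record, the arithmetic behind that figure (sketch):

/*
 *   reserved window : 256GB = 1UL << 38
 *   maximum cpus    : 16384 = 1UL << 14
 *   per cpu limit   : 1UL << (38 - 14) = 16MB
 */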
^ permalink raw reply [flat|nested] 62+ messages in thread
* Re: [patch 0/7] [RFC] SLUB: Improve allocpercpu to reduce per cpu access overhead
2007-11-01 23:00 ` Eric Dumazet
2007-11-02 0:58 ` Christoph Lameter
@ 2007-11-02 1:40 ` Christoph Lameter
1 sibling, 0 replies; 62+ messages in thread
From: Christoph Lameter @ 2007-11-02 1:40 UTC (permalink / raw)
To: Eric Dumazet
Cc: David Miller, akpm, linux-arch, linux-kernel, mathieu.desnoyers,
penberg
Hmmm... On x86_64 we could take 8 terabyte virtual space (bit order 43)
With the worst case scenario of 16k of cpus (bit order 16) we are looking
at 43-16 = 27 ~ 128MB per cpu. Each percpu can at max be mapped by 64 pmd
entries. 4k support is actually max for projected hw. So we'd get
to 512M.
On IA64 we could take half of the vmemmap area which is 45 bits. So
we could get up to 512MB (with 16k pages, 64k pages can get us even
further) assuming we can at some point run 16 processors per node (4k is
the current max which would put the limit on the per cpu area >1GB).
Let's say you have a system with 64 cpus and an area of 128M of per cpu
storage. Then we are using 8GB of total memory for per cpu storage. The
128M allows us to store f.e. 16 M of word size counters.
With SLAB and the current allocpercpu you would need the following for
16M counters:
16M*32*64 (minimum alloc size of SLAB is 32 byte and we alloc via
kmalloc) for the data.
16M*64*8 for the pointer arrays. 16M allocpercpu areas for 64 processors
and a pointer size of 8 bytes.
So you would need to use 40G in current systems. The new scheme
would only need 8GB for the same amount of counters.
So I think it's unreasonable to assume that current systems exist that
can use more than 128M of allocpercpu space (assuming 64 cpus).
---
include/asm-x86/pgtable_64.h | 4 ++++
1 file changed, 4 insertions(+)
Index: linux-2.6/include/asm-x86/pgtable_64.h
===================================================================
--- linux-2.6.orig/include/asm-x86/pgtable_64.h 2007-11-01 18:15:52.282577904 -0700
+++ linux-2.6/include/asm-x86/pgtable_64.h 2007-11-01 18:18:02.886979040 -0700
@@ -138,10 +138,14 @@ static inline pte_t ptep_get_and_clear_f
#define VMALLOC_START _AC(0xffffc20000000000, UL)
#define VMALLOC_END _AC(0xffffe1ffffffffff, UL)
#define VMEMMAP_START _AC(0xffffe20000000000, UL)
+#define PERCPU_START _AC(0xfffff20000000000, UL)
+#define PERCPU_END _AC(0xfffffa0000000000, UL)
#define MODULES_VADDR _AC(0xffffffff88000000, UL)
#define MODULES_END _AC(0xfffffffffff00000, UL)
#define MODULES_LEN (MODULES_END - MODULES_VADDR)
+#define PERCPU_MIN_SHIFT PMD_SHIFT
+#define PERCPU_BITS 43
+
#define _PAGE_BIT_PRESENT 0
#define _PAGE_BIT_RW 1
#define _PAGE_BIT_USER 2
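Derived from these constants (sketch only; PERCPU_AREA_SPACE and
PERCPU_MAX_PER_CPU are illustrative names, not part of the patch):

/* 0xfffffa0000000000 - 0xfffff20000000000 = 1UL << 43 = 8TB in total. */
#define PERCPU_AREA_SPACE	(PERCPU_END - PERCPU_START)

/* Each possible cpu gets at most an equal slice of the window; the
 * pages backing it are only populated on demand. */
#define PERCPU_MAX_PER_CPU	(PERCPU_AREA_SPACE / NR_CPUS)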
^ permalink raw reply [flat|nested] 62+ messages in thread
* Re: [patch 0/7] [RFC] SLUB: Improve allocpercpu to reduce per cpu access overhead
2007-11-02 1:06 ` Christoph Lameter
@ 2007-11-02 2:51 ` David Miller
0 siblings, 0 replies; 62+ messages in thread
From: David Miller @ 2007-11-02 2:51 UTC (permalink / raw)
To: clameter; +Cc: akpm, linux-arch, linux-kernel, mathieu.desnoyers, penberg
From: Christoph Lameter <clameter@sgi.com>
Date: Thu, 1 Nov 2007 18:06:17 -0700 (PDT)
> A reasonable implementation for 64 bit is likely going to depend on
> reserving some virtual memory space for the per cpu mappings so that they
> can be dynamically grown up to what the reserved virtual space allows.
>
> F.e. If we reserve 256G of virtual space and support a maximum of 16k cpus
> then there is a limit on the per cpu space available of 16MB.
Now that I understand your implementation better, yes this
sounds just fine.
^ permalink raw reply [flat|nested] 62+ messages in thread
* Re: [patch 0/7] [RFC] SLUB: Improve allocpercpu to reduce per cpu access overhead
2007-11-01 22:58 ` David Miller
2007-11-02 1:06 ` Christoph Lameter
@ 2007-11-02 10:28 ` Peter Zijlstra
2007-11-02 14:35 ` Christoph Lameter
2007-11-12 10:52 ` Herbert Xu
2 siblings, 1 reply; 62+ messages in thread
From: Peter Zijlstra @ 2007-11-02 10:28 UTC (permalink / raw)
To: David Miller
Cc: clameter, akpm, linux-arch, linux-kernel, mathieu.desnoyers,
penberg
On Thu, 2007-11-01 at 15:58 -0700, David Miller wrote:
> Since you're the one who wants to change the semantics and guarantees
> of this interface, perhaps it might help if you did some greps around
> the tree to see how alloc_percpu() is actually used. That's what
> I did when I started running into trouble with your patches.
This fancy new BDI stuff also lives off percpu_counter/alloc_percpu().
That means that for example each NFS mount also consumes a number of
words - not quite sure from the top of my head how many, might be in the
order of 24 bytes or something.
I once before started looking at this, because the current
alloc_percpu() can have some false sharing - not that I have machines
that are overly bothered by that. I like the idea of a strict percpu
region, however do be aware of the users.
^ permalink raw reply [flat|nested] 62+ messages in thread
* Re: [patch 0/7] [RFC] SLUB: Improve allocpercpu to reduce per cpu access overhead
2007-11-02 10:28 ` Peter Zijlstra
@ 2007-11-02 14:35 ` Christoph Lameter
2007-11-02 15:20 ` Peter Zijlstra
0 siblings, 1 reply; 62+ messages in thread
From: Christoph Lameter @ 2007-11-02 14:35 UTC (permalink / raw)
To: Peter Zijlstra
Cc: David Miller, akpm, linux-arch, linux-kernel, mathieu.desnoyers,
penberg
On Fri, 2 Nov 2007, Peter Zijlstra wrote:
> On Thu, 2007-11-01 at 15:58 -0700, David Miller wrote:
>
> > Since you're the one who wants to change the semantics and guarantees
> > of this interface, perhaps it might help if you did some greps around
> > the tree to see how alloc_percpu() is actually used. That's what
> > I did when I started running into trouble with your patches.
>
> This fancy new BDI stuff also lives off percpu_counter/alloc_percpu().
Yes there are numerous uses. I even can increase page allocator
performance and reduce its memory footprint by using it here.
> That means that for example each NFS mount also consumes a number of
> words - not quite sure from the top of my head how many, might be in the
> order of 24 bytes or something.
>
> I once before started looking at this, because the current
> alloc_percpu() can have some false sharing - not that I have machines
> that are overly bothered by that. I like the idea of a strict percpu
> region, however do be aware of the users.
Well I wonder if I should introduce it not as a replacement but as an
alternative to allocpercpu? We can then gradually switch over. The
existing API does not allow the specification of gfp_masks or alignments.
^ permalink raw reply [flat|nested] 62+ messages in thread
* Re: [patch 0/7] [RFC] SLUB: Improve allocpercpu to reduce per cpu access overhead
2007-11-02 14:35 ` Christoph Lameter
@ 2007-11-02 15:20 ` Peter Zijlstra
2007-11-02 15:29 ` Christoph Lameter
0 siblings, 1 reply; 62+ messages in thread
From: Peter Zijlstra @ 2007-11-02 15:20 UTC (permalink / raw)
To: Christoph Lameter
Cc: David Miller, akpm, linux-arch, linux-kernel, mathieu.desnoyers,
penberg
On Fri, 2007-11-02 at 07:35 -0700, Christoph Lameter wrote:
> Well I wonder if I should introduce it not as a replacement but as an
> alternative to allocpercpu? We can then gradually switch over. The
> existing API does not allow the specification of gfp_masks or alignments.
I've thought about suggesting that very thing. However, I think we need
to have a clear view of where we're going with that so that we don't end
up with two per cpu allocators because some users could not be converted
over or some such.
^ permalink raw reply [flat|nested] 62+ messages in thread
* Re: [patch 0/7] [RFC] SLUB: Improve allocpercpu to reduce per cpu access overhead
2007-11-02 15:20 ` Peter Zijlstra
@ 2007-11-02 15:29 ` Christoph Lameter
0 siblings, 0 replies; 62+ messages in thread
From: Christoph Lameter @ 2007-11-02 15:29 UTC (permalink / raw)
To: Peter Zijlstra
Cc: David Miller, akpm, linux-arch, linux-kernel, mathieu.desnoyers,
penberg
On Fri, 2 Nov 2007, Peter Zijlstra wrote:
> On Fri, 2007-11-02 at 07:35 -0700, Christoph Lameter wrote:
>
> > Well I wonder if I should introduce it not as a replacement but as an
> > alternative to allocpercpu? We can then gradually switch over. The
> > existing API does not allow the specification of gfp_masks or alignments.
>
> I've thought about suggesting that very thing. However, I think we need
> to have a clear view of where we're going with that so that we don't end
> up with two per cpu allocators because some users could not be converted
> over or some such.
At least my tests so far show that it can be a full replacement, but
then I have only tested on x86_64 and IA64. It's likely much easier to go
for the full replacement rather than in steps.
If we want dynamically sized virtually mapped per cpu areas then we may
have issues on 32 bit platforms and with !MMU. So I would think that a
fallback to a statically sized version may be needed. On the other hand
!MMU and 32 bit do not support a large number of processors. So we may be
able to get away on 32 bit with a small virtual memory area.
^ permalink raw reply [flat|nested] 62+ messages in thread
* Re: [patch 0/7] [RFC] SLUB: Improve allocpercpu to reduce per cpu access overhead
2007-11-01 22:58 ` David Miller
2007-11-02 1:06 ` Christoph Lameter
2007-11-02 10:28 ` Peter Zijlstra
@ 2007-11-12 10:52 ` Herbert Xu
2007-11-12 19:14 ` Christoph Lameter
2007-11-12 21:28 ` David Miller
2 siblings, 2 replies; 62+ messages in thread
From: Herbert Xu @ 2007-11-12 10:52 UTC (permalink / raw)
To: David Miller
Cc: clameter, akpm, linux-arch, linux-kernel, mathieu.desnoyers,
penberg
David Miller <davem@davemloft.net> wrote:
>
> Each IP compression tunnel instance does an alloc_percpu().
Actually all IPComp tunnels share one set of objects which are
allocated per-cpu. So only the first tunnel would do that.
In fact that was precisely the reason why per-cpu is used in
IPComp as otherwise we can just allocate normal memory.
Cheers,
--
Visit Openswan at http://www.openswan.org/
Email: Herbert Xu ~{PmV>HI~} <herbert@gondor.apana.org.au>
Home Page: http://gondor.apana.org.au/~herbert/
PGP Key: http://gondor.apana.org.au/~herbert/pubkey.txt
^ permalink raw reply [flat|nested] 62+ messages in thread
* Re: [patch 0/7] [RFC] SLUB: Improve allocpercpu to reduce per cpu access overhead
2007-11-12 10:52 ` Herbert Xu
@ 2007-11-12 19:14 ` Christoph Lameter
2007-11-12 19:48 ` Eric Dumazet
2007-11-12 21:28 ` David Miller
1 sibling, 1 reply; 62+ messages in thread
From: Christoph Lameter @ 2007-11-12 19:14 UTC (permalink / raw)
To: Herbert Xu
Cc: David Miller, akpm, linux-arch, linux-kernel, mathieu.desnoyers,
penberg
On Mon, 12 Nov 2007, Herbert Xu wrote:
> David Miller <davem@davemloft.net> wrote:
> >
> > Each IP compression tunnel instance does an alloc_percpu().
>
> Actually all IPComp tunnels share one set of objects which are
> allocated per-cpu. So only the first tunnel would do that.
Ahh so the need to be able to expand per cpu memory storage on demand
is not as critical as we thought.
^ permalink raw reply [flat|nested] 62+ messages in thread
* Re: [patch 0/7] [RFC] SLUB: Improve allocpercpu to reduce per cpu access overhead
2007-11-12 19:14 ` Christoph Lameter
@ 2007-11-12 19:48 ` Eric Dumazet
2007-11-12 19:56 ` Christoph Lameter
2007-11-12 19:57 ` Luck, Tony
0 siblings, 2 replies; 62+ messages in thread
From: Eric Dumazet @ 2007-11-12 19:48 UTC (permalink / raw)
To: Christoph Lameter
Cc: Herbert Xu, David Miller, akpm, linux-arch, linux-kernel,
mathieu.desnoyers, penberg
Christoph Lameter wrote:
> On Mon, 12 Nov 2007, Herbert Xu wrote:
>
>> David Miller <davem@davemloft.net> wrote:
>>> Each IP compression tunnel instance does an alloc_percpu().
>> Actually all IPComp tunnels share one set of objects which are
>> allocated per-cpu. So only the first tunnel would do that.
>
> Ahh so the need to be able to expand per cpu memory storage on demand
> is not as critical as we thought.
>
Yes, but still desirable for future optimizations.
For example, I do think using a per cpu memory storage on net_device refcnt &
last_rx could give us some speedups.
^ permalink raw reply [flat|nested] 62+ messages in thread
* Re: [patch 0/7] [RFC] SLUB: Improve allocpercpu to reduce per cpu access overhead
2007-11-12 19:48 ` Eric Dumazet
@ 2007-11-12 19:56 ` Christoph Lameter
2007-11-12 20:18 ` Eric Dumazet
2007-11-12 19:57 ` Luck, Tony
1 sibling, 1 reply; 62+ messages in thread
From: Christoph Lameter @ 2007-11-12 19:56 UTC (permalink / raw)
To: Eric Dumazet
Cc: Herbert Xu, David Miller, akpm, linux-arch, linux-kernel,
mathieu.desnoyers, penberg
[-- Attachment #1: Type: TEXT/PLAIN, Size: 872 bytes --]
On Mon, 12 Nov 2007, Eric Dumazet wrote:
> Christoph Lameter wrote:
> > On Mon, 12 Nov 2007, Herbert Xu wrote:
> >
> > > David Miller <davem@davemloft.net> wrote:
> > > > Each IP compression tunnel instance does an alloc_percpu().
> > > Actually all IPComp tunnels share one set of objects which are
> > > allocated per-cpu. So only the first tunnel would do that.
> >
> > Ahh so the need to be able to expand per cpu memory storage on demand is not
> > as critical as we thought.
> >
>
> Yes, but still desirable for future optimizations.
>
> For example, I do think using a per cpu memory storage on net_device refcnt &
> last_rx could give us some speedups.
Note that there was a new patchset posted (titled cpu alloc v1) that
provides on demand extension of the cpu areas.
See http://marc.info/?l=linux-kernel&m=119438261304093&w=2
^ permalink raw reply [flat|nested] 62+ messages in thread
* RE: [patch 0/7] [RFC] SLUB: Improve allocpercpu to reduce per cpu access overhead
2007-11-12 19:48 ` Eric Dumazet
2007-11-12 19:56 ` Christoph Lameter
@ 2007-11-12 19:57 ` Luck, Tony
2007-11-12 20:14 ` Eric Dumazet
1 sibling, 1 reply; 62+ messages in thread
From: Luck, Tony @ 2007-11-12 19:57 UTC (permalink / raw)
To: Eric Dumazet, Christoph Lameter
Cc: Herbert Xu, David Miller, akpm, linux-arch, linux-kernel,
mathieu.desnoyers, penberg
> > Ahh so the need to be able to expand per cpu memory storage on demand
> > is not as critical as we thought.
> >
>
> Yes, but still desirable for future optimizations.
>
> For example, I do think using a per cpu memory storage on net_device refcnt &
> last_rx could give us some speedups.
We do want to keep a very tight handle on bloat in per-cpu
allocations. By definition the total allocation is multiplied
by the number of cpus. Only ia64 has outrageous numbers of
cpus in a single system image today ... but the trend in
multi-core chips looks to have a Moore's law arc to it, so
everyone is going to be looking at lots of cpus before long.
-Tony
^ permalink raw reply [flat|nested] 62+ messages in thread
* Re: [patch 0/7] [RFC] SLUB: Improve allocpercpu to reduce per cpu access overhead
2007-11-12 19:57 ` Luck, Tony
@ 2007-11-12 20:14 ` Eric Dumazet
2007-11-12 22:46 ` David Miller
0 siblings, 1 reply; 62+ messages in thread
From: Eric Dumazet @ 2007-11-12 20:14 UTC (permalink / raw)
To: Luck, Tony
Cc: Christoph Lameter, Herbert Xu, David Miller, akpm, linux-arch,
linux-kernel, mathieu.desnoyers, penberg
Luck, Tony wrote:
>>> Ahh so the need to be able to expand per cpu memory storage on demand
>>> is not as critical as we thought.
>>>
>> Yes, but still desirable for future optimizations.
>>
>> For example, I do think using a per cpu memory storage on net_device refcnt &
>> last_rx could give us some speedups.
>
> We do want to keep a very tight handle on bloat in per-cpu
> allocations. By definition the total allocation is multiplied
> by the number of cpus. Only ia64 has outrageous numbers of
> cpus in a single system image today ... but the trend in
> multi-core chips looks to have a Moore's law arc to it, so
> everyone is going to be looking at lots of cpus before long.
>
I don't think this is a problem. CPU counts and RAM sizes are related, even
if Moore didn't predict it.
Nobody wants to ship or build a 4096-cpu machine with only 256 MB of RAM
inside. Or call it a GPU and don't expect it to run Linux :)
99.9% of the Linux machines running on earth have fewer than 8 cpus and
fewer than 1000 ethernet/network devices.
If we keep increasing the number of cpus in a machine, the limiting factor
becomes the fact that cpus have to continually exchange, over the memory
bus, the heavily touched cache lines that hold refcounters or stats.
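(To make that concrete, here is a minimal sketch of the per-cpu counter
pattern Eric is describing. The names are made up and this is not an actual
net_device patch; it only shows how a shared, bouncing counter can be
replaced by per-cpu storage that is summed on the rare read side:)

#include <linux/percpu.h>
#include <linux/smp.h>
#include <linux/cpumask.h>
#include <linux/errno.h>

/* Illustrative per-cpu statistics: each cpu increments only its own copy
 * on the hot path, so no cache line bounces between cpus.
 */
struct pkt_stats {
	unsigned long rx_packets;
};

static struct pkt_stats *pkt_stats;	/* set up by alloc_percpu() */

static int pkt_stats_init(void)
{
	pkt_stats = alloc_percpu(struct pkt_stats);
	return pkt_stats ? 0 : -ENOMEM;
}

/* Hot path: touch only the local cpu's cache line. */
static void pkt_stats_inc_rx(void)
{
	per_cpu_ptr(pkt_stats, get_cpu())->rx_packets++;
	put_cpu();
}

/* Rare read path: fold the per-cpu copies into a total. */
static unsigned long pkt_stats_read_rx(void)
{
	unsigned long sum = 0;
	int cpu;

	for_each_possible_cpu(cpu)
		sum += per_cpu_ptr(pkt_stats, cpu)->rx_packets;
	return sum;
}

(The read side is O(number of possible cpus), which is the trade-off implied
above: it pays off when updates vastly outnumber reads, as they do for
packet and byte statistics.)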
^ permalink raw reply [flat|nested] 62+ messages in thread
* Re: [patch 0/7] [RFC] SLUB: Improve allocpercpu to reduce per cpu access overhead
2007-11-12 19:56 ` Christoph Lameter
@ 2007-11-12 20:18 ` Eric Dumazet
2007-11-12 22:46 ` David Miller
0 siblings, 1 reply; 62+ messages in thread
From: Eric Dumazet @ 2007-11-12 20:18 UTC (permalink / raw)
To: Christoph Lameter
Cc: Herbert Xu, David Miller, akpm, linux-arch, linux-kernel,
mathieu.desnoyers, penberg
Christoph Lameter wrote:
> On Mon, 12 Nov 2007, Eric Dumazet wrote:
>> For example, I do think using a per cpu memory storage on net_device refcnt &
>> last_rx could give us some speedups.
>
> Note that there was a new patchset posted (titled cpu alloc v1) that
> provides on demand extension of the cpu areas.
>
> See http://marc.info/?l=linux-kernel&m=119438261304093&w=2
Thank you Christoph. I was traveling last week so I missed that.
This new patchset looks very interesting, you did a fantastic job!
^ permalink raw reply [flat|nested] 62+ messages in thread
* Re: [patch 0/7] [RFC] SLUB: Improve allocpercpu to reduce per cpu access overhead
2007-11-12 10:52 ` Herbert Xu
2007-11-12 19:14 ` Christoph Lameter
@ 2007-11-12 21:28 ` David Miller
1 sibling, 0 replies; 62+ messages in thread
From: David Miller @ 2007-11-12 21:28 UTC (permalink / raw)
To: herbert
Cc: clameter, akpm, linux-arch, linux-kernel, mathieu.desnoyers,
penberg
From: Herbert Xu <herbert@gondor.apana.org.au>
Date: Mon, 12 Nov 2007 18:52:35 +0800
> David Miller <davem@davemloft.net> wrote:
> >
> > Each IP compression tunnel instance does an alloc_percpu().
>
> Actually all IPComp tunnels share one set of objects which are
> allocated per-cpu. So only the first tunnel would do that.
>
> In fact that was precisely the reason why per-cpu is used in
> IPComp as otherwise we can just allocate normal memory.
Hmmm... indeed. Thanks for clearing this up.
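(For reference, the "first user allocates, last user frees" arrangement
Herbert describes can be sketched roughly as follows. Names, the buffer
size and the locking are illustrative only, not the real ipcomp code:)

#include <linux/percpu.h>
#include <linux/mutex.h>

#define SCRATCH_SIZE 4096		/* illustrative size */

struct scratch_buf {
	char data[SCRATCH_SIZE];
};

static DEFINE_MUTEX(scratch_lock);
static struct scratch_buf *scratch;	/* one per-cpu allocation shared by all tunnels */
static int scratch_users;

/* Called on tunnel setup: only the first caller really allocates. */
static struct scratch_buf *scratch_get(void)
{
	mutex_lock(&scratch_lock);
	if (!scratch_users++)
		scratch = alloc_percpu(struct scratch_buf);
	mutex_unlock(&scratch_lock);
	return scratch;			/* NULL check / unwind omitted for brevity */
}

/* Called on tunnel teardown: the last user frees the per-cpu objects. */
static void scratch_put(void)
{
	mutex_lock(&scratch_lock);
	if (!--scratch_users) {
		free_percpu(scratch);
		scratch = NULL;
	}
	mutex_unlock(&scratch_lock);
}

(This is also why on-demand growth of the per-cpu area matters less for this
particular user: the per-cpu footprint stays constant no matter how many
tunnels are configured.)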
^ permalink raw reply [flat|nested] 62+ messages in thread
* Re: [patch 0/7] [RFC] SLUB: Improve allocpercpu to reduce per cpu access overhead
2007-11-12 20:14 ` Eric Dumazet
@ 2007-11-12 22:46 ` David Miller
0 siblings, 0 replies; 62+ messages in thread
From: David Miller @ 2007-11-12 22:46 UTC (permalink / raw)
To: dada1
Cc: tony.luck, clameter, herbert, akpm, linux-arch, linux-kernel,
mathieu.desnoyers, penberg
From: Eric Dumazet <dada1@cosmosbay.com>
Date: Mon, 12 Nov 2007 21:14:47 +0100
> I don't think this is a problem. CPU counts and RAM sizes are related, even
> if Moore didn't predict it.
>
> Nobody wants to ship or build a 4096-cpu machine with only 256 MB of RAM
> inside. Or call it a GPU and don't expect it to run Linux :)
>
> 99.9% of the Linux machines running on earth have fewer than 8 cpus and
> fewer than 1000 ethernet/network devices.
>
> If we keep increasing the number of cpus in a machine, the limiting factor
> becomes the fact that cpus have to continually exchange, over the memory
> bus, the heavily touched cache lines that hold refcounters or stats.
I totally agree with everything Eric is saying here.
^ permalink raw reply [flat|nested] 62+ messages in thread
* Re: [patch 0/7] [RFC] SLUB: Improve allocpercpu to reduce per cpu access overhead
2007-11-12 20:18 ` Eric Dumazet
@ 2007-11-12 22:46 ` David Miller
0 siblings, 0 replies; 62+ messages in thread
From: David Miller @ 2007-11-12 22:46 UTC (permalink / raw)
To: dada1
Cc: clameter, herbert, akpm, linux-arch, linux-kernel,
mathieu.desnoyers, penberg
From: Eric Dumazet <dada1@cosmosbay.com>
Date: Mon, 12 Nov 2007 21:18:17 +0100
> Christoph Lameter wrote:
> > On Mon, 12 Nov 2007, Eric Dumazet wrote:
> >> For example, I do think using a per cpu memory storage on net_device refcnt &
> >> last_rx could give us some speedups.
> >
> > Note that there was a new patchset posted (titled cpu alloc v1) that
> > provides on demand extension of the cpu areas.
> >
> > See http://marc.info/?l=linux-kernel&m=119438261304093&w=2
>
> Thank you Christoph. I was traveling last week so I missed that.
>
> This new patchset looks very interesting, you did a fantastic job!
Yes I like it too. It's in my backlog of things to test on
sparc64.
^ permalink raw reply [flat|nested] 62+ messages in thread
end of thread, other threads:[~2007-11-12 22:46 UTC | newest]
Thread overview: 62+ messages
-- links below jump to the message on this page --
2007-11-01 0:02 [patch 0/7] [RFC] SLUB: Improve allocpercpu to reduce per cpu access overhead Christoph Lameter
2007-11-01 0:02 ` [patch 1/7] allocpercpu: Make it a true per cpu allocator by allocating from a per cpu array Christoph Lameter
2007-11-01 7:24 ` Eric Dumazet
2007-11-01 12:59 ` Christoph Lameter
2007-11-01 0:02 ` [patch 2/7] allocpercpu: Remove functions that are rarely used Christoph Lameter
2007-11-01 0:02 ` [patch 3/7] Allocpercpu: Do __percpu_disguise() only if CONFIG_DEBUG_VM is set Christoph Lameter
2007-11-01 7:25 ` Eric Dumazet
2007-11-01 0:02 ` [patch 4/7] Percpu: Add support for this_cpu_offset() to be able to create this_cpu_ptr() Christoph Lameter
2007-11-01 0:02 ` [patch 5/7] SLUB: Use allocpercpu to allocate per cpu data instead of running our own per cpu allocator Christoph Lameter
2007-11-01 0:02 ` [patch 6/7] SLUB: No need to cache kmem_cache data in kmem_cache_cpu anymore Christoph Lameter
2007-11-01 0:02 ` [patch 7/7] SLUB: Optimize per cpu access on the local cpu using this_cpu_ptr() Christoph Lameter
2007-11-01 0:24 ` [patch 0/7] [RFC] SLUB: Improve allocpercpu to reduce per cpu access overhead David Miller
2007-11-01 0:26 ` Christoph Lameter
2007-11-01 0:27 ` David Miller
2007-11-01 0:31 ` Christoph Lameter
2007-11-01 0:51 ` David Miller
2007-11-01 0:53 ` Christoph Lameter
2007-11-01 1:00 ` David Miller
2007-11-01 1:01 ` Christoph Lameter
2007-11-01 1:09 ` David Miller
2007-11-01 1:12 ` Christoph Lameter
2007-11-01 1:13 ` David Miller
2007-11-01 1:21 ` Christoph Lameter
2007-11-01 5:27 ` David Miller
2007-11-01 4:16 ` Christoph Lameter
2007-11-01 5:38 ` David Miller
2007-11-01 7:01 ` David Miller
2007-11-01 9:14 ` David Miller
2007-11-01 13:03 ` Christoph Lameter
2007-11-01 21:29 ` David Miller
2007-11-01 22:15 ` Christoph Lameter
2007-11-01 22:38 ` David Miller
2007-11-01 22:48 ` Christoph Lameter
2007-11-01 22:58 ` David Miller
2007-11-02 1:06 ` Christoph Lameter
2007-11-02 2:51 ` David Miller
2007-11-02 10:28 ` Peter Zijlstra
2007-11-02 14:35 ` Christoph Lameter
2007-11-02 15:20 ` Peter Zijlstra
2007-11-02 15:29 ` Christoph Lameter
2007-11-12 10:52 ` Herbert Xu
2007-11-12 19:14 ` Christoph Lameter
2007-11-12 19:48 ` Eric Dumazet
2007-11-12 19:56 ` Christoph Lameter
2007-11-12 20:18 ` Eric Dumazet
2007-11-12 22:46 ` David Miller
2007-11-12 19:57 ` Luck, Tony
2007-11-12 20:14 ` Eric Dumazet
2007-11-12 22:46 ` David Miller
2007-11-12 21:28 ` David Miller
2007-11-01 23:00 ` Eric Dumazet
2007-11-02 0:58 ` Christoph Lameter
2007-11-02 1:40 ` Christoph Lameter
2007-11-01 7:17 ` Eric Dumazet
2007-11-01 7:57 ` David Miller
2007-11-01 13:01 ` Christoph Lameter
2007-11-01 21:25 ` David Miller
2007-11-01 12:57 ` Christoph Lameter
2007-11-01 21:28 ` David Miller
2007-11-01 22:11 ` Christoph Lameter
2007-11-01 22:14 ` David Miller
2007-11-01 22:16 ` Christoph Lameter