linux-mm.kvack.org archive mirror
* [patch 0/6] Per cpu structures for SLUB
@ 2007-08-23  6:46 Christoph Lameter
  2007-08-23  6:46 ` [patch 1/6] SLUB: Avoid page struct cacheline bouncing due to remote frees to cpu slab Christoph Lameter
                   ` (6 more replies)
  0 siblings, 7 replies; 10+ messages in thread
From: Christoph Lameter @ 2007-08-23  6:46 UTC (permalink / raw)
  To: akpm; +Cc: linux-kernel, linux-mm, Pekka Enberg

The following patchset introduces per cpu structures for SLUB. These
are very small (and multiples of these may fit into one cacheline)
and (apart from performance improvements) allow us to address
several issues in SLUB:

1. The number of objects per slab is no longer limited to a 16 bit
   number.

2. Room is freed up in the page struct. We can avoid using the
mapping field, which lets us get rid of the #ifdef CONFIG_SLUB
in page_mapping().

3. We will have an easier time adding new things like Peter Z.'s reserve
   management.

The RFC for this patchset was discussed on lkml a while ago:

http://marc.info/?l=linux-kernel&m=118386677704534&w=2

(And no, this patchset does not include the use of cmpxchg_local that
we discussed recently on lkml nor the cmpxchg implementation
mentioned in the RFC)

Performance
-----------


Norm = 2.6.23-rc3
PCPU = Adds page allocator pass through plus per cpu structure patches


IA64 8p 4n NUMA Altix

            Single threaded               Concurrent Alloc

	Kmalloc		Alloc/Free	Kmalloc         Alloc/Free
 Size	Norm   PCPU	Norm   PCPU	Norm   PCPU	Norm   PCPU
-------------------------------------------------------------------
    8	132	84	93	104	98	90	95	106
   16    98	92	93	104	115	98	95	106
   32   112	105	93	104	146	111	95	106
   64	119	112	93	104	214	133	95	106
  128   132	119	94	104	321	163	95	106
  256+  83255	176	106	115	415	224	108	117
  512   191	176	106	115	487	341	108	117
 1024   252	246	106	115	937	609	108	117
 2048   308	292	107	115	2494	1207	108	117
 4096   341	319	107	115	2497	1217	108	117
 8192   402	380	107	115	2367	1188	108	117
16384*  560	474	106	434	4464	1904	108	478

X86_64 2p SMP (Dual Core Pentium 940)

         Single threaded                   Concurrent Alloc

        Kmalloc         Alloc/Free      Kmalloc         Alloc/Free
 Size   Norm   PCPU     Norm   PCPU     Norm   PCPU     Norm   PCPU
--------------------------------------------------------------------
    8	313	227	314	324	207	208	314	323
   16   202	203	315	324	209	211	312	321
   32	212	207	314	324	251	243	312	321
   64	240	237	314	326	329	306	312	321
  128	301	302	314	324	511	416	313	324
  256   498	554	327	332	970	837	326	332
  512   532	553	324	332	1025	932	326	335
 1024   705	718	325	333	1489	1231	324	330
 2048   764	767	324	334	2708	2175	324	332
 4096* 1033	476	325	674	4727	782	324	678

Notes:

Worst case:
-----------
We generally lose in the alloc/free test (x86_64 3%, IA64 5-10%)
since the processing overhead increases because we need to look up
the per cpu structure. Alloc/Free is simply kfree(kmalloc(size, mask)),
i.e. objects with the shortest lifetime possible. We would never use
objects in that way, but the measurement is important to show the worst
case overhead created.
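
For reference, a minimal sketch of what such a worst case test loop might
look like (hypothetical; the actual in-kernel test module used for the
numbers above is not part of this posting):

	/*
	 * Sketch of the Alloc/Free worst case measurement: time
	 * back-to-back kfree(kmalloc()) pairs and report the average
	 * number of cycles per pair.
	 */
	#include <linux/slab.h>
	#include <linux/timex.h>	/* cycles_t, get_cycles() */

	static cycles_t time_alloc_free(size_t size, int iterations)
	{
		cycles_t start = get_cycles();
		int i;

		for (i = 0; i < iterations; i++)
			kfree(kmalloc(size, GFP_KERNEL));

		return (get_cycles() - start) / iterations;
	}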

Single Threaded:
----------------
The single threaded kmalloc test shows behavior of a continual stream
of allocation without contention. In the SMP case the losses are minimal.
In the NUMA case we already have a winner there because the per cpu structure
is placed local to the processor. So in the single threaded case we already
win around 5% just by placing things better.

Concurrent Alloc:
-----------------
We have varying gains of up to 50% on NUMA because we are now never updating
a cacheline used by the other processor and the data structures are local
to the processor.

The SMP case shows gains but they are smaller (especially since
this is the smallest SMP system possible.... 2 CPUs). So only up
to 25%.

Page allocator pass through
---------------------------
There is a significant difference in the columns marked with a * because
of the way that allocations for page sized objects are handled. If we handle
the allocations in the slab allocator (Norm) then the alloc/free test
results are superb since we can use the per cpu slab to just pass a pointer
back and forth. The page allocator pass through (PCPU) shows that the page
allocator may have problems with giving back the same page after a free.
Or there is something else in the page allocator that creates significant
overhead compared to slab. Needs to be checked out I guess.

However, the page allocator pass through is a win in the other cases
since we can cut out the page allocator overhead. That is the more typical
load of allocating a sequence of objects and we should optimize for that.

(+ = Must be some cache artifact here or code crossing a TLB boundary.
The result is reproducible)

-- 


* [patch 1/6] SLUB: Avoid page struct cacheline bouncing due to remote frees to cpu slab
  2007-08-23  6:46 [patch 0/6] Per cpu structures for SLUB Christoph Lameter
@ 2007-08-23  6:46 ` Christoph Lameter
  2007-08-23  6:46 ` [patch 2/6] SLUB: Do not use page->mapping Christoph Lameter
                   ` (5 subsequent siblings)
  6 siblings, 0 replies; 10+ messages in thread
From: Christoph Lameter @ 2007-08-23  6:46 UTC (permalink / raw)
  To: akpm; +Cc: linux-kernel, linux-mm, Pekka Enberg

[-- Attachment #1: 0005-SLUB-Avoid-page-struct-cacheline-bouncing-due-to-re.patch --]
[-- Type: text/plain, Size: 16455 bytes --]

A remote free may access the same page struct that also contains the lockless
freelist for the cpu slab. If objects have a short lifetime and are freed by
a different processor then remote frees back to the slab from which we are
currently allocating are frequent. The cacheline with the page struct needs
to be repeatedly acquired in exclusive mode by both the allocating thread and
the freeing thread. If this is frequent enough then performance will suffer
because of cacheline bouncing.

This patch puts the lockless_freelist pointer in its own cacheline. In
order to make that happen we introduce a per cpu structure called
kmem_cache_cpu.

Instead of keeping an array of pointers to page structs we now keep an array
of per cpu structures, each of which--among other things--contains the pointer
to the lockless freelist. The freeing thread can then keep possession of
exclusive access to the page struct cacheline while the allocating thread
keeps its exclusive access to the cacheline containing the per cpu structure.

This works as long as the allocating cpu is able to service its request
from the lockless freelist. If the lockless freelist runs empty then the
allocating thread needs to acquire exclusive access to the cacheline with
the page struct and lock the slab.

The allocating thread will then check if new objects were freed to the per
cpu slab. If so it will keep the slab as the cpu slab and continue with the
recently remote-freed objects. So the allocating thread can take a series
of just-freed remote objects and dish them out again. Ideally allocations
would just keep recycling objects in the same slab this way, which leads
to an ideal allocation / remote free pattern.
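
To illustrate the mechanism, here is a minimal user space model (a sketch,
not code from this patch): free objects store the pointer to the next free
object within themselves at a given word offset; local allocs pop from the
lockless freelist while remote frees push onto the regular freelist under
the slab lock.

	#include <stddef.h>

	struct model_page { void **freelist; };	/* regular freelist head */
	struct model_cpu  { void **freelist; };	/* lockless freelist head */

	/* Local alloc: pop from the lockless freelist, no lock taken. */
	static void *model_alloc(struct model_cpu *c, unsigned int offset)
	{
		void **object = c->freelist;

		if (!object)
			return NULL;		/* slow path: refill from page */
		c->freelist = object[offset];	/* next free object */
		return object;
	}

	/* Remote free: push onto the regular freelist (under slab_lock in SLUB). */
	static void model_remote_free(struct model_page *page, void *x,
					unsigned int offset)
	{
		void **object = x;

		object[offset] = page->freelist;
		page->freelist = object;
	}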

The number of objects that can be handled in this way is limited by the
capacity of one slab. Increasing slab size via slub_min_objects/
slub_max_order may increase the number of objects and therefore performance.

If the allocating thread runs out of objects and finds that no objects were
put back by the remote processor then it will retrieve a new slab (from the
partial lists or from the page allocator) and start with a whole
new set of objects while the remote thread may still be freeing objects to
the old cpu slab. This may then repeat until the new slab is also exhausted.
If remote freeing has freed objects in the earlier slab then that earlier
slab will now be on the partial list and the allocating thread will
pick that slab next for allocation. So the loop is extended. However,
both threads need to take the list_lock to make the swizzling via
the partial list happen.

It is likely that this kind of scheme will keep the objects being passed
around to a small set that can be kept in the cpu caches leading to increased
performance.

More code cleanups become possible:

- Instead of passing a cpu we can now pass a kmem_cache_cpu structure around.
  Allows reducing the number of parameters to various functions.
- Can define a new node_match() function for NUMA to encapsulate locality
  checks.

Effect on allocations:

Cachelines touched before this patch:

	Write:	page struct cacheline and first cacheline of object

Cachelines touched after this patch:

	Write:	kmem_cache_cpu cacheline and first cacheline of object
	Read:	page struct cacheline (but see a later patch that avoids
		touching that cacheline)

The handling when the lockless alloc list runs empty gets a bit more
complicated since another cacheline now has to be written to. But that is
halfway out of the hot path.

Effect on freeing:

Cachelines touched before this patch:

	Write: page struct and first cacheline of object

Cachelines touched after this patch depending on how we free:

  Write(to cpu_slab):	kmem_cache_cpu struct and first cacheline of object
  Write(to other):	page struct and first cacheline of object

  Read(to cpu_slab):	page struct to identify the slab etc. (but see a later
  			patch that avoids touching the page struct on free)
  Read(to other):	cpu local kmem_cache_cpu struct to verify it's not
  			the cpu slab.

Summary:

Pro:
	- Distinct cachelines so that concurrent remote frees and local
	  allocs on a cpu slab can occur without cacheline bouncing.
	- Avoids potentially bouncing cachelines caused by neighboring
	  per cpu pointer updates in kmem_cache's cpu_slab structure, since
	  each entry now grows to a full cacheline (therefore the comment
	  that talked about that concern is removed).

Cons:
	- Freeing objects now requires the reading of one additional
	  cacheline. That can be mitigated for some cases by the following
	  patches but it's not possible to completely eliminate these
	  references.

	- Memory usage grows slightly.

	The size of each per cpu object is blown up from one word
	(pointing to the page struct) to one cacheline with various data.
	So this is NR_CPUS*NR_SLABS*L1_BYTES more memory use. Let's say
	NR_SLABS is 100 and the cache line size is 128; then we have just
	increased slab metadata requirements by 12.8k per cpu.
	(Another later patch reduces these requirements.)

Signed-off-by: Christoph Lameter <clameter@sgi.com>
---
 include/linux/slub_def.h |    9 +-
 mm/slub.c                |  192 ++++++++++++++++++++++++++++-------------------
 2 files changed, 126 insertions(+), 75 deletions(-)

Index: linux-2.6.23-rc3-mm1/include/linux/slub_def.h
===================================================================
--- linux-2.6.23-rc3-mm1.orig/include/linux/slub_def.h	2007-08-22 17:14:32.000000000 -0700
+++ linux-2.6.23-rc3-mm1/include/linux/slub_def.h	2007-08-22 17:18:56.000000000 -0700
@@ -11,6 +11,13 @@
 #include <linux/workqueue.h>
 #include <linux/kobject.h>
 
+struct kmem_cache_cpu {
+	void **freelist;
+	struct page *page;
+	int node;
+	/* Lots of wasted space */
+} ____cacheline_aligned_in_smp;
+
 struct kmem_cache_node {
 	spinlock_t list_lock;	/* Protect partial list and nr_partial */
 	unsigned long nr_partial;
@@ -54,7 +61,7 @@ struct kmem_cache {
 	int defrag_ratio;
 	struct kmem_cache_node *node[MAX_NUMNODES];
 #endif
-	struct page *cpu_slab[NR_CPUS];
+	struct kmem_cache_cpu cpu_slab[NR_CPUS];
 };
 
 /*
Index: linux-2.6.23-rc3-mm1/mm/slub.c
===================================================================
--- linux-2.6.23-rc3-mm1.orig/mm/slub.c	2007-08-22 17:18:50.000000000 -0700
+++ linux-2.6.23-rc3-mm1/mm/slub.c	2007-08-22 17:20:05.000000000 -0700
@@ -90,7 +90,7 @@
  * 			One use of this flag is to mark slabs that are
  * 			used for allocations. Then such a slab becomes a cpu
  * 			slab. The cpu slab may be equipped with an additional
- * 			lockless_freelist that allows lockless access to
+ * 			freelist that allows lockless access to
  * 			free objects in addition to the regular freelist
  * 			that requires the slab lock.
  *
@@ -140,11 +140,6 @@ static inline void ClearSlabDebug(struct
 /*
  * Issues still to be resolved:
  *
- * - The per cpu array is updated for each new slab and and is a remote
- *   cacheline for most nodes. This could become a bouncing cacheline given
- *   enough frequent updates. There are 16 pointers in a cacheline, so at
- *   max 16 cpus could compete for the cacheline which may be okay.
- *
  * - Support PAGE_ALLOC_DEBUG. Should be easy to do.
  *
  * - Variable sizing of the per node arrays
@@ -284,6 +279,11 @@ static inline struct kmem_cache_node *ge
 #endif
 }
 
+static inline struct kmem_cache_cpu *get_cpu_slab(struct kmem_cache *s, int cpu)
+{
+	return &s->cpu_slab[cpu];
+}
+
 static inline int check_valid_pointer(struct kmem_cache *s,
 				struct page *page, const void *object)
 {
@@ -1385,33 +1385,34 @@ static void unfreeze_slab(struct kmem_ca
 /*
  * Remove the cpu slab
  */
-static void deactivate_slab(struct kmem_cache *s, struct page *page, int cpu)
+static void deactivate_slab(struct kmem_cache *s, struct kmem_cache_cpu *c)
 {
+	struct page *page = c->page;
 	/*
 	 * Merge cpu freelist into freelist. Typically we get here
 	 * because both freelists are empty. So this is unlikely
 	 * to occur.
 	 */
-	while (unlikely(page->lockless_freelist)) {
+	while (unlikely(c->freelist)) {
 		void **object;
 
 		/* Retrieve object from cpu_freelist */
-		object = page->lockless_freelist;
-		page->lockless_freelist = page->lockless_freelist[page->offset];
+		object = c->freelist;
+		c->freelist = c->freelist[page->offset];
 
 		/* And put onto the regular freelist */
 		object[page->offset] = page->freelist;
 		page->freelist = object;
 		page->inuse--;
 	}
-	s->cpu_slab[cpu] = NULL;
+	c->page = NULL;
 	unfreeze_slab(s, page);
 }
 
-static inline void flush_slab(struct kmem_cache *s, struct page *page, int cpu)
+static inline void flush_slab(struct kmem_cache *s, struct kmem_cache_cpu *c)
 {
-	slab_lock(page);
-	deactivate_slab(s, page, cpu);
+	slab_lock(c->page);
+	deactivate_slab(s, c);
 }
 
 /*
@@ -1420,18 +1421,17 @@ static inline void flush_slab(struct kme
  */
 static inline void __flush_cpu_slab(struct kmem_cache *s, int cpu)
 {
-	struct page *page = s->cpu_slab[cpu];
+	struct kmem_cache_cpu *c = get_cpu_slab(s, cpu);
 
-	if (likely(page))
-		flush_slab(s, page, cpu);
+	if (likely(c && c->page))
+		flush_slab(s, c);
 }
 
 static void flush_cpu_slab(void *d)
 {
 	struct kmem_cache *s = d;
-	int cpu = smp_processor_id();
 
-	__flush_cpu_slab(s, cpu);
+	__flush_cpu_slab(s, smp_processor_id());
 }
 
 static void flush_all(struct kmem_cache *s)
@@ -1448,6 +1448,19 @@ static void flush_all(struct kmem_cache 
 }
 
 /*
+ * Check if the objects in a per cpu structure fit numa
+ * locality expectations.
+ */
+static inline int node_match(struct kmem_cache_cpu *c, int node)
+{
+#ifdef CONFIG_NUMA
+	if (node != -1 && c->node != node)
+		return 0;
+#endif
+	return 1;
+}
+
+/*
  * Slow path. The lockless freelist is empty or we need to perform
  * debugging duties.
  *
@@ -1465,45 +1478,46 @@ static void flush_all(struct kmem_cache 
  * we need to allocate a new slab. This is slowest path since we may sleep.
  */
 static void *__slab_alloc(struct kmem_cache *s,
-		gfp_t gfpflags, int node, void *addr, struct page *page)
+		gfp_t gfpflags, int node, void *addr, struct kmem_cache_cpu *c)
 {
 	void **object;
-	int cpu = smp_processor_id();
+	struct page *new;
 
-	if (!page)
+	if (!c->page)
 		goto new_slab;
 
-	slab_lock(page);
-	if (unlikely(node != -1 && page_to_nid(page) != node))
+	slab_lock(c->page);
+	if (unlikely(!node_match(c, node)))
 		goto another_slab;
 load_freelist:
-	object = page->freelist;
+	object = c->page->freelist;
 	if (unlikely(!object))
 		goto another_slab;
-	if (unlikely(SlabDebug(page)))
+	if (unlikely(SlabDebug(c->page)))
 		goto debug;
 
-	object = page->freelist;
-	page->lockless_freelist = object[page->offset];
-	page->inuse = s->objects;
-	page->freelist = NULL;
-	slab_unlock(page);
+	object = c->page->freelist;
+	c->freelist = object[c->page->offset];
+	c->page->inuse = s->objects;
+	c->page->freelist = NULL;
+	c->node = page_to_nid(c->page);
+	slab_unlock(c->page);
 	return object;
 
 another_slab:
-	deactivate_slab(s, page, cpu);
+	deactivate_slab(s, c);
 
 new_slab:
-	page = get_partial(s, gfpflags, node);
-	if (page) {
-		s->cpu_slab[cpu] = page;
+	new = get_partial(s, gfpflags, node);
+	if (new) {
+		c->page = new;
 		goto load_freelist;
 	}
 
-	page = new_slab(s, gfpflags, node);
-	if (page) {
-		cpu = smp_processor_id();
-		if (s->cpu_slab[cpu]) {
+	new = new_slab(s, gfpflags, node);
+	if (new) {
+		c = get_cpu_slab(s, smp_processor_id());
+		if (c->page) {
 			/*
 			 * Someone else populated the cpu_slab while we
 			 * enabled interrupts, or we have gotten scheduled
@@ -1511,34 +1525,32 @@ new_slab:
 			 * requested node even if __GFP_THISNODE was
 			 * specified. So we need to recheck.
 			 */
-			if (node == -1 ||
-				page_to_nid(s->cpu_slab[cpu]) == node) {
+			if (node_match(c, node)) {
 				/*
 				 * Current cpuslab is acceptable and we
 				 * want the current one since its cache hot
 				 */
-				discard_slab(s, page);
-				page = s->cpu_slab[cpu];
-				slab_lock(page);
+				discard_slab(s, new);
+				slab_lock(c->page);
 				goto load_freelist;
 			}
 			/* New slab does not fit our expectations */
-			flush_slab(s, s->cpu_slab[cpu], cpu);
+			flush_slab(s, c);
 		}
-		slab_lock(page);
-		SetSlabFrozen(page);
-		s->cpu_slab[cpu] = page;
+		slab_lock(new);
+		SetSlabFrozen(new);
+		c->page = new;
 		goto load_freelist;
 	}
 	return NULL;
 debug:
-	object = page->freelist;
-	if (!alloc_debug_processing(s, page, object, addr))
+	object = c->page->freelist;
+	if (!alloc_debug_processing(s, c->page, object, addr))
 		goto another_slab;
 
-	page->inuse++;
-	page->freelist = object[page->offset];
-	slab_unlock(page);
+	c->page->inuse++;
+	c->page->freelist = object[c->page->offset];
+	slab_unlock(c->page);
 	return object;
 }
 
@@ -1555,20 +1567,20 @@ debug:
 static void __always_inline *slab_alloc(struct kmem_cache *s,
 		gfp_t gfpflags, int node, void *addr)
 {
-	struct page *page;
 	void **object;
 	unsigned long flags;
+	struct kmem_cache_cpu *c;
 
 	local_irq_save(flags);
-	page = s->cpu_slab[smp_processor_id()];
-	if (unlikely(!page || !page->lockless_freelist ||
-			(node != -1 && page_to_nid(page) != node)))
+	c = get_cpu_slab(s, smp_processor_id());
+	if (unlikely(!c->page || !c->freelist ||
+					!node_match(c, node)))
 
-		object = __slab_alloc(s, gfpflags, node, addr, page);
+		object = __slab_alloc(s, gfpflags, node, addr, c);
 
 	else {
-		object = page->lockless_freelist;
-		page->lockless_freelist = object[page->offset];
+		object = c->freelist;
+		c->freelist = object[c->page->offset];
 	}
 	local_irq_restore(flags);
 
@@ -1666,13 +1678,14 @@ static void __always_inline slab_free(st
 {
 	void **object = (void *)x;
 	unsigned long flags;
+	struct kmem_cache_cpu *c;
 
 	local_irq_save(flags);
 	debug_check_no_locks_freed(object, s->objsize);
-	if (likely(page == s->cpu_slab[smp_processor_id()] &&
-						!SlabDebug(page))) {
-		object[page->offset] = page->lockless_freelist;
-		page->lockless_freelist = object;
+	c = get_cpu_slab(s, smp_processor_id());
+	if (likely(page == c->page && !SlabDebug(page))) {
+		object[page->offset] = c->freelist;
+		c->freelist = object;
 	} else
 		__slab_free(s, page, x, addr);
 
@@ -1865,6 +1878,24 @@ static unsigned long calculate_alignment
 	return ALIGN(align, sizeof(void *));
 }
 
+static void init_kmem_cache_cpu(struct kmem_cache *s,
+			struct kmem_cache_cpu *c)
+{
+	c->page = NULL;
+	c->freelist = NULL;
+	c->node = 0;
+}
+
+static inline int alloc_kmem_cache_cpus(struct kmem_cache *s, gfp_t flags)
+{
+	int cpu;
+
+	for_each_possible_cpu(cpu)
+		init_kmem_cache_cpu(s, get_cpu_slab(s, cpu));
+
+	return 1;
+}
+
 static void init_kmem_cache_node(struct kmem_cache_node *n)
 {
 	n->nr_partial = 0;
@@ -2115,8 +2146,12 @@ static int kmem_cache_open(struct kmem_c
 #ifdef CONFIG_NUMA
 	s->defrag_ratio = 100;
 #endif
-	if (init_kmem_cache_nodes(s, gfpflags & ~SLUB_DMA))
+	if (!init_kmem_cache_nodes(s, gfpflags & ~SLUB_DMA))
+		goto error;
+
+	if (alloc_kmem_cache_cpus(s, gfpflags & ~SLUB_DMA))
 		return 1;
+
 error:
 	if (flags & SLAB_PANIC)
 		panic("Cannot create slab %s size=%lu realsize=%u "
@@ -2658,7 +2693,7 @@ void __init kmem_cache_init(void)
 #endif
 
 	kmem_size = offsetof(struct kmem_cache, cpu_slab) +
-				nr_cpu_ids * sizeof(struct page *);
+				nr_cpu_ids * sizeof(struct kmem_cache_cpu);
 
 	printk(KERN_INFO "SLUB: Genslabs=%d, HWalign=%d, Order=%d-%d, MinObjects=%d,"
 		" CPUs=%d, Nodes=%d\n",
@@ -3261,11 +3296,14 @@ static unsigned long slab_objects(struct
 	per_cpu = nodes + nr_node_ids;
 
 	for_each_possible_cpu(cpu) {
-		struct page *page = s->cpu_slab[cpu];
-		int node;
+		struct page *page;
+		struct kmem_cache_cpu *c = get_cpu_slab(s, cpu);
 
+		if (!c)
+			continue;
+
+		page = c->page;
 		if (page) {
-			node = page_to_nid(page);
 			if (flags & SO_CPU) {
 				int x = 0;
 
@@ -3274,9 +3312,9 @@ static unsigned long slab_objects(struct
 				else
 					x = 1;
 				total += x;
-				nodes[node] += x;
+				nodes[c->node] += x;
 			}
-			per_cpu[node]++;
+			per_cpu[c->node]++;
 		}
 	}
 
@@ -3322,13 +3360,19 @@ static int any_slab_objects(struct kmem_
 	int node;
 	int cpu;
 
-	for_each_possible_cpu(cpu)
-		if (s->cpu_slab[cpu])
+	for_each_possible_cpu(cpu) {
+		struct kmem_cache_cpu *c = get_cpu_slab(s, cpu);
+
+		if (c && c->page)
 			return 1;
+	}
 
-	for_each_node(node) {
+	for_each_online_node(node) {
 		struct kmem_cache_node *n = get_node(s, node);
 
+		if (!n)
+			continue;
+
 		if (n->nr_partial || atomic_long_read(&n->nr_slabs))
 			return 1;
 	}

-- 


* [patch 2/6] SLUB: Do not use page->mapping
  2007-08-23  6:46 [patch 0/6] Per cpu structures for SLUB Christoph Lameter
  2007-08-23  6:46 ` [patch 1/6] SLUB: Avoid page struct cacheline bouncing due to remote frees to cpu slab Christoph Lameter
@ 2007-08-23  6:46 ` Christoph Lameter
  2007-08-23  6:46 ` [patch 3/6] SLUB: Move page->offset to kmem_cache_cpu->offset Christoph Lameter
                   ` (4 subsequent siblings)
  6 siblings, 0 replies; 10+ messages in thread
From: Christoph Lameter @ 2007-08-23  6:46 UTC (permalink / raw)
  To: akpm; +Cc: linux-kernel, linux-mm, Pekka Enberg

[-- Attachment #1: 0006-SLUB-Do-not-use-page-mapping.patch --]
[-- Type: text/plain, Size: 2029 bytes --]

After moving the lockless_freelist to kmem_cache_cpu we no longer need
page->lockless_freelist. Restructure the use of the struct page fields in
such a way that we never touch the mapping field.

This in turn allows us to remove the special casing of SLUB when determining
the mapping of a page (needed for corner cases on machines with virtual
caches that need to flush the caches of processors mapping a page).
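
For context, the special casing that becomes removable in page_mapping() is
of roughly the following form (paraphrased from memory of the then-current
tree, not taken from this diff):

	static inline struct address_space *page_mapping(struct page *page)
	{
		struct address_space *mapping = page->mapping;

		if (unlikely(PageSwapCache(page)))
			mapping = &swapper_space;
	#ifdef CONFIG_SLUB
		else if (unlikely(PageSlab(page)))
			mapping = NULL;	/* mapping field overlaid by SLUB */
	#endif
		else if (unlikely((unsigned long)mapping & PAGE_MAPPING_ANON))
			mapping = NULL;
		return mapping;
	}

Once SLUB no longer overlays page->mapping, the CONFIG_SLUB branch can go.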

Signed-off-by: Christoph Lameter <clameter@sgi.com>
---
 include/linux/mm_types.h |    9 ++-------
 mm/slub.c                |    2 --
 2 files changed, 2 insertions(+), 9 deletions(-)

Index: linux-2.6.23-rc3-mm1/include/linux/mm_types.h
===================================================================
--- linux-2.6.23-rc3-mm1.orig/include/linux/mm_types.h	2007-08-22 17:14:32.000000000 -0700
+++ linux-2.6.23-rc3-mm1/include/linux/mm_types.h	2007-08-22 17:20:13.000000000 -0700
@@ -62,13 +62,8 @@ struct page {
 #if NR_CPUS >= CONFIG_SPLIT_PTLOCK_CPUS
 	    spinlock_t ptl;
 #endif
-	    struct {			/* SLUB uses */
-	    	void **lockless_freelist;
-		struct kmem_cache *slab;	/* Pointer to slab */
-	    };
-	    struct {
-		struct page *first_page;	/* Compound pages */
-	    };
+	    struct kmem_cache *slab;	/* SLUB: Pointer to slab */
+	    struct page *first_page;	/* Compound tail pages */
 	};
 	union {
 		pgoff_t index;		/* Our offset within mapping. */
Index: linux-2.6.23-rc3-mm1/mm/slub.c
===================================================================
--- linux-2.6.23-rc3-mm1.orig/mm/slub.c	2007-08-22 17:20:05.000000000 -0700
+++ linux-2.6.23-rc3-mm1/mm/slub.c	2007-08-22 17:20:13.000000000 -0700
@@ -1125,7 +1125,6 @@ static struct page *new_slab(struct kmem
 	set_freepointer(s, last, NULL);
 
 	page->freelist = start;
-	page->lockless_freelist = NULL;
 	page->inuse = 0;
 out:
 	if (flags & __GFP_WAIT)
@@ -1151,7 +1150,6 @@ static void __free_slab(struct kmem_cach
 		NR_SLAB_RECLAIMABLE : NR_SLAB_UNRECLAIMABLE,
 		- pages);
 
-	page->mapping = NULL;
 	__free_pages(page, s->order);
 }
 

-- 


* [patch 3/6] SLUB: Move page->offset to kmem_cache_cpu->offset
  2007-08-23  6:46 [patch 0/6] Per cpu structures for SLUB Christoph Lameter
  2007-08-23  6:46 ` [patch 1/6] SLUB: Avoid page struct cacheline bouncing due to remote frees to cpu slab Christoph Lameter
  2007-08-23  6:46 ` [patch 2/6] SLUB: Do not use page->mapping Christoph Lameter
@ 2007-08-23  6:46 ` Christoph Lameter
  2007-08-23  6:46 ` [patch 4/6] SLUB: Avoid touching page struct when freeing to per cpu slab Christoph Lameter
                   ` (3 subsequent siblings)
  6 siblings, 0 replies; 10+ messages in thread
From: Christoph Lameter @ 2007-08-23  6:46 UTC (permalink / raw)
  To: akpm; +Cc: linux-kernel, linux-mm, Pekka Enberg

[-- Attachment #1: 0007-SLUB-Move-page-offset-to-kmem_cache_cpu-offset.patch --]
[-- Type: text/plain, Size: 8512 bytes --]

We need the offset from the page struct during slab_alloc and slab_free. In
both cases we also reference the cacheline of the kmem_cache_cpu structure.
We can therefore move the offset field into the kmem_cache_cpu structure,
freeing up 16 bits in the page struct.

Moving the offset allows an allocation from slab_alloc() without touching the
page struct in the hot path.

The only thing left in slab_free() that touches the page struct cacheline for
per cpu freeing is the checking of SlabDebug(page). The next patch deals with
that.

Use the available 16 bits to broaden page->inuse. More than 64k objects per
slab become possible and we can get rid of the checks for that limitation.

There is no longer any need to shrink the order of slabs if we boot with 2M
sized slabs (slub_min_order=9).

There is also no longer any need to switch off the offset calculation for
very large slabs: the field in the kmem_cache_cpu structure is 32 bits, so
the offset field can now handle slab sizes of up to 8GB.

Signed-off-by: Christoph Lameter <clameter@sgi.com>
---
 include/linux/mm_types.h |    5 --
 include/linux/slub_def.h |    1 
 mm/slub.c                |   80 +++++++++--------------------------------------
 3 files changed, 18 insertions(+), 68 deletions(-)

Index: linux-2.6.23-rc3-mm1/include/linux/mm_types.h
===================================================================
--- linux-2.6.23-rc3-mm1.orig/include/linux/mm_types.h	2007-08-22 17:20:13.000000000 -0700
+++ linux-2.6.23-rc3-mm1/include/linux/mm_types.h	2007-08-22 17:20:28.000000000 -0700
@@ -37,10 +37,7 @@ struct page {
 					 * to show when page is mapped
 					 * & limit reverse map searches.
 					 */
-		struct {	/* SLUB uses */
-			short unsigned int inuse;
-			short unsigned int offset;
-		};
+		unsigned int inuse;	/* SLUB: Nr of objects */
 	};
 	union {
 	    struct {
Index: linux-2.6.23-rc3-mm1/include/linux/slub_def.h
===================================================================
--- linux-2.6.23-rc3-mm1.orig/include/linux/slub_def.h	2007-08-22 17:18:56.000000000 -0700
+++ linux-2.6.23-rc3-mm1/include/linux/slub_def.h	2007-08-22 17:23:29.000000000 -0700
@@ -15,6 +15,7 @@ struct kmem_cache_cpu {
 	void **freelist;
 	struct page *page;
 	int node;
+	unsigned int offset;
 	/* Lots of wasted space */
 } ____cacheline_aligned_in_smp;
 
Index: linux-2.6.23-rc3-mm1/mm/slub.c
===================================================================
--- linux-2.6.23-rc3-mm1.orig/mm/slub.c	2007-08-22 17:20:13.000000000 -0700
+++ linux-2.6.23-rc3-mm1/mm/slub.c	2007-08-22 17:23:36.000000000 -0700
@@ -207,11 +207,6 @@ static inline void ClearSlabDebug(struct
 #define ARCH_SLAB_MINALIGN __alignof__(unsigned long long)
 #endif
 
-/*
- * The page->inuse field is 16 bit thus we have this limitation
- */
-#define MAX_OBJECTS_PER_SLAB 65535
-
 /* Internal SLUB flags */
 #define __OBJECT_POISON		0x80000000 /* Poison object */
 #define __SYSFS_ADD_DEFERRED	0x40000000 /* Not yet visible via sysfs */
@@ -736,11 +731,6 @@ static int check_slab(struct kmem_cache 
 		slab_err(s, page, "Not a valid slab page");
 		return 0;
 	}
-	if (page->offset * sizeof(void *) != s->offset) {
-		slab_err(s, page, "Corrupted offset %lu",
-			(unsigned long)(page->offset * sizeof(void *)));
-		return 0;
-	}
 	if (page->inuse > s->objects) {
 		slab_err(s, page, "inuse %u > max %u",
 			s->name, page->inuse, s->objects);
@@ -879,8 +869,6 @@ bad:
 		slab_fix(s, "Marking all objects used");
 		page->inuse = s->objects;
 		page->freelist = NULL;
-		/* Fix up fields that may be corrupted */
-		page->offset = s->offset / sizeof(void *);
 	}
 	return 0;
 }
@@ -996,30 +984,12 @@ __setup("slub_debug", setup_slub_debug);
 static void kmem_cache_open_debug_check(struct kmem_cache *s)
 {
 	/*
-	 * The page->offset field is only 16 bit wide. This is an offset
-	 * in units of words from the beginning of an object. If the slab
-	 * size is bigger then we cannot move the free pointer behind the
-	 * object anymore.
-	 *
-	 * On 32 bit platforms the limit is 256k. On 64bit platforms
-	 * the limit is 512k.
-	 *
-	 * Debugging or ctor may create a need to move the free
-	 * pointer. Fail if this happens.
+	 * Enable debugging if selected on the kernel commandline.
 	 */
-	if (s->objsize >= 65535 * sizeof(void *)) {
-		BUG_ON(s->flags & (SLAB_RED_ZONE | SLAB_POISON |
-				SLAB_STORE_USER | SLAB_DESTROY_BY_RCU));
-		BUG_ON(s->ctor);
-	}
-	else
-		/*
-		 * Enable debugging if selected on the kernel commandline.
-		 */
-		if (slub_debug && (!slub_debug_slabs ||
-		    strncmp(slub_debug_slabs, s->name,
-		    	strlen(slub_debug_slabs)) == 0))
-				s->flags |= slub_debug;
+	if (slub_debug && (!slub_debug_slabs ||
+		strncmp(slub_debug_slabs, s->name,
+		strlen(slub_debug_slabs)) == 0))
+			s->flags |= slub_debug;
 }
 #else
 static inline void setup_object_debug(struct kmem_cache *s,
@@ -1102,7 +1072,6 @@ static struct page *new_slab(struct kmem
 	n = get_node(s, page_to_nid(page));
 	if (n)
 		atomic_long_inc(&n->nr_slabs);
-	page->offset = s->offset / sizeof(void *);
 	page->slab = s;
 	page->flags |= 1 << PG_slab;
 	if (s->flags & (SLAB_DEBUG_FREE | SLAB_RED_ZONE | SLAB_POISON |
@@ -1396,10 +1365,10 @@ static void deactivate_slab(struct kmem_
 
 		/* Retrieve object from cpu_freelist */
 		object = c->freelist;
-		c->freelist = c->freelist[page->offset];
+		c->freelist = c->freelist[c->offset];
 
 		/* And put onto the regular freelist */
-		object[page->offset] = page->freelist;
+		object[c->offset] = page->freelist;
 		page->freelist = object;
 		page->inuse--;
 	}
@@ -1495,7 +1464,7 @@ load_freelist:
 		goto debug;
 
 	object = c->page->freelist;
-	c->freelist = object[c->page->offset];
+	c->freelist = object[c->offset];
 	c->page->inuse = s->objects;
 	c->page->freelist = NULL;
 	c->node = page_to_nid(c->page);
@@ -1547,7 +1516,7 @@ debug:
 		goto another_slab;
 
 	c->page->inuse++;
-	c->page->freelist = object[c->page->offset];
+	c->page->freelist = object[c->offset];
 	slab_unlock(c->page);
 	return object;
 }
@@ -1578,7 +1547,7 @@ static void __always_inline *slab_alloc(
 
 	else {
 		object = c->freelist;
-		c->freelist = object[c->page->offset];
+		c->freelist = object[c->offset];
 	}
 	local_irq_restore(flags);
 
@@ -1611,7 +1580,7 @@ EXPORT_SYMBOL(kmem_cache_alloc_node);
  * handling required then we can return immediately.
  */
 static void __slab_free(struct kmem_cache *s, struct page *page,
-					void *x, void *addr)
+				void *x, void *addr, unsigned int offset)
 {
 	void *prior;
 	void **object = (void *)x;
@@ -1621,7 +1590,7 @@ static void __slab_free(struct kmem_cach
 	if (unlikely(SlabDebug(page)))
 		goto debug;
 checks_ok:
-	prior = object[page->offset] = page->freelist;
+	prior = object[offset] = page->freelist;
 	page->freelist = object;
 	page->inuse--;
 
@@ -1682,10 +1651,10 @@ static void __always_inline slab_free(st
 	debug_check_no_locks_freed(object, s->objsize);
 	c = get_cpu_slab(s, smp_processor_id());
 	if (likely(page == c->page && !SlabDebug(page))) {
-		object[page->offset] = c->freelist;
+		object[c->offset] = c->freelist;
 		c->freelist = object;
 	} else
-		__slab_free(s, page, x, addr);
+		__slab_free(s, page, x, addr, c->offset);
 
 	local_irq_restore(flags);
 }
@@ -1777,14 +1746,6 @@ static inline int slab_order(int size, i
 	int rem;
 	int min_order = slub_min_order;
 
-	/*
-	 * If we would create too many object per slab then reduce
-	 * the slab order even if it goes below slub_min_order.
-	 */
-	while (min_order > 0 &&
-		(PAGE_SIZE << min_order) >= MAX_OBJECTS_PER_SLAB * size)
-			min_order--;
-
 	for (order = max(min_order,
 				fls(min_objects * size - 1) - PAGE_SHIFT);
 			order <= max_order; order++) {
@@ -1799,9 +1760,6 @@ static inline int slab_order(int size, i
 		if (rem <= slab_size / fract_leftover)
 			break;
 
-		/* If the next size is too high then exit now */
-		if (slab_size * 2 >= MAX_OBJECTS_PER_SLAB * size)
-			break;
 	}
 
 	return order;
@@ -1881,6 +1839,7 @@ static void init_kmem_cache_cpu(struct k
 {
 	c->page = NULL;
 	c->freelist = NULL;
+	c->offset = s->offset / sizeof(void *);
 	c->node = 0;
 }
 
@@ -2113,14 +2072,7 @@ static int calculate_sizes(struct kmem_c
 	 */
 	s->objects = (PAGE_SIZE << s->order) / size;
 
-	/*
-	 * Verify that the number of objects is within permitted limits.
-	 * The page->inuse field is only 16 bit wide! So we cannot have
-	 * more than 64k objects per slab.
-	 */
-	if (!s->objects || s->objects > MAX_OBJECTS_PER_SLAB)
-		return 0;
-	return 1;
+	return !!s->objects;
 
 }
 

-- 


* [patch 4/6] SLUB: Avoid touching page struct when freeing to per cpu slab
  2007-08-23  6:46 [patch 0/6] Per cpu structures for SLUB Christoph Lameter
                   ` (2 preceding siblings ...)
  2007-08-23  6:46 ` [patch 3/6] SLUB: Move page->offset to kmem_cache_cpu->offset Christoph Lameter
@ 2007-08-23  6:46 ` Christoph Lameter
  2007-08-23  6:46 ` [patch 5/6] SLUB: Place kmem_cache_cpu structures in a NUMA aware way Christoph Lameter
                   ` (2 subsequent siblings)
  6 siblings, 0 replies; 10+ messages in thread
From: Christoph Lameter @ 2007-08-23  6:46 UTC (permalink / raw)
  To: akpm; +Cc: linux-kernel, linux-mm, Pekka Enberg

[-- Attachment #1: 0008-SLUB-Avoid-touching-page-struct-when-freeing-to-per.patch --]
[-- Type: text/plain, Size: 1334 bytes --]

Set c->node to -1 if we allocate from a debug slab. The free path can then
check c->node instead of SlabDebug(page), which would require access to the
page struct cacheline.

Signed-off-by: Christoph Lameter <clameter@sgi.com>
---
 mm/slub.c |    6 +++---
 1 file changed, 3 insertions(+), 3 deletions(-)

Index: linux-2.6.23-rc3-mm1/mm/slub.c
===================================================================
--- linux-2.6.23-rc3-mm1.orig/mm/slub.c	2007-08-22 17:20:28.000000000 -0700
+++ linux-2.6.23-rc3-mm1/mm/slub.c	2007-08-22 17:20:33.000000000 -0700
@@ -1517,6 +1517,7 @@ debug:
 
 	c->page->inuse++;
 	c->page->freelist = object[c->offset];
+	c->node = -1;
 	slab_unlock(c->page);
 	return object;
 }
@@ -1540,8 +1541,7 @@ static void __always_inline *slab_alloc(
 
 	local_irq_save(flags);
 	c = get_cpu_slab(s, smp_processor_id());
-	if (unlikely(!c->page || !c->freelist ||
-					!node_match(c, node)))
+	if (unlikely(!c->freelist || !node_match(c, node)))
 
 		object = __slab_alloc(s, gfpflags, node, addr, c);
 
@@ -1650,7 +1650,7 @@ static void __always_inline slab_free(st
 	local_irq_save(flags);
 	debug_check_no_locks_freed(object, s->objsize);
 	c = get_cpu_slab(s, smp_processor_id());
-	if (likely(page == c->page && !SlabDebug(page))) {
+	if (likely(page == c->page && c->node >= 0)) {
 		object[c->offset] = c->freelist;
 		c->freelist = object;
 	} else

-- 


* [patch 5/6] SLUB: Place kmem_cache_cpu structures in a NUMA aware way
  2007-08-23  6:46 [patch 0/6] Per cpu structures for SLUB Christoph Lameter
                   ` (3 preceding siblings ...)
  2007-08-23  6:46 ` [patch 4/6] SLUB: Avoid touching page struct when freeing to per cpu slab Christoph Lameter
@ 2007-08-23  6:46 ` Christoph Lameter
  2007-08-23  6:46 ` [patch 6/6] SLUB: Optimize cacheline use for zeroing Christoph Lameter
  2007-08-24 21:38 ` [patch 0/6] Per cpu structures for SLUB Andrew Morton
  6 siblings, 0 replies; 10+ messages in thread
From: Christoph Lameter @ 2007-08-23  6:46 UTC (permalink / raw)
  To: akpm; +Cc: linux-kernel, linux-mm, Pekka Enberg

[-- Attachment #1: 0009-SLUB-Place-kmem_cache_cpu-structures-in-a-NUMA-awar.patch --]
[-- Type: text/plain, Size: 9199 bytes --]

The kmem_cache_cpu structures introduced are currently an array placed in the
kmem_cache struct. This means the kmem_cache_cpu structures are overwhelmingly
on the wrong node for systems with a larger number of nodes. These are
performance critical structures since the per cpu information has to be
touched for every alloc and free in a slab.

In order to place the kmem_cache_cpu structure optimally we put an array
of pointers to kmem_cache_cpu structs in kmem_cache (similar to SLAB).

However, the kmem_cache_cpu structures can now be allocated in a more
intelligent way.

We would like to put per cpu structures for the same cpu but different
slab caches into cachelines together to save space and decrease the cache
footprint. However, the slab allocator itself controls only allocations
per node. We set up a simple per cpu array for every processor with
100 per cpu structures, which is usually enough to get them all set up right.
If we run out then we fall back to kmalloc_node. This also solves the
bootstrap problem since we do not have to use slab allocator functions
early in boot to get memory for the small per cpu structures.

Pro:
	- NUMA aware placement improves memory performance
	- All global structures in struct kmem_cache become readonly
	- Dense packing of per cpu structures reduces cacheline
	  footprint in SMP and NUMA.
	- Potential avoidance of exclusive cacheline fetches
	  on the free and alloc hotpath since multiple kmem_cache_cpu
	  structures are in one cacheline. This is particularly important
	  for the kmalloc array.

Cons:
	- Additional reference to one read only cacheline (per cpu
	  array of pointers to kmem_cache_cpu) in both slab_alloc()
	  and slab_free().

Signed-off-by: Christoph Lameter <clameter@sgi.com>
---
 include/linux/slub_def.h |    9 +-
 mm/slub.c                |  162 ++++++++++++++++++++++++++++++++++++++++++-----
 2 files changed, 154 insertions(+), 17 deletions(-)

Index: linux-2.6.23-rc3-mm1/include/linux/slub_def.h
===================================================================
--- linux-2.6.23-rc3-mm1.orig/include/linux/slub_def.h	2007-08-22 17:23:29.000000000 -0700
+++ linux-2.6.23-rc3-mm1/include/linux/slub_def.h	2007-08-22 17:23:47.000000000 -0700
@@ -16,8 +16,7 @@ struct kmem_cache_cpu {
 	struct page *page;
 	int node;
 	unsigned int offset;
-	/* Lots of wasted space */
-} ____cacheline_aligned_in_smp;
+};
 
 struct kmem_cache_node {
 	spinlock_t list_lock;	/* Protect partial list and nr_partial */
@@ -62,7 +61,11 @@ struct kmem_cache {
 	int defrag_ratio;
 	struct kmem_cache_node *node[MAX_NUMNODES];
 #endif
-	struct kmem_cache_cpu cpu_slab[NR_CPUS];
+#ifdef CONFIG_SMP
+	struct kmem_cache_cpu *cpu_slab[NR_CPUS];
+#else
+	struct kmem_cache_cpu cpu_slab;
+#endif
 };
 
 /*
Index: linux-2.6.23-rc3-mm1/mm/slub.c
===================================================================
--- linux-2.6.23-rc3-mm1.orig/mm/slub.c	2007-08-22 17:23:40.000000000 -0700
+++ linux-2.6.23-rc3-mm1/mm/slub.c	2007-08-22 17:23:47.000000000 -0700
@@ -276,7 +276,11 @@ static inline struct kmem_cache_node *ge
 
 static inline struct kmem_cache_cpu *get_cpu_slab(struct kmem_cache *s, int cpu)
 {
-	return &s->cpu_slab[cpu];
+#ifdef CONFIG_SMP
+	return s->cpu_slab[cpu];
+#else
+	return &s->cpu_slab;
+#endif
 }
 
 static inline int check_valid_pointer(struct kmem_cache *s,
@@ -1843,16 +1847,6 @@ static void init_kmem_cache_cpu(struct k
 	c->node = 0;
 }
 
-static inline int alloc_kmem_cache_cpus(struct kmem_cache *s, gfp_t flags)
-{
-	int cpu;
-
-	for_each_possible_cpu(cpu)
-		init_kmem_cache_cpu(s, get_cpu_slab(s, cpu));
-
-	return 1;
-}
-
 static void init_kmem_cache_node(struct kmem_cache_node *n)
 {
 	n->nr_partial = 0;
@@ -1864,6 +1858,125 @@ static void init_kmem_cache_node(struct 
 #endif
 }
 
+#ifdef CONFIG_SMP
+/*
+ * Per cpu array for per cpu structures.
+ *
+ * The per cpu array places all kmem_cache_cpu structures from one processor
+ * close together meaning that it becomes possible that multiple per cpu
+ * structures are contained in one cacheline. This may be particularly
+ * beneficial for the kmalloc caches.
+ *
+ * A desktop system typically has around 60-80 slabs. With 100 here we are
+ * likely able to get per cpu structures for all caches from the array defined
+ * here. We must be able to cover all kmalloc caches during bootstrap.
+ *
+ * If the per cpu array is exhausted then fall back to kmalloc
+ * of individual cachelines. No sharing is possible then.
+ */
+#define NR_KMEM_CACHE_CPU 100
+
+static DEFINE_PER_CPU(struct kmem_cache_cpu,
+				kmem_cache_cpu)[NR_KMEM_CACHE_CPU];
+
+static DEFINE_PER_CPU(struct kmem_cache_cpu *, kmem_cache_cpu_free);
+
+static struct kmem_cache_cpu *alloc_kmem_cache_cpu(struct kmem_cache *s,
+							int cpu, gfp_t flags)
+{
+	struct kmem_cache_cpu *c = per_cpu(kmem_cache_cpu_free, cpu);
+
+	if (c)
+		per_cpu(kmem_cache_cpu_free, cpu) =
+				(void *)c->freelist;
+	else {
+		/* Table overflow: So allocate ourselves */
+		c = kmalloc_node(
+			ALIGN(sizeof(struct kmem_cache_cpu), cache_line_size()),
+			flags, cpu_to_node(cpu));
+		if (!c)
+			return NULL;
+	}
+
+	init_kmem_cache_cpu(s, c);
+	return c;
+}
+
+static void free_kmem_cache_cpu(struct kmem_cache_cpu *c, int cpu)
+{
+	if (c < per_cpu(kmem_cache_cpu, cpu) ||
+			c > per_cpu(kmem_cache_cpu, cpu) + NR_KMEM_CACHE_CPU) {
+		kfree(c);
+		return;
+	}
+	c->freelist = (void *)per_cpu(kmem_cache_cpu_free, cpu);
+	per_cpu(kmem_cache_cpu_free, cpu) = c;
+}
+
+static void free_kmem_cache_cpus(struct kmem_cache *s)
+{
+	int cpu;
+
+	for_each_online_cpu(cpu) {
+		struct kmem_cache_cpu *c = get_cpu_slab(s, cpu);
+
+		if (c) {
+			s->cpu_slab[cpu] = NULL;
+			free_kmem_cache_cpu(c, cpu);
+		}
+	}
+}
+
+static int alloc_kmem_cache_cpus(struct kmem_cache *s, gfp_t flags)
+{
+	int cpu;
+
+	for_each_online_cpu(cpu) {
+		struct kmem_cache_cpu *c = get_cpu_slab(s, cpu);
+
+		if (c)
+			continue;
+
+		c = alloc_kmem_cache_cpu(s, cpu, flags);
+		if (!c) {
+			free_kmem_cache_cpus(s);
+			return 0;
+		}
+		s->cpu_slab[cpu] = c;
+	}
+	return 1;
+}
+
+/*
+ * Initialize the per cpu array.
+ */
+static void init_alloc_cpu_cpu(int cpu)
+{
+	int i;
+
+	for (i = NR_KMEM_CACHE_CPU - 1; i >= 0; i--)
+		free_kmem_cache_cpu(&per_cpu(kmem_cache_cpu, cpu)[i], cpu);
+}
+
+static void __init init_alloc_cpu(void)
+{
+	int cpu;
+
+	for_each_online_cpu(cpu)
+		init_alloc_cpu_cpu(cpu);
+  }
+
+#else
+static inline void free_kmem_cache_cpus(struct kmem_cache *s) {}
+static inline void init_alloc_cpu(void) {}
+
+static inline int alloc_kmem_cache_cpus(struct kmem_cache *s, gfp_t flags)
+{
+	init_kmem_cache_cpu(s, &s->cpu_slab);
+	return 1;
+}
+#endif
+
 #ifdef CONFIG_NUMA
 /*
  * No kmalloc_node yet so do it by hand. We know that this is the first
@@ -1871,7 +1984,8 @@ static void init_kmem_cache_node(struct 
  * possible.
  *
  * Note that this function only works on the kmalloc_node_cache
- * when allocating for the kmalloc_node_cache.
+ * when allocating for the kmalloc_node_cache. This is used for bootstrapping
+ * memory on a fresh node that has no slab structures yet.
  */
 static struct kmem_cache_node *early_kmem_cache_node_alloc(gfp_t gfpflags,
 							   int node)
@@ -2102,6 +2216,7 @@ static int kmem_cache_open(struct kmem_c
 	if (alloc_kmem_cache_cpus(s, gfpflags & ~SLUB_DMA))
 		return 1;
 
+	free_kmem_cache_nodes(s);
 error:
 	if (flags & SLAB_PANIC)
 		panic("Cannot create slab %s size=%lu realsize=%u "
@@ -2184,6 +2299,7 @@ static inline int kmem_cache_close(struc
 	flush_all(s);
 
 	/* Attempt to free all objects */
+	free_kmem_cache_cpus(s);
 	for_each_node_state(node, N_NORMAL_MEMORY) {
 		struct kmem_cache_node *n = get_node(s, node);
 
@@ -2580,6 +2696,8 @@ void __init kmem_cache_init(void)
 		slub_min_objects = DEFAULT_ANTIFRAG_MIN_OBJECTS;
 	}
 
+	init_alloc_cpu();
+
 #ifdef CONFIG_NUMA
 	/*
 	 * Must first have the slab cache available for the allocations of the
@@ -2640,10 +2758,12 @@ void __init kmem_cache_init(void)
 
 #ifdef CONFIG_SMP
 	register_cpu_notifier(&slab_notifier);
+	kmem_size = offsetof(struct kmem_cache, cpu_slab) +
+				nr_cpu_ids * sizeof(struct kmem_cache_cpu *);
+#else
+	kmem_size = sizeof(struct kmem_cache);
 #endif
 
-	kmem_size = offsetof(struct kmem_cache, cpu_slab) +
-				nr_cpu_ids * sizeof(struct kmem_cache_cpu);
 
 	printk(KERN_INFO "SLUB: Genslabs=%d, HWalign=%d, Order=%d-%d, MinObjects=%d,"
 		" CPUs=%d, Nodes=%d\n",
@@ -2771,15 +2891,29 @@ static int __cpuinit slab_cpuup_callback
 	unsigned long flags;
 
 	switch (action) {
+	case CPU_UP_PREPARE:
+	case CPU_UP_PREPARE_FROZEN:
+		init_alloc_cpu_cpu(cpu);
+		down_read(&slub_lock);
+		list_for_each_entry(s, &slab_caches, list)
+			s->cpu_slab[cpu] = alloc_kmem_cache_cpu(s, cpu,
+							GFP_KERNEL);
+		up_read(&slub_lock);
+		break;
+
 	case CPU_UP_CANCELED:
 	case CPU_UP_CANCELED_FROZEN:
 	case CPU_DEAD:
 	case CPU_DEAD_FROZEN:
 		down_read(&slub_lock);
 		list_for_each_entry(s, &slab_caches, list) {
+			struct kmem_cache_cpu *c = get_cpu_slab(s, cpu);
+
 			local_irq_save(flags);
 			__flush_cpu_slab(s, cpu);
 			local_irq_restore(flags);
+			free_kmem_cache_cpu(c, cpu);
+			s->cpu_slab[cpu] = NULL;
 		}
 		up_read(&slub_lock);
 		break;

-- 


* [patch 6/6] SLUB: Optimize cacheline use for zeroing
  2007-08-23  6:46 [patch 0/6] Per cpu structures for SLUB Christoph Lameter
                   ` (4 preceding siblings ...)
  2007-08-23  6:46 ` [patch 5/6] SLUB: Place kmem_cache_cpu structures in a NUMA aware way Christoph Lameter
@ 2007-08-23  6:46 ` Christoph Lameter
  2007-08-24 21:38 ` [patch 0/6] Per cpu structures for SLUB Andrew Morton
  6 siblings, 0 replies; 10+ messages in thread
From: Christoph Lameter @ 2007-08-23  6:46 UTC (permalink / raw)
  To: akpm; +Cc: linux-kernel, linux-mm, Pekka Enberg

[-- Attachment #1: 0010-SLUB-Optimize-cacheline-use-for-zeroing.patch --]
[-- Type: text/plain, Size: 2592 bytes --]

We touch a cacheline in the kmem_cache structure for zeroing to get the
object size. However, the hot paths in slab_alloc and slab_free do not
reference any other fields in kmem_cache, so we would bring in that
cacheline just for this one access.

Add a new field to kmem_cache_cpu that contains the object size. That
cacheline is already used in the hot paths, so we save one cacheline
fetch on every slab_alloc if we zero.

We need to update the kmem_cache_cpu object size if an aliasing operation
changes the objsize of a non-debug slab.
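
A hypothetical illustration of that aliasing case (cache names and sizes
made up for the example):

	/*
	 * Two caches with compatible layout may be merged by find_mergeable().
	 * If the alias is created with a larger object size then s->objsize
	 * grows, and the objsize copies cached in every kmem_cache_cpu must
	 * be updated as well, or kzalloc() through the alias would zero too
	 * few bytes.
	 */
	struct kmem_cache *a = kmem_cache_create("cache_a", 100, 0, 0, NULL);
	/* May return the cache created above, with objsize raised to 120. */
	struct kmem_cache *b = kmem_cache_create("cache_b", 120, 0, 0, NULL);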

Signed-off-by: Christoph Lameter <clameter@sgi.com>
---
 include/linux/slub_def.h |    1 +
 mm/slub.c                |   14 ++++++++++++--
 2 files changed, 13 insertions(+), 2 deletions(-)

Index: linux-2.6.23-rc3-mm1/include/linux/slub_def.h
===================================================================
--- linux-2.6.23-rc3-mm1.orig/include/linux/slub_def.h	2007-08-22 17:23:47.000000000 -0700
+++ linux-2.6.23-rc3-mm1/include/linux/slub_def.h	2007-08-22 17:23:50.000000000 -0700
@@ -16,6 +16,7 @@ struct kmem_cache_cpu {
 	struct page *page;
 	int node;
 	unsigned int offset;
+	unsigned int objsize;
 };
 
 struct kmem_cache_node {
Index: linux-2.6.23-rc3-mm1/mm/slub.c
===================================================================
--- linux-2.6.23-rc3-mm1.orig/mm/slub.c	2007-08-22 17:23:47.000000000 -0700
+++ linux-2.6.23-rc3-mm1/mm/slub.c	2007-08-22 17:23:50.000000000 -0700
@@ -1556,7 +1556,7 @@ static void __always_inline *slab_alloc(
 	local_irq_restore(flags);
 
 	if (unlikely((gfpflags & __GFP_ZERO) && object))
-		memset(object, 0, s->objsize);
+		memset(object, 0, c->objsize);
 
 	return object;
 }
@@ -1843,8 +1843,9 @@ static void init_kmem_cache_cpu(struct k
 {
 	c->page = NULL;
 	c->freelist = NULL;
-	c->offset = s->offset / sizeof(void *);
 	c->node = 0;
+	c->offset = s->offset / sizeof(void *);
+	c->objsize = s->objsize;
 }
 
 static void init_kmem_cache_node(struct kmem_cache_node *n)
@@ -2842,12 +2843,21 @@ struct kmem_cache *kmem_cache_create(con
 	down_write(&slub_lock);
 	s = find_mergeable(size, align, flags, ctor);
 	if (s) {
+		int cpu;
+
 		s->refcount++;
 		/*
 		 * Adjust the object sizes so that we clear
 		 * the complete object on kzalloc.
 		 */
 		s->objsize = max(s->objsize, (int)size);
+
+		/*
+		 * And then we need to update the object size in the
+		 * per cpu structures
+		 */
+		for_each_online_cpu(cpu)
+			get_cpu_slab(s, cpu)->objsize = s->objsize;
 		s->inuse = max_t(int, s->inuse, ALIGN(size, sizeof(void *)));
 		up_write(&slub_lock);
 		if (sysfs_slab_alias(s, name))

-- 


* Re: [patch 0/6] Per cpu structures for SLUB
  2007-08-23  6:46 [patch 0/6] Per cpu structures for SLUB Christoph Lameter
                   ` (5 preceding siblings ...)
  2007-08-23  6:46 ` [patch 6/6] SLUB: Optimize cacheline use for zeroing Christoph Lameter
@ 2007-08-24 21:38 ` Andrew Morton
  2007-08-27 18:50   ` Christoph Lameter
  6 siblings, 1 reply; 10+ messages in thread
From: Andrew Morton @ 2007-08-24 21:38 UTC (permalink / raw)
  To: Christoph Lameter; +Cc: linux-kernel, linux-mm, Pekka Enberg

On Wed, 22 Aug 2007 23:46:53 -0700
Christoph Lameter <clameter@sgi.com> wrote:

> The following patchset introduces per cpu structures for SLUB. These
> are very small (and multiples of these may fit into one cacheline)
> and (apart from performance improvements) allow us to address
> several issues in SLUB:
> 
> 1. The number of objects per slab is no longer limited to a 16 bit
>    number.
> 
> 2. Room is freed up in the page struct. We can avoid using the
> mapping field, which lets us get rid of the #ifdef CONFIG_SLUB
>    in page_mapping().
> 
> 3. We will have an easier time adding new things like Peter Z.'s reserve
>    management.
> 
> The RFC for this patchset was discussed on lkml a while ago:
> 
> http://marc.info/?l=linux-kernel&m=118386677704534&w=2
> 
> (And no, this patchset does not include the use of cmpxchg_local that
> we discussed recently on lkml nor the cmpxchg implementation
> mentioned in the RFC)
> 
> Performance
> -----------
> 
> 
> Norm = 2.6.23-rc3
> PCPU = Adds page allocator pass through plus per cpu structure patches
> 
> 
> IA64 8p 4n NUMA Altix
> 
>             Single threaded               Concurrent Alloc
> 
> 	Kmalloc		Alloc/Free	Kmalloc         Alloc/Free
>  Size	Norm   PCPU	Norm   PCPU	Norm   PCPU	Norm   PCPU
> -------------------------------------------------------------------
>     8	132	84	93	104	98	90	95	106
>    16    98	92	93	104	115	98	95	106
>    32   112	105	93	104	146	111	95	106
>    64	119	112	93	104	214	133	95	106
>   128   132	119	94	104	321	163	95	106
>   256+  83255	176	106	115	415	224	108	117
>   512   191	176	106	115	487	341	108	117
>  1024   252	246	106	115	937	609	108	117
>  2048   308	292	107	115	2494	1207	108	117
>  4096   341	319	107	115	2497	1217	108	117
>  8192   402	380	107	115	2367	1188	108	117
> 16384*  560	474	106	434	4464	1904	108	478
> 
> X86_64 2p SMP (Dual Core Pentium 940)
> 
>          Single threaded                   Concurrent Alloc
> 
>         Kmalloc         Alloc/Free      Kmalloc         Alloc/Free
>  Size   Norm   PCPU     Norm   PCPU     Norm   PCPU     Norm   PCPU
> --------------------------------------------------------------------
>     8	313	227	314	324	207	208	314	323
>    16   202	203	315	324	209	211	312	321
>    32	212	207	314	324	251	243	312	321
>    64	240	237	314	326	329	306	312	321
>   128	301	302	314	324	511	416	313	324
>   256   498	554	327	332	970	837	326	332
>   512   532	553	324	332	1025	932	326	335
>  1024   705	718	325	333	1489	1231	324	330
>  2048   764	767	324	334	2708	2175	324	332
>  4096* 1033	476	325	674	4727	782	324	678

I'm struggling a bit to understand these numbers.  Bigger is better, I
assume?  In what units are these numbers?

> Notes:
> 
> Worst case:
> -----------
> We generally lose in the alloc/free test (x86_64 3%, IA64 5-10%)
> since the processing overhead increases because we need to look up
> the per cpu structure. Alloc/Free is simply kfree(kmalloc(size, mask)),
> i.e. objects with the shortest lifetime possible. We would never use
> objects in that way, but the measurement is important to show the worst
> case overhead created.
> 
> Single Threaded:
> ----------------
> The single threaded kmalloc test shows behavior of a continual stream
> of allocation without contention. In the SMP case the losses are minimal.
> In the NUMA case we already have a winner there because the per cpu structure
> is placed local to the processor. So in the single threaded case we already
> win around 5% just by placing things better.
> 
> Concurrent Alloc:
> -----------------
> We have varying gains of up to 50% on NUMA because we are now never updating
> a cacheline used by the other processor and the data structures are local
> to the processor.
> 
> The SMP case shows gains but they are smaller (especially since
> this is the smallest SMP system possible.... 2 CPUs). So only up
> to 25%.
> 
> Page allocator pass through
> ---------------------------
> There is a significant difference in the columns marked with a * because
> of the way that allocations for page sized objects are handled.

OK, but what happened to the third pair of columns (Concurrent Alloc,
Kmalloc) for 1024 and 2048-byte allocations?  They seem to have become
significantly slower?

Thanks for running the numbers, but it's still a bit hard to work out
whether these changes are an aggregate benefit?

> If we handle
> the allocations in the slab allocator (Norm) then the alloc/free test
> results are superb since we can use the per cpu slab to just pass a pointer
> back and forth. The page allocator pass through (PCPU) shows that the page
> allocator may have problems with giving back the same page after a free.
> Or there is something else in the page allocator that creates significant
> overhead compared to slab. Needs to be checked out I guess.
> 
> However, the page allocator pass through is a win in the other cases
> since we can cut out the page allocator overhead. That is the more typical
> load of allocating a sequence of objects and we should optimize for that.
> 
> (+ = Must be some cache artifact here or code crossing a TLB boundary.
> The result is reproducible)
> 

Most Linux machines are uniprocessor.  We should keep an eye on what effect
a change like this has on code size and performance for CONFIG_SMP=n
builds..




* Re: [patch 0/6] Per cpu structures for SLUB
  2007-08-24 21:38 ` [patch 0/6] Per cpu structures for SLUB Andrew Morton
@ 2007-08-27 18:50   ` Christoph Lameter
  2007-08-27 23:51     ` Andrew Morton
  0 siblings, 1 reply; 10+ messages in thread
From: Christoph Lameter @ 2007-08-27 18:50 UTC (permalink / raw)
  To: Andrew Morton; +Cc: linux-kernel, linux-mm, Pekka Enberg

On Fri, 24 Aug 2007, Andrew Morton wrote:

> I'm struggling a bit to understand these numbers.  Bigger is better, I
> assume?  In what units are these numbers?

No, less is better. These are cycle counts. Hmmm... We discussed these 
cycle counts so much in the last week that I forgot to mention that.

> > Page allocator pass through
> > ---------------------------
> > There is a significant difference in the columns marked with a * because
> > of the way that allocations for page sized objects are handled.
> 
> OK, but what happened to the third pair of columns (Concurrent Alloc,
> Kmalloc) for 1024 and 2048-byte allocations?  They seem to have become
> significantly slower?

There is a significant performance increase there: less is better here, so
e.g. 2708 -> 2175 cycles for 2048 byte allocations is a gain. That is the
main point of the patch.

> Thanks for running the numbers, but it's still a bit hard to work out
> whether these changes are an aggregate benefit?

There is a drawback because of the additional code introduced in the fast 
path. However, the regular kmalloc case shows improvements throughout. 
This is of particular importance for SMP systems. We see an improvement 
even for 2 processors.

> > If we handle
> > the allocations in the slab allocator (Norm) then the alloc/free test
> > results are superb since we can use the per cpu slab to just pass a pointer
> > back and forth. The page allocator pass through (PCPU) shows that the page
> > allocator may have problems with giving back the same page after a free.
> > Or there is something else in the page allocator that creates significant
> > overhead compared to slab. Needs to be checked out I guess.
> > 
> > However, the page allocator pass through is a win in the other cases
> > since we can cut out the page allocator overhead. That is the more typical
> > load of allocating a sequence of objects and we should optimize for that.
> > 
> > (+ = Must be some cache artifact here or code crossing a TLB boundary.
> > The result is reproducible)
> > 
> 
> Most Linux machines are uniprocessor.  We should keep an eye on what effect
> a change like this has on code size and performance for CONFIG_SMP=n
> builds..

There is an #ifdef around the per cpu structure management code. All of 
this will vanish (including the lookup of the per cpu address from the 
fast path) if SMP is off.



* Re: [patch 0/6] Per cpu structures for SLUB
  2007-08-27 18:50   ` Christoph Lameter
@ 2007-08-27 23:51     ` Andrew Morton
  0 siblings, 0 replies; 10+ messages in thread
From: Andrew Morton @ 2007-08-27 23:51 UTC (permalink / raw)
  To: Christoph Lameter; +Cc: linux-kernel, linux-mm, Pekka Enberg

On Mon, 27 Aug 2007 11:50:10 -0700 (PDT)
Christoph Lameter <clameter@sgi.com> wrote:

> On Fri, 24 Aug 2007, Andrew Morton wrote:
> 
> > I'm struggling a bit to understand these numbers.  Bigger is better, I
> > assume?  In what units are these numbers?
> 
> No, less is better. These are cycle counts. Hmmm... We discussed these 
> cycle counts so much in the last week that I forgot to mention that.
> 
> > > Page allocator pass through
> > > ---------------------------
> > > There is a significant difference in the columns marked with a * because
> > > of the way that allocations for page sized objects are handled.
> > 
> > OK, but what happened to the third pair of columns (Concurrent Alloc,
> > Kmalloc) for 1024 and 2048-byte allocations?  They seem to have become
> > significantly slower?
> 
> There is a significant performance increase there: less is better here, so
> e.g. 2708 -> 2175 cycles for 2048 byte allocations is a gain. That is the
> main point of the patch.
> 
> > Thanks for running the numbers, but it's still a bit hard to work out
> > whether these changes are an aggregate benefit?
> 
> There is a drawback because of the additional code introduced in the fast 
> path. However, the regular kmalloc case shows improvements throughout. 
> This is of particular importance for SMP systems. We see an improvement 
> even for 2 processors.

umm, OK.  When you have time, could you please whizz up a clearer
changelog for this one?


Thread overview: 10+ messages:
2007-08-23  6:46 [patch 0/6] Per cpu structures for SLUB Christoph Lameter
2007-08-23  6:46 ` [patch 1/6] SLUB: Avoid page struct cacheline bouncing due to remote frees to cpu slab Christoph Lameter
2007-08-23  6:46 ` [patch 2/6] SLUB: Do not use page->mapping Christoph Lameter
2007-08-23  6:46 ` [patch 3/6] SLUB: Move page->offset to kmem_cache_cpu->offset Christoph Lameter
2007-08-23  6:46 ` [patch 4/6] SLUB: Avoid touching page struct when freeing to per cpu slab Christoph Lameter
2007-08-23  6:46 ` [patch 5/6] SLUB: Place kmem_cache_cpu structures in a NUMA aware way Christoph Lameter
2007-08-23  6:46 ` [patch 6/6] SLUB: Optimize cacheline use for zeroing Christoph Lameter
2007-08-24 21:38 ` [patch 0/6] Per cpu structures for SLUB Andrew Morton
2007-08-27 18:50   ` Christoph Lameter
2007-08-27 23:51     ` Andrew Morton
