* [patch 0/6] Per cpu structures for SLUB
@ 2007-08-23 6:46 Christoph Lameter
2007-08-23 6:46 ` [patch 1/6] SLUB: Avoid page struct cacheline bouncing due to remote frees to cpu slab Christoph Lameter
` (6 more replies)
0 siblings, 7 replies; 10+ messages in thread
From: Christoph Lameter @ 2007-08-23 6:46 UTC (permalink / raw)
To: akpm; +Cc: linux-kernel, linux-mm, Pekka Enberg
The following patchset introduces per cpu structures for SLUB. These
are very small (and multiples of these may fit into one cacheline)
and (apart from performance improvements) allow us to address
several issues in SLUB:
1. The number of objects per slab is no longer limited to a 16 bit
number.
2. Room is freed up in the page struct. We can avoid using the
mapping field, which allows us to get rid of the #ifdef CONFIG_SLUB
in page_mapping().
3. We will have an easier time adding new things like Peter Z.'s reserve
management.
The RFC for this patchset was discussed on lkml a while ago:
http://marc.info/?l=linux-kernel&m=118386677704534&w=2
(And no, this patchset includes neither the use of cmpxchg_local that
we recently discussed on lkml nor the cmpxchg implementation
mentioned in the RFC.)
Performance
-----------
Norm = 2.6.23-rc3
PCPU = Adds page allocator pass through plus per cpu structure patches
IA64 8p 4n NUMA Altix
             Single threaded                Concurrent Alloc
          Kmalloc      Alloc/Free        Kmalloc      Alloc/Free
  Size   Norm   PCPU   Norm   PCPU      Norm   PCPU   Norm   PCPU
-------------------------------------------------------------------
     8    132     84     93    104        98     90     95    106
    16     98     92     93    104       115     98     95    106
    32    112    105     93    104       146    111     95    106
    64    119    112     93    104       214    133     95    106
   128    132    119     94    104       321    163     95    106
  256+  83255    176    106    115       415    224    108    117
   512    191    176    106    115       487    341    108    117
  1024    252    246    106    115       937    609    108    117
  2048    308    292    107    115      2494   1207    108    117
  4096    341    319    107    115      2497   1217    108    117
  8192    402    380    107    115      2367   1188    108    117
16384*    560    474    106    434      4464   1904    108    478
X86_64 2p SMP (Dual Core Pentium 940)
             Single threaded                Concurrent Alloc
          Kmalloc      Alloc/Free        Kmalloc      Alloc/Free
  Size   Norm   PCPU   Norm   PCPU      Norm   PCPU   Norm   PCPU
--------------------------------------------------------------------
     8    313    227    314    324       207    208    314    323
    16    202    203    315    324       209    211    312    321
    32    212    207    314    324       251    243    312    321
    64    240    237    314    326       329    306    312    321
   128    301    302    314    324       511    416    313    324
   256    498    554    327    332       970    837    326    332
   512    532    553    324    332      1025    932    326    335
  1024    705    718    325    333      1489   1231    324    330
  2048    764    767    324    334      2708   2175    324    332
 4096*   1033    476    325    674      4727    782    324    678
Notes:
Worst case:
-----------
We generally lose in the alloc/free test (x86_64 3%, IA64 5-10%) since
the processing overhead increases: we now need to look up the per cpu
structure. Alloc/Free is simply kfree(kmalloc(size, mask)), i.e. objects
with the shortest possible lifetime. We would never use objects in that
way, but the measurement is important because it shows the worst case
overhead created.
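For reference, the Alloc/Free column is produced by timing exactly that
back-to-back pattern. A minimal sketch of such a cycle measurement (not the
actual test module used for the numbers above; the function name and loop
count here are made up for illustration):

	#include <linux/slab.h>
	#include <asm/timex.h>

	/* Average cycles for one kmalloc()/kfree() pair of a given size */
	static unsigned long time_alloc_free(size_t size, unsigned long iterations)
	{
		unsigned long i;
		cycles_t start, end;

		start = get_cycles();
		for (i = 0; i < iterations; i++)
			kfree(kmalloc(size, GFP_KERNEL));
		end = get_cycles();

		return (unsigned long)(end - start) / iterations;
	}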
Single Threaded:
----------------
The single threaded kmalloc test shows the behavior of a continual stream
of allocations without contention. In the SMP case the losses are minimal.
In the NUMA case we already have a winner because the per cpu structure is
placed local to the processor. So in the single threaded case we already
win around 5% just by placing things better.
Concurrent Alloc:
-----------------
We see varying gains of up to 50% on NUMA because we now never update a
cacheline used by the other processor and the data structures are local
to the processor.
The SMP case shows gains too, but they are smaller, only up to 25%
(especially since this is the smallest SMP system possible: 2 CPUs).
Page allocator pass through
---------------------------
There is a significant difference in the columns marked with a * because
of the way that allocations for page sized objects are handled. If we handle
these allocations in the slab allocator (Norm) then the alloc/free test
results are superb since we can use the per cpu slab to just pass a pointer
back and forth. The page allocator pass through (PCPU) shows that the page
allocator may have problems with giving back the same page after a free,
or there is something else in the page allocator that creates significant
overhead compared to slab. Needs to be checked out I guess.
However, the page allocator pass through is a win in the other cases since
we can cut out the page allocator overhead. That is the more typical load
of allocating a sequence of objects, and we should optimize for that.
(+ = Must be some cache artifact here or code crossing a TLB boundary.
The result is reproducible.)
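As background on the page allocator pass through mentioned above: that is a
separate patch, but the rough idea, sketched here with an assumed threshold
of one page and a made-up helper name, is to route large kmalloc requests
directly to the page allocator instead of maintaining kmalloc slabs for them:

	/* Sketch only: the real pass through patch may use a different
	 * threshold and different helpers. */
	static inline void *kmalloc_pass_through(size_t size, gfp_t flags)
	{
		if (size >= PAGE_SIZE)
			return (void *)__get_free_pages(flags,
							get_order(size));
		return kmalloc(size, flags);
	}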
--
* [patch 1/6] SLUB: Avoid page struct cacheline bouncing due to remote frees to cpu slab
2007-08-23 6:46 [patch 0/6] Per cpu structures for SLUB Christoph Lameter
@ 2007-08-23 6:46 ` Christoph Lameter
2007-08-23 6:46 ` [patch 2/6] SLUB: Do not use page->mapping Christoph Lameter
` (5 subsequent siblings)
6 siblings, 0 replies; 10+ messages in thread
From: Christoph Lameter @ 2007-08-23 6:46 UTC (permalink / raw)
To: akpm; +Cc: linux-kernel, linux-mm, Pekka Enberg
[-- Attachment #1: 0005-SLUB-Avoid-page-struct-cacheline-bouncing-due-to-re.patch --]
[-- Type: text/plain, Size: 16455 bytes --]
A remote free may access the same page struct that also contains the lockless
freelist for the cpu slab. If objects have a short lifetime and are freed by
a different processor then remote frees back to the slab from which we are
currently allocating are frequent. The cacheline with the page struct needs
to be repeatedly acquired in exclusive mode by both the allocating thread and
the freeing thread. If this is frequent enough then performance will suffer
because of cacheline bouncing.
This patch puts the lockless_freelist pointer in its own cacheline. In
order to make that happen we introduce a per cpu structure called
kmem_cache_cpu.
Instead of keeping an array of pointers to page structs we now keep an array
of per cpu structures that, among other things, contain the pointer to the
lockless freelist. The freeing thread can then keep exclusive access to the
page struct cacheline while the allocating thread keeps exclusive access to
the cacheline containing the per cpu structure.
This works as long as the allocating cpu is able to service its request
from the lockless freelist. If the lockless freelist runs empty then the
allocating thread needs to acquire exclusive access to the cacheline with
the page struct and lock the slab.
The allocating thread will then check if new objects were freed to the cpu
slab. If so, it will keep the slab as the cpu slab and continue with the
recently remote-freed objects. So the allocating thread can take a series of
just-freed remote objects and dish them out again. Ideally allocations keep
recycling objects in the same slab this way, which leads to an ideal
allocation / remote free pattern.
The number of objects that can be handled in this way is limited by the
capacity of one slab. Increasing slab size via slub_min_objects/
slub_max_order may increase the number of objects and therefore performance.
If the allocating thread runs out of objects and finds that no objects were
put back by the remote processor then it will retrieve a new slab (from the
partial lists or from the page allocator) and start with a whole
new set of objects while the remote thread may still be freeing objects to
the old cpu slab. This may then repeat until the new slab is also exhausted.
If remote freeing has freed objects in the earlier slab then that earlier
slab will now be on the partial list and the allocating thread will pick it
next for allocation, so the loop is extended. However, both threads need to
take the list_lock to make the swizzling via the partial list happen.
It is likely that this kind of scheme will keep the set of objects being
passed around small enough to stay in the cpu caches, leading to increased
performance.
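In condensed form, the two hot paths then write to disjoint cachelines (a
sketch distilled from the patch below, not additional code):

	/* Allocating cpu (slab_alloc fast path): writes only its own
	 * kmem_cache_cpu cacheline plus the object itself. */
	object = c->freelist;
	c->freelist = object[c->page->offset];

	/* Remote free (__slab_free): writes only the page struct
	 * cacheline plus the object itself. */
	slab_lock(page);
	object[page->offset] = page->freelist;
	page->freelist = object;
	page->inuse--;
	slab_unlock(page);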
More code cleanups become possible:
- Instead of passing a cpu we can now pass a kmem_cache_cpu structure around.
Allows reducing the number of parameters to various functions.
- Can define a new node_match() function for NUMA to encapsulate locality
checks.
Effect on allocations:
Cachelines touched before this patch:
  Write: page struct and first cacheline of object
Cachelines touched after this patch:
  Write: kmem_cache_cpu cacheline and first cacheline of object
  Read: page struct (but see a later patch that avoids touching
        that cacheline)
The handling when the lockless alloc list runs empty gets to be a bit more
complicated since another cacheline has now to be written to. But that is
halfway out of the hot path.
Effect on freeing:
Cachelines touched before this patch:
  Write: page struct and first cacheline of object
Cachelines touched after this patch, depending on where we free:
  Write (to cpu slab): kmem_cache_cpu struct and first cacheline of object
  Write (to other):    page struct and first cacheline of object
  Read (to cpu slab):  page struct to identify the slab etc. (but see a later
                       patch that avoids touching the page struct on free)
  Read (to other):     cpu local kmem_cache_cpu struct to verify it's not
                       the cpu slab
Summary:
Pro:
- Distinct cachelines so that concurrent remote frees and local
allocs on a cpu slab can occur without cacheline bouncing.
- Avoids potentially bouncing cachelines because of neighboring
per cpu pointer updates in kmem_cache's cpu_slab array since
each entry now grows to a cacheline (therefore remove the comment
that talks about that concern).
Cons:
- Freeing objects now requires the reading of one additional
cacheline. That can be mitigated for some cases by the following
patches but it's not possible to completely eliminate these
references.
- Memory usage grows slightly.
The size of each per cpu object is blown up from one word
(pointing to the page struct) to one cacheline with various data.
So this is NR_CPUS*NR_SLABS*L1_BYTES more memory use. Let's say
NR_SLABS is 100 and the cache line size is 128 bytes; then we have just
increased slab metadata requirements by 12.8k per cpu.
(Another later patch reduces these requirements)
Signed-off-by: Christoph Lameter <clameter@sgi.com>
---
include/linux/slub_def.h | 9 +-
mm/slub.c | 192 ++++++++++++++++++++++++++++-------------------
2 files changed, 126 insertions(+), 75 deletions(-)
Index: linux-2.6.23-rc3-mm1/include/linux/slub_def.h
===================================================================
--- linux-2.6.23-rc3-mm1.orig/include/linux/slub_def.h 2007-08-22 17:14:32.000000000 -0700
+++ linux-2.6.23-rc3-mm1/include/linux/slub_def.h 2007-08-22 17:18:56.000000000 -0700
@@ -11,6 +11,13 @@
#include <linux/workqueue.h>
#include <linux/kobject.h>
+struct kmem_cache_cpu {
+ void **freelist;
+ struct page *page;
+ int node;
+ /* Lots of wasted space */
+} ____cacheline_aligned_in_smp;
+
struct kmem_cache_node {
spinlock_t list_lock; /* Protect partial list and nr_partial */
unsigned long nr_partial;
@@ -54,7 +61,7 @@ struct kmem_cache {
int defrag_ratio;
struct kmem_cache_node *node[MAX_NUMNODES];
#endif
- struct page *cpu_slab[NR_CPUS];
+ struct kmem_cache_cpu cpu_slab[NR_CPUS];
};
/*
Index: linux-2.6.23-rc3-mm1/mm/slub.c
===================================================================
--- linux-2.6.23-rc3-mm1.orig/mm/slub.c 2007-08-22 17:18:50.000000000 -0700
+++ linux-2.6.23-rc3-mm1/mm/slub.c 2007-08-22 17:20:05.000000000 -0700
@@ -90,7 +90,7 @@
* One use of this flag is to mark slabs that are
* used for allocations. Then such a slab becomes a cpu
* slab. The cpu slab may be equipped with an additional
- * lockless_freelist that allows lockless access to
+ * freelist that allows lockless access to
* free objects in addition to the regular freelist
* that requires the slab lock.
*
@@ -140,11 +140,6 @@ static inline void ClearSlabDebug(struct
/*
* Issues still to be resolved:
*
- * - The per cpu array is updated for each new slab and and is a remote
- * cacheline for most nodes. This could become a bouncing cacheline given
- * enough frequent updates. There are 16 pointers in a cacheline, so at
- * max 16 cpus could compete for the cacheline which may be okay.
- *
* - Support PAGE_ALLOC_DEBUG. Should be easy to do.
*
* - Variable sizing of the per node arrays
@@ -284,6 +279,11 @@ static inline struct kmem_cache_node *ge
#endif
}
+static inline struct kmem_cache_cpu *get_cpu_slab(struct kmem_cache *s, int cpu)
+{
+ return &s->cpu_slab[cpu];
+}
+
static inline int check_valid_pointer(struct kmem_cache *s,
struct page *page, const void *object)
{
@@ -1385,33 +1385,34 @@ static void unfreeze_slab(struct kmem_ca
/*
* Remove the cpu slab
*/
-static void deactivate_slab(struct kmem_cache *s, struct page *page, int cpu)
+static void deactivate_slab(struct kmem_cache *s, struct kmem_cache_cpu *c)
{
+ struct page *page = c->page;
/*
* Merge cpu freelist into freelist. Typically we get here
* because both freelists are empty. So this is unlikely
* to occur.
*/
- while (unlikely(page->lockless_freelist)) {
+ while (unlikely(c->freelist)) {
void **object;
/* Retrieve object from cpu_freelist */
- object = page->lockless_freelist;
- page->lockless_freelist = page->lockless_freelist[page->offset];
+ object = c->freelist;
+ c->freelist = c->freelist[page->offset];
/* And put onto the regular freelist */
object[page->offset] = page->freelist;
page->freelist = object;
page->inuse--;
}
- s->cpu_slab[cpu] = NULL;
+ c->page = NULL;
unfreeze_slab(s, page);
}
-static inline void flush_slab(struct kmem_cache *s, struct page *page, int cpu)
+static inline void flush_slab(struct kmem_cache *s, struct kmem_cache_cpu *c)
{
- slab_lock(page);
- deactivate_slab(s, page, cpu);
+ slab_lock(c->page);
+ deactivate_slab(s, c);
}
/*
@@ -1420,18 +1421,17 @@ static inline void flush_slab(struct kme
*/
static inline void __flush_cpu_slab(struct kmem_cache *s, int cpu)
{
- struct page *page = s->cpu_slab[cpu];
+ struct kmem_cache_cpu *c = get_cpu_slab(s, cpu);
- if (likely(page))
- flush_slab(s, page, cpu);
+ if (likely(c && c->page))
+ flush_slab(s, c);
}
static void flush_cpu_slab(void *d)
{
struct kmem_cache *s = d;
- int cpu = smp_processor_id();
- __flush_cpu_slab(s, cpu);
+ __flush_cpu_slab(s, smp_processor_id());
}
static void flush_all(struct kmem_cache *s)
@@ -1448,6 +1448,19 @@ static void flush_all(struct kmem_cache
}
/*
+ * Check if the objects in a per cpu structure fit numa
+ * locality expectations.
+ */
+static inline int node_match(struct kmem_cache_cpu *c, int node)
+{
+#ifdef CONFIG_NUMA
+ if (node != -1 && c->node != node)
+ return 0;
+#endif
+ return 1;
+}
+
+/*
* Slow path. The lockless freelist is empty or we need to perform
* debugging duties.
*
@@ -1465,45 +1478,46 @@ static void flush_all(struct kmem_cache
* we need to allocate a new slab. This is slowest path since we may sleep.
*/
static void *__slab_alloc(struct kmem_cache *s,
- gfp_t gfpflags, int node, void *addr, struct page *page)
+ gfp_t gfpflags, int node, void *addr, struct kmem_cache_cpu *c)
{
void **object;
- int cpu = smp_processor_id();
+ struct page *new;
- if (!page)
+ if (!c->page)
goto new_slab;
- slab_lock(page);
- if (unlikely(node != -1 && page_to_nid(page) != node))
+ slab_lock(c->page);
+ if (unlikely(!node_match(c, node)))
goto another_slab;
load_freelist:
- object = page->freelist;
+ object = c->page->freelist;
if (unlikely(!object))
goto another_slab;
- if (unlikely(SlabDebug(page)))
+ if (unlikely(SlabDebug(c->page)))
goto debug;
- object = page->freelist;
- page->lockless_freelist = object[page->offset];
- page->inuse = s->objects;
- page->freelist = NULL;
- slab_unlock(page);
+ object = c->page->freelist;
+ c->freelist = object[c->page->offset];
+ c->page->inuse = s->objects;
+ c->page->freelist = NULL;
+ c->node = page_to_nid(c->page);
+ slab_unlock(c->page);
return object;
another_slab:
- deactivate_slab(s, page, cpu);
+ deactivate_slab(s, c);
new_slab:
- page = get_partial(s, gfpflags, node);
- if (page) {
- s->cpu_slab[cpu] = page;
+ new = get_partial(s, gfpflags, node);
+ if (new) {
+ c->page = new;
goto load_freelist;
}
- page = new_slab(s, gfpflags, node);
- if (page) {
- cpu = smp_processor_id();
- if (s->cpu_slab[cpu]) {
+ new = new_slab(s, gfpflags, node);
+ if (new) {
+ c = get_cpu_slab(s, smp_processor_id());
+ if (c->page) {
/*
* Someone else populated the cpu_slab while we
* enabled interrupts, or we have gotten scheduled
@@ -1511,34 +1525,32 @@ new_slab:
* requested node even if __GFP_THISNODE was
* specified. So we need to recheck.
*/
- if (node == -1 ||
- page_to_nid(s->cpu_slab[cpu]) == node) {
+ if (node_match(c, node)) {
/*
* Current cpuslab is acceptable and we
* want the current one since its cache hot
*/
- discard_slab(s, page);
- page = s->cpu_slab[cpu];
- slab_lock(page);
+ discard_slab(s, new);
+ slab_lock(c->page);
goto load_freelist;
}
/* New slab does not fit our expectations */
- flush_slab(s, s->cpu_slab[cpu], cpu);
+ flush_slab(s, c);
}
- slab_lock(page);
- SetSlabFrozen(page);
- s->cpu_slab[cpu] = page;
+ slab_lock(new);
+ SetSlabFrozen(new);
+ c->page = new;
goto load_freelist;
}
return NULL;
debug:
- object = page->freelist;
- if (!alloc_debug_processing(s, page, object, addr))
+ object = c->page->freelist;
+ if (!alloc_debug_processing(s, c->page, object, addr))
goto another_slab;
- page->inuse++;
- page->freelist = object[page->offset];
- slab_unlock(page);
+ c->page->inuse++;
+ c->page->freelist = object[c->page->offset];
+ slab_unlock(c->page);
return object;
}
@@ -1555,20 +1567,20 @@ debug:
static void __always_inline *slab_alloc(struct kmem_cache *s,
gfp_t gfpflags, int node, void *addr)
{
- struct page *page;
void **object;
unsigned long flags;
+ struct kmem_cache_cpu *c;
local_irq_save(flags);
- page = s->cpu_slab[smp_processor_id()];
- if (unlikely(!page || !page->lockless_freelist ||
- (node != -1 && page_to_nid(page) != node)))
+ c = get_cpu_slab(s, smp_processor_id());
+ if (unlikely(!c->page || !c->freelist ||
+ !node_match(c, node)))
- object = __slab_alloc(s, gfpflags, node, addr, page);
+ object = __slab_alloc(s, gfpflags, node, addr, c);
else {
- object = page->lockless_freelist;
- page->lockless_freelist = object[page->offset];
+ object = c->freelist;
+ c->freelist = object[c->page->offset];
}
local_irq_restore(flags);
@@ -1666,13 +1678,14 @@ static void __always_inline slab_free(st
{
void **object = (void *)x;
unsigned long flags;
+ struct kmem_cache_cpu *c;
local_irq_save(flags);
debug_check_no_locks_freed(object, s->objsize);
- if (likely(page == s->cpu_slab[smp_processor_id()] &&
- !SlabDebug(page))) {
- object[page->offset] = page->lockless_freelist;
- page->lockless_freelist = object;
+ c = get_cpu_slab(s, smp_processor_id());
+ if (likely(page == c->page && !SlabDebug(page))) {
+ object[page->offset] = c->freelist;
+ c->freelist = object;
} else
__slab_free(s, page, x, addr);
@@ -1865,6 +1878,24 @@ static unsigned long calculate_alignment
return ALIGN(align, sizeof(void *));
}
+static void init_kmem_cache_cpu(struct kmem_cache *s,
+ struct kmem_cache_cpu *c)
+{
+ c->page = NULL;
+ c->freelist = NULL;
+ c->node = 0;
+}
+
+static inline int alloc_kmem_cache_cpus(struct kmem_cache *s, gfp_t flags)
+{
+ int cpu;
+
+ for_each_possible_cpu(cpu)
+ init_kmem_cache_cpu(s, get_cpu_slab(s, cpu));
+
+ return 1;
+}
+
static void init_kmem_cache_node(struct kmem_cache_node *n)
{
n->nr_partial = 0;
@@ -2115,8 +2146,12 @@ static int kmem_cache_open(struct kmem_c
#ifdef CONFIG_NUMA
s->defrag_ratio = 100;
#endif
- if (init_kmem_cache_nodes(s, gfpflags & ~SLUB_DMA))
+ if (!init_kmem_cache_nodes(s, gfpflags & ~SLUB_DMA))
+ goto error;
+
+ if (alloc_kmem_cache_cpus(s, gfpflags & ~SLUB_DMA))
return 1;
+
error:
if (flags & SLAB_PANIC)
panic("Cannot create slab %s size=%lu realsize=%u "
@@ -2658,7 +2693,7 @@ void __init kmem_cache_init(void)
#endif
kmem_size = offsetof(struct kmem_cache, cpu_slab) +
- nr_cpu_ids * sizeof(struct page *);
+ nr_cpu_ids * sizeof(struct kmem_cache_cpu);
printk(KERN_INFO "SLUB: Genslabs=%d, HWalign=%d, Order=%d-%d, MinObjects=%d,"
" CPUs=%d, Nodes=%d\n",
@@ -3261,11 +3296,14 @@ static unsigned long slab_objects(struct
per_cpu = nodes + nr_node_ids;
for_each_possible_cpu(cpu) {
- struct page *page = s->cpu_slab[cpu];
- int node;
+ struct page *page;
+ struct kmem_cache_cpu *c = get_cpu_slab(s, cpu);
+ if (!c)
+ continue;
+
+ page = c->page;
if (page) {
- node = page_to_nid(page);
if (flags & SO_CPU) {
int x = 0;
@@ -3274,9 +3312,9 @@ static unsigned long slab_objects(struct
else
x = 1;
total += x;
- nodes[node] += x;
+ nodes[c->node] += x;
}
- per_cpu[node]++;
+ per_cpu[c->node]++;
}
}
@@ -3322,13 +3360,19 @@ static int any_slab_objects(struct kmem_
int node;
int cpu;
- for_each_possible_cpu(cpu)
- if (s->cpu_slab[cpu])
+ for_each_possible_cpu(cpu) {
+ struct kmem_cache_cpu *c = get_cpu_slab(s, cpu);
+
+ if (c && c->page)
return 1;
+ }
- for_each_node(node) {
+ for_each_online_node(node) {
struct kmem_cache_node *n = get_node(s, node);
+ if (!n)
+ continue;
+
if (n->nr_partial || atomic_long_read(&n->nr_slabs))
return 1;
}
--
* [patch 2/6] SLUB: Do not use page->mapping
2007-08-23 6:46 [patch 0/6] Per cpu structures for SLUB Christoph Lameter
2007-08-23 6:46 ` [patch 1/6] SLUB: Avoid page struct cacheline bouncing due to remote frees to cpu slab Christoph Lameter
@ 2007-08-23 6:46 ` Christoph Lameter
2007-08-23 6:46 ` [patch 3/6] SLUB: Move page->offset to kmem_cache_cpu->offset Christoph Lameter
` (4 subsequent siblings)
6 siblings, 0 replies; 10+ messages in thread
From: Christoph Lameter @ 2007-08-23 6:46 UTC (permalink / raw)
To: akpm; +Cc: linux-kernel, linux-mm, Pekka Enberg
[-- Attachment #1: 0006-SLUB-Do-not-use-page-mapping.patch --]
[-- Type: text/plain, Size: 2029 bytes --]
After moving the lockless_freelist to kmem_cache_cpu we no longer need
page->lockless_freelist. Restructure the use of the struct page fields in
such a way that we never touch the mapping field.
This in turn allows us to remove the special casing of SLUB when determining
the mapping of a page (needed for corner cases on machines with virtual
caches that need to flush the caches of processors mapping a page).
Signed-off-by: Christoph Lameter <clameter@sgi.com>
---
include/linux/mm_types.h | 9 ++-------
mm/slub.c | 2 --
2 files changed, 2 insertions(+), 9 deletions(-)
Index: linux-2.6.23-rc3-mm1/include/linux/mm_types.h
===================================================================
--- linux-2.6.23-rc3-mm1.orig/include/linux/mm_types.h 2007-08-22 17:14:32.000000000 -0700
+++ linux-2.6.23-rc3-mm1/include/linux/mm_types.h 2007-08-22 17:20:13.000000000 -0700
@@ -62,13 +62,8 @@ struct page {
#if NR_CPUS >= CONFIG_SPLIT_PTLOCK_CPUS
spinlock_t ptl;
#endif
- struct { /* SLUB uses */
- void **lockless_freelist;
- struct kmem_cache *slab; /* Pointer to slab */
- };
- struct {
- struct page *first_page; /* Compound pages */
- };
+ struct kmem_cache *slab; /* SLUB: Pointer to slab */
+ struct page *first_page; /* Compound tail pages */
};
union {
pgoff_t index; /* Our offset within mapping. */
Index: linux-2.6.23-rc3-mm1/mm/slub.c
===================================================================
--- linux-2.6.23-rc3-mm1.orig/mm/slub.c 2007-08-22 17:20:05.000000000 -0700
+++ linux-2.6.23-rc3-mm1/mm/slub.c 2007-08-22 17:20:13.000000000 -0700
@@ -1125,7 +1125,6 @@ static struct page *new_slab(struct kmem
set_freepointer(s, last, NULL);
page->freelist = start;
- page->lockless_freelist = NULL;
page->inuse = 0;
out:
if (flags & __GFP_WAIT)
@@ -1151,7 +1150,6 @@ static void __free_slab(struct kmem_cach
NR_SLAB_RECLAIMABLE : NR_SLAB_UNRECLAIMABLE,
- pages);
- page->mapping = NULL;
__free_pages(page, s->order);
}
--
* [patch 3/6] SLUB: Move page->offset to kmem_cache_cpu->offset
2007-08-23 6:46 [patch 0/6] Per cpu structures for SLUB Christoph Lameter
2007-08-23 6:46 ` [patch 1/6] SLUB: Avoid page struct cacheline bouncing due to remote frees to cpu slab Christoph Lameter
2007-08-23 6:46 ` [patch 2/6] SLUB: Do not use page->mapping Christoph Lameter
@ 2007-08-23 6:46 ` Christoph Lameter
2007-08-23 6:46 ` [patch 4/6] SLUB: Avoid touching page struct when freeing to per cpu slab Christoph Lameter
` (3 subsequent siblings)
6 siblings, 0 replies; 10+ messages in thread
From: Christoph Lameter @ 2007-08-23 6:46 UTC (permalink / raw)
To: akpm; +Cc: linux-kernel, linux-mm, Pekka Enberg
[-- Attachment #1: 0007-SLUB-Move-page-offset-to-kmem_cache_cpu-offset.patch --]
[-- Type: text/plain, Size: 8512 bytes --]
We need the offset from the page struct during slab_alloc and slab_free. In
both cases we also reference the cacheline of the kmem_cache_cpu structure.
We can therefore move the offset field into the kmem_cache_cpu structure
freeing up 16 bits in the page struct.
Moving the offset allows an allocation from slab_alloc() without touching the
page struct in the hot path.
The only thing left in slab_free() that touches the page struct cacheline for
per cpu freeing is the checking of SlabDebug(page). The next patch deals with
that.
Use the available 16 bits to broaden page->inuse. More than 64k objects per
slab become possible and we can get rid of the checks for that limitation.
There is no need anymore to shrink the order of slabs if we boot with 2M
sized slabs (slub_min_order=9), and no need anymore to switch off the offset
calculation for very large slabs: the offset field in the kmem_cache_cpu
structure is 32 bits wide and can now handle slab sizes of up to 8GB.
Signed-off-by: Christoph Lameter <clameter@sgi.com>
---
include/linux/mm_types.h | 5 --
include/linux/slub_def.h | 1
mm/slub.c | 80 +++++++++--------------------------------------
3 files changed, 18 insertions(+), 68 deletions(-)
Index: linux-2.6.23-rc3-mm1/include/linux/mm_types.h
===================================================================
--- linux-2.6.23-rc3-mm1.orig/include/linux/mm_types.h 2007-08-22 17:20:13.000000000 -0700
+++ linux-2.6.23-rc3-mm1/include/linux/mm_types.h 2007-08-22 17:20:28.000000000 -0700
@@ -37,10 +37,7 @@ struct page {
* to show when page is mapped
* & limit reverse map searches.
*/
- struct { /* SLUB uses */
- short unsigned int inuse;
- short unsigned int offset;
- };
+ unsigned int inuse; /* SLUB: Nr of objects */
};
union {
struct {
Index: linux-2.6.23-rc3-mm1/include/linux/slub_def.h
===================================================================
--- linux-2.6.23-rc3-mm1.orig/include/linux/slub_def.h 2007-08-22 17:18:56.000000000 -0700
+++ linux-2.6.23-rc3-mm1/include/linux/slub_def.h 2007-08-22 17:23:29.000000000 -0700
@@ -15,6 +15,7 @@ struct kmem_cache_cpu {
void **freelist;
struct page *page;
int node;
+ unsigned int offset;
/* Lots of wasted space */
} ____cacheline_aligned_in_smp;
Index: linux-2.6.23-rc3-mm1/mm/slub.c
===================================================================
--- linux-2.6.23-rc3-mm1.orig/mm/slub.c 2007-08-22 17:20:13.000000000 -0700
+++ linux-2.6.23-rc3-mm1/mm/slub.c 2007-08-22 17:23:36.000000000 -0700
@@ -207,11 +207,6 @@ static inline void ClearSlabDebug(struct
#define ARCH_SLAB_MINALIGN __alignof__(unsigned long long)
#endif
-/*
- * The page->inuse field is 16 bit thus we have this limitation
- */
-#define MAX_OBJECTS_PER_SLAB 65535
-
/* Internal SLUB flags */
#define __OBJECT_POISON 0x80000000 /* Poison object */
#define __SYSFS_ADD_DEFERRED 0x40000000 /* Not yet visible via sysfs */
@@ -736,11 +731,6 @@ static int check_slab(struct kmem_cache
slab_err(s, page, "Not a valid slab page");
return 0;
}
- if (page->offset * sizeof(void *) != s->offset) {
- slab_err(s, page, "Corrupted offset %lu",
- (unsigned long)(page->offset * sizeof(void *)));
- return 0;
- }
if (page->inuse > s->objects) {
slab_err(s, page, "inuse %u > max %u",
s->name, page->inuse, s->objects);
@@ -879,8 +869,6 @@ bad:
slab_fix(s, "Marking all objects used");
page->inuse = s->objects;
page->freelist = NULL;
- /* Fix up fields that may be corrupted */
- page->offset = s->offset / sizeof(void *);
}
return 0;
}
@@ -996,30 +984,12 @@ __setup("slub_debug", setup_slub_debug);
static void kmem_cache_open_debug_check(struct kmem_cache *s)
{
/*
- * The page->offset field is only 16 bit wide. This is an offset
- * in units of words from the beginning of an object. If the slab
- * size is bigger then we cannot move the free pointer behind the
- * object anymore.
- *
- * On 32 bit platforms the limit is 256k. On 64bit platforms
- * the limit is 512k.
- *
- * Debugging or ctor may create a need to move the free
- * pointer. Fail if this happens.
+ * Enable debugging if selected on the kernel commandline.
*/
- if (s->objsize >= 65535 * sizeof(void *)) {
- BUG_ON(s->flags & (SLAB_RED_ZONE | SLAB_POISON |
- SLAB_STORE_USER | SLAB_DESTROY_BY_RCU));
- BUG_ON(s->ctor);
- }
- else
- /*
- * Enable debugging if selected on the kernel commandline.
- */
- if (slub_debug && (!slub_debug_slabs ||
- strncmp(slub_debug_slabs, s->name,
- strlen(slub_debug_slabs)) == 0))
- s->flags |= slub_debug;
+ if (slub_debug && (!slub_debug_slabs ||
+ strncmp(slub_debug_slabs, s->name,
+ strlen(slub_debug_slabs)) == 0))
+ s->flags |= slub_debug;
}
#else
static inline void setup_object_debug(struct kmem_cache *s,
@@ -1102,7 +1072,6 @@ static struct page *new_slab(struct kmem
n = get_node(s, page_to_nid(page));
if (n)
atomic_long_inc(&n->nr_slabs);
- page->offset = s->offset / sizeof(void *);
page->slab = s;
page->flags |= 1 << PG_slab;
if (s->flags & (SLAB_DEBUG_FREE | SLAB_RED_ZONE | SLAB_POISON |
@@ -1396,10 +1365,10 @@ static void deactivate_slab(struct kmem_
/* Retrieve object from cpu_freelist */
object = c->freelist;
- c->freelist = c->freelist[page->offset];
+ c->freelist = c->freelist[c->offset];
/* And put onto the regular freelist */
- object[page->offset] = page->freelist;
+ object[c->offset] = page->freelist;
page->freelist = object;
page->inuse--;
}
@@ -1495,7 +1464,7 @@ load_freelist:
goto debug;
object = c->page->freelist;
- c->freelist = object[c->page->offset];
+ c->freelist = object[c->offset];
c->page->inuse = s->objects;
c->page->freelist = NULL;
c->node = page_to_nid(c->page);
@@ -1547,7 +1516,7 @@ debug:
goto another_slab;
c->page->inuse++;
- c->page->freelist = object[c->page->offset];
+ c->page->freelist = object[c->offset];
slab_unlock(c->page);
return object;
}
@@ -1578,7 +1547,7 @@ static void __always_inline *slab_alloc(
else {
object = c->freelist;
- c->freelist = object[c->page->offset];
+ c->freelist = object[c->offset];
}
local_irq_restore(flags);
@@ -1611,7 +1580,7 @@ EXPORT_SYMBOL(kmem_cache_alloc_node);
* handling required then we can return immediately.
*/
static void __slab_free(struct kmem_cache *s, struct page *page,
- void *x, void *addr)
+ void *x, void *addr, unsigned int offset)
{
void *prior;
void **object = (void *)x;
@@ -1621,7 +1590,7 @@ static void __slab_free(struct kmem_cach
if (unlikely(SlabDebug(page)))
goto debug;
checks_ok:
- prior = object[page->offset] = page->freelist;
+ prior = object[offset] = page->freelist;
page->freelist = object;
page->inuse--;
@@ -1682,10 +1651,10 @@ static void __always_inline slab_free(st
debug_check_no_locks_freed(object, s->objsize);
c = get_cpu_slab(s, smp_processor_id());
if (likely(page == c->page && !SlabDebug(page))) {
- object[page->offset] = c->freelist;
+ object[c->offset] = c->freelist;
c->freelist = object;
} else
- __slab_free(s, page, x, addr);
+ __slab_free(s, page, x, addr, c->offset);
local_irq_restore(flags);
}
@@ -1777,14 +1746,6 @@ static inline int slab_order(int size, i
int rem;
int min_order = slub_min_order;
- /*
- * If we would create too many object per slab then reduce
- * the slab order even if it goes below slub_min_order.
- */
- while (min_order > 0 &&
- (PAGE_SIZE << min_order) >= MAX_OBJECTS_PER_SLAB * size)
- min_order--;
-
for (order = max(min_order,
fls(min_objects * size - 1) - PAGE_SHIFT);
order <= max_order; order++) {
@@ -1799,9 +1760,6 @@ static inline int slab_order(int size, i
if (rem <= slab_size / fract_leftover)
break;
- /* If the next size is too high then exit now */
- if (slab_size * 2 >= MAX_OBJECTS_PER_SLAB * size)
- break;
}
return order;
@@ -1881,6 +1839,7 @@ static void init_kmem_cache_cpu(struct k
{
c->page = NULL;
c->freelist = NULL;
+ c->offset = s->offset / sizeof(void *);
c->node = 0;
}
@@ -2113,14 +2072,7 @@ static int calculate_sizes(struct kmem_c
*/
s->objects = (PAGE_SIZE << s->order) / size;
- /*
- * Verify that the number of objects is within permitted limits.
- * The page->inuse field is only 16 bit wide! So we cannot have
- * more than 64k objects per slab.
- */
- if (!s->objects || s->objects > MAX_OBJECTS_PER_SLAB)
- return 0;
- return 1;
+ return !!s->objects;
}
--
* [patch 4/6] SLUB: Avoid touching page struct when freeing to per cpu slab
2007-08-23 6:46 [patch 0/6] Per cpu structures for SLUB Christoph Lameter
` (2 preceding siblings ...)
2007-08-23 6:46 ` [patch 3/6] SLUB: Move page->offset to kmem_cache_cpu->offset Christoph Lameter
@ 2007-08-23 6:46 ` Christoph Lameter
2007-08-23 6:46 ` [patch 5/6] SLUB: Place kmem_cache_cpu structures in a NUMA aware way Christoph Lameter
` (2 subsequent siblings)
6 siblings, 0 replies; 10+ messages in thread
From: Christoph Lameter @ 2007-08-23 6:46 UTC (permalink / raw)
To: akpm; +Cc: linux-kernel, linux-mm, Pekka Enberg
[-- Attachment #1: 0008-SLUB-Avoid-touching-page-struct-when-freeing-to-per.patch --]
[-- Type: text/plain, Size: 1334 bytes --]
Set c->node to -1 if we allocate from a debug slab, so that the free fast
path can check c->node instead of SlabDebug(page), which would require
accessing the page struct cacheline.
Signed-off-by: Christoph Lameter <clameter@sgi.com>
---
mm/slub.c | 6 +++---
1 file changed, 3 insertions(+), 3 deletions(-)
Index: linux-2.6.23-rc3-mm1/mm/slub.c
===================================================================
--- linux-2.6.23-rc3-mm1.orig/mm/slub.c 2007-08-22 17:20:28.000000000 -0700
+++ linux-2.6.23-rc3-mm1/mm/slub.c 2007-08-22 17:20:33.000000000 -0700
@@ -1517,6 +1517,7 @@ debug:
c->page->inuse++;
c->page->freelist = object[c->offset];
+ c->node = -1;
slab_unlock(c->page);
return object;
}
@@ -1540,8 +1541,7 @@ static void __always_inline *slab_alloc(
local_irq_save(flags);
c = get_cpu_slab(s, smp_processor_id());
- if (unlikely(!c->page || !c->freelist ||
- !node_match(c, node)))
+ if (unlikely(!c->freelist || !node_match(c, node)))
object = __slab_alloc(s, gfpflags, node, addr, c);
@@ -1650,7 +1650,7 @@ static void __always_inline slab_free(st
local_irq_save(flags);
debug_check_no_locks_freed(object, s->objsize);
c = get_cpu_slab(s, smp_processor_id());
- if (likely(page == c->page && !SlabDebug(page))) {
+ if (likely(page == c->page && c->node >= 0)) {
object[c->offset] = c->freelist;
c->freelist = object;
} else
--
* [patch 5/6] SLUB: Place kmem_cache_cpu structures in a NUMA aware way
2007-08-23 6:46 [patch 0/6] Per cpu structures for SLUB Christoph Lameter
` (3 preceding siblings ...)
2007-08-23 6:46 ` [patch 4/6] SLUB: Avoid touching page struct when freeing to per cpu slab Christoph Lameter
@ 2007-08-23 6:46 ` Christoph Lameter
2007-08-23 6:46 ` [patch 6/6] SLUB: Optimize cacheline use for zeroing Christoph Lameter
2007-08-24 21:38 ` [patch 0/6] Per cpu structures for SLUB Andrew Morton
6 siblings, 0 replies; 10+ messages in thread
From: Christoph Lameter @ 2007-08-23 6:46 UTC (permalink / raw)
To: akpm; +Cc: linux-kernel, linux-mm, Pekka Enberg
[-- Attachment #1: 0009-SLUB-Place-kmem_cache_cpu-structures-in-a-NUMA-awar.patch --]
[-- Type: text/plain, Size: 9199 bytes --]
The kmem_cache_cpu structures introduced so far are an array placed in the
kmem_cache struct, meaning the kmem_cache_cpu structures are overwhelmingly
on the wrong node for systems with a larger number of nodes. These are
performance critical structures since the per cpu information has to be
touched for every alloc and free in a slab.
In order to place the kmem_cache_cpu structure optimally we put an array
of pointers to kmem_cache_cpu structs in kmem_cache (similar to SLAB).
However, the kmem_cache_cpu structures can now be allocated in a more
intelligent way.
We would like to put the per cpu structures for the same cpu but different
slab caches together in cachelines to save space and decrease the cache
footprint. However, the slab allocator itself only controls allocations per
node. So we set up a simple per cpu array with 100 per cpu structures for
every processor, which is usually enough to get them all placed right. If we
run out, we fall back to kmalloc_node. This also solves the bootstrap problem
since we do not have to use slab allocator functions early in boot to get
memory for the small per cpu structures.
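In condensed form the scheme looks like this (a sketch; the complete code in
the patch below also handles the kmalloc_node fallback, cacheline alignment
and cpu hotplug):

	#define NR_KMEM_CACHE_CPU 100

	/* Statically reserved pool, placed on the right node by the
	 * per cpu allocator; free entries are linked through their
	 * freelist field. */
	static DEFINE_PER_CPU(struct kmem_cache_cpu,
				kmem_cache_cpu)[NR_KMEM_CACHE_CPU];
	static DEFINE_PER_CPU(struct kmem_cache_cpu *, kmem_cache_cpu_free);

	/* Pop one entry for this cpu; NULL means the pool is exhausted
	 * and we fall back to kmalloc_node(). */
	c = per_cpu(kmem_cache_cpu_free, cpu);
	if (c)
		per_cpu(kmem_cache_cpu_free, cpu) = (void *)c->freelist;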
Pro:
- NUMA aware placement improves memory performance
- All global structures in struct kmem_cache become readonly
- Dense packing of per cpu structures reduces cacheline
footprint in SMP and NUMA.
- Potential avoidance of exclusive cacheline fetches
on the free and alloc hotpath since multiple kmem_cache_cpu
structures are in one cacheline. This is particularly important
for the kmalloc array.
Cons:
- Additional reference to one read only cacheline (per cpu
array of pointers to kmem_cache_cpu) in both slab_alloc()
and slab_free().
Signed-off-by: Christoph Lameter <clameter@sgi.com>
---
include/linux/slub_def.h | 9 +-
mm/slub.c | 162 ++++++++++++++++++++++++++++++++++++++++++-----
2 files changed, 154 insertions(+), 17 deletions(-)
Index: linux-2.6.23-rc3-mm1/include/linux/slub_def.h
===================================================================
--- linux-2.6.23-rc3-mm1.orig/include/linux/slub_def.h 2007-08-22 17:23:29.000000000 -0700
+++ linux-2.6.23-rc3-mm1/include/linux/slub_def.h 2007-08-22 17:23:47.000000000 -0700
@@ -16,8 +16,7 @@ struct kmem_cache_cpu {
struct page *page;
int node;
unsigned int offset;
- /* Lots of wasted space */
-} ____cacheline_aligned_in_smp;
+};
struct kmem_cache_node {
spinlock_t list_lock; /* Protect partial list and nr_partial */
@@ -62,7 +61,11 @@ struct kmem_cache {
int defrag_ratio;
struct kmem_cache_node *node[MAX_NUMNODES];
#endif
- struct kmem_cache_cpu cpu_slab[NR_CPUS];
+#ifdef CONFIG_SMP
+ struct kmem_cache_cpu *cpu_slab[NR_CPUS];
+#else
+ struct kmem_cache_cpu cpu_slab;
+#endif
};
/*
Index: linux-2.6.23-rc3-mm1/mm/slub.c
===================================================================
--- linux-2.6.23-rc3-mm1.orig/mm/slub.c 2007-08-22 17:23:40.000000000 -0700
+++ linux-2.6.23-rc3-mm1/mm/slub.c 2007-08-22 17:23:47.000000000 -0700
@@ -276,7 +276,11 @@ static inline struct kmem_cache_node *ge
static inline struct kmem_cache_cpu *get_cpu_slab(struct kmem_cache *s, int cpu)
{
- return &s->cpu_slab[cpu];
+#ifdef CONFIG_SMP
+ return s->cpu_slab[cpu];
+#else
+ return &s->cpu_slab;
+#endif
}
static inline int check_valid_pointer(struct kmem_cache *s,
@@ -1843,16 +1847,6 @@ static void init_kmem_cache_cpu(struct k
c->node = 0;
}
-static inline int alloc_kmem_cache_cpus(struct kmem_cache *s, gfp_t flags)
-{
- int cpu;
-
- for_each_possible_cpu(cpu)
- init_kmem_cache_cpu(s, get_cpu_slab(s, cpu));
-
- return 1;
-}
-
static void init_kmem_cache_node(struct kmem_cache_node *n)
{
n->nr_partial = 0;
@@ -1864,6 +1858,125 @@ static void init_kmem_cache_node(struct
#endif
}
+#ifdef CONFIG_SMP
+/*
+ * Per cpu array for per cpu structures.
+ *
+ * The per cpu array places all kmem_cache_cpu structures from one processor
+ * close together meaning that it becomes possible that multiple per cpu
+ * structures are contained in one cacheline. This may be particularly
+ * beneficial for the kmalloc caches.
+ *
+ * A desktop system typically has around 60-80 slabs. With 100 here we are
+ * likely able to get per cpu structures for all caches from the array defined
+ * here. We must be able to cover all kmalloc caches during bootstrap.
+ *
+ * If the per cpu array is exhausted then fall back to kmalloc
+ * of individual cachelines. No sharing is possible then.
+ */
+#define NR_KMEM_CACHE_CPU 100
+
+static DEFINE_PER_CPU(struct kmem_cache_cpu,
+ kmem_cache_cpu)[NR_KMEM_CACHE_CPU];
+
+static DEFINE_PER_CPU(struct kmem_cache_cpu *, kmem_cache_cpu_free);
+
+static struct kmem_cache_cpu *alloc_kmem_cache_cpu(struct kmem_cache *s,
+ int cpu, gfp_t flags)
+{
+ struct kmem_cache_cpu *c = per_cpu(kmem_cache_cpu_free, cpu);
+
+ if (c)
+ per_cpu(kmem_cache_cpu_free, cpu) =
+ (void *)c->freelist;
+ else {
+ /* Table overflow: So allocate ourselves */
+ c = kmalloc_node(
+ ALIGN(sizeof(struct kmem_cache_cpu), cache_line_size()),
+ flags, cpu_to_node(cpu));
+ if (!c)
+ return NULL;
+ }
+
+ init_kmem_cache_cpu(s, c);
+ return c;
+}
+
+static void free_kmem_cache_cpu(struct kmem_cache_cpu *c, int cpu)
+{
+ if (c < per_cpu(kmem_cache_cpu, cpu) ||
+ c > per_cpu(kmem_cache_cpu, cpu) + NR_KMEM_CACHE_CPU) {
+ kfree(c);
+ return;
+ }
+ c->freelist = (void *)per_cpu(kmem_cache_cpu_free, cpu);
+ per_cpu(kmem_cache_cpu_free, cpu) = c;
+}
+
+static void free_kmem_cache_cpus(struct kmem_cache *s)
+{
+ int cpu;
+
+ for_each_online_cpu(cpu) {
+ struct kmem_cache_cpu *c = get_cpu_slab(s, cpu);
+
+ if (c) {
+ s->cpu_slab[cpu] = NULL;
+ free_kmem_cache_cpu(c, cpu);
+ }
+ }
+}
+
+static int alloc_kmem_cache_cpus(struct kmem_cache *s, gfp_t flags)
+{
+ int cpu;
+
+ for_each_online_cpu(cpu) {
+ struct kmem_cache_cpu *c = get_cpu_slab(s, cpu);
+
+ if (c)
+ continue;
+
+ c = alloc_kmem_cache_cpu(s, cpu, flags);
+ if (!c) {
+ free_kmem_cache_cpus(s);
+ return 0;
+ }
+ s->cpu_slab[cpu] = c;
+ }
+ return 1;
+}
+
+/*
+ * Initialize the per cpu array.
+ */
+static void init_alloc_cpu_cpu(int cpu)
+{
+ int i;
+
+ for (i = NR_KMEM_CACHE_CPU - 1; i >= 0; i--)
+ free_kmem_cache_cpu(&per_cpu(kmem_cache_cpu, cpu)[i], cpu);
+}
+
+static void __init init_alloc_cpu(void)
+{
+ int cpu;
+
+ for_each_online_cpu(cpu)
+ init_alloc_cpu_cpu(cpu);
+ }
+
+#else
+static inline void free_kmem_cache_cpus(struct kmem_cache *s) {}
+static inline void init_alloc_cpu(void) {}
+
+static inline int alloc_kmem_cache_cpus(struct kmem_cache *s, gfp_t flags)
+{
+ init_kmem_cache_cpu(s, &s->cpu_slab);
+ return 1;
+}
+#endif
+
#ifdef CONFIG_NUMA
/*
* No kmalloc_node yet so do it by hand. We know that this is the first
@@ -1871,7 +1984,8 @@ static void init_kmem_cache_node(struct
* possible.
*
* Note that this function only works on the kmalloc_node_cache
- * when allocating for the kmalloc_node_cache.
+ * when allocating for the kmalloc_node_cache. This is used for bootstrapping
+ * memory on a fresh node that has no slab structures yet.
*/
static struct kmem_cache_node *early_kmem_cache_node_alloc(gfp_t gfpflags,
int node)
@@ -2102,6 +2216,7 @@ static int kmem_cache_open(struct kmem_c
if (alloc_kmem_cache_cpus(s, gfpflags & ~SLUB_DMA))
return 1;
+ free_kmem_cache_nodes(s);
error:
if (flags & SLAB_PANIC)
panic("Cannot create slab %s size=%lu realsize=%u "
@@ -2184,6 +2299,7 @@ static inline int kmem_cache_close(struc
flush_all(s);
/* Attempt to free all objects */
+ free_kmem_cache_cpus(s);
for_each_node_state(node, N_NORMAL_MEMORY) {
struct kmem_cache_node *n = get_node(s, node);
@@ -2580,6 +2696,8 @@ void __init kmem_cache_init(void)
slub_min_objects = DEFAULT_ANTIFRAG_MIN_OBJECTS;
}
+ init_alloc_cpu();
+
#ifdef CONFIG_NUMA
/*
* Must first have the slab cache available for the allocations of the
@@ -2640,10 +2758,12 @@ void __init kmem_cache_init(void)
#ifdef CONFIG_SMP
register_cpu_notifier(&slab_notifier);
+ kmem_size = offsetof(struct kmem_cache, cpu_slab) +
+ nr_cpu_ids * sizeof(struct kmem_cache_cpu *);
+#else
+ kmem_size = sizeof(struct kmem_cache);
#endif
- kmem_size = offsetof(struct kmem_cache, cpu_slab) +
- nr_cpu_ids * sizeof(struct kmem_cache_cpu);
printk(KERN_INFO "SLUB: Genslabs=%d, HWalign=%d, Order=%d-%d, MinObjects=%d,"
" CPUs=%d, Nodes=%d\n",
@@ -2771,15 +2891,29 @@ static int __cpuinit slab_cpuup_callback
unsigned long flags;
switch (action) {
+ case CPU_UP_PREPARE:
+ case CPU_UP_PREPARE_FROZEN:
+ init_alloc_cpu_cpu(cpu);
+ down_read(&slub_lock);
+ list_for_each_entry(s, &slab_caches, list)
+ s->cpu_slab[cpu] = alloc_kmem_cache_cpu(s, cpu,
+ GFP_KERNEL);
+ up_read(&slub_lock);
+ break;
+
case CPU_UP_CANCELED:
case CPU_UP_CANCELED_FROZEN:
case CPU_DEAD:
case CPU_DEAD_FROZEN:
down_read(&slub_lock);
list_for_each_entry(s, &slab_caches, list) {
+ struct kmem_cache_cpu *c = get_cpu_slab(s, cpu);
+
local_irq_save(flags);
__flush_cpu_slab(s, cpu);
local_irq_restore(flags);
+ free_kmem_cache_cpu(c, cpu);
+ s->cpu_slab[cpu] = NULL;
}
up_read(&slub_lock);
break;
--
* [patch 6/6] SLUB: Optimize cacheline use for zeroing
2007-08-23 6:46 [patch 0/6] Per cpu structures for SLUB Christoph Lameter
` (4 preceding siblings ...)
2007-08-23 6:46 ` [patch 5/6] SLUB: Place kmem_cache_cpu structures in a NUMA aware way Christoph Lameter
@ 2007-08-23 6:46 ` Christoph Lameter
2007-08-24 21:38 ` [patch 0/6] Per cpu structures for SLUB Andrew Morton
6 siblings, 0 replies; 10+ messages in thread
From: Christoph Lameter @ 2007-08-23 6:46 UTC (permalink / raw)
To: akpm; +Cc: linux-kernel, linux-mm, Pekka Enberg
[-- Attachment #1: 0010-SLUB-Optimize-cacheline-use-for-zeroing.patch --]
[-- Type: text/plain, Size: 2592 bytes --]
For zeroing we touch a cacheline in the kmem_cache structure to get the
object size. However, the hot paths in slab_alloc and slab_free do not
reference any other fields in kmem_cache, so we may have to bring in that
cacheline just for this one access.
Add a new field to kmem_cache_cpu that contains the object size. That
cacheline is already in use in the hotpaths, so we save one cacheline on
every slab_alloc if we zero.
We need to update the kmem_cache_cpu object size if an aliasing operation
changes the objsize of a non-debug slab.
Signed-off-by: Christoph Lameter <clameter@sgi.com>
---
include/linux/slub_def.h | 1 +
mm/slub.c | 14 ++++++++++++--
2 files changed, 13 insertions(+), 2 deletions(-)
Index: linux-2.6.23-rc3-mm1/include/linux/slub_def.h
===================================================================
--- linux-2.6.23-rc3-mm1.orig/include/linux/slub_def.h 2007-08-22 17:23:47.000000000 -0700
+++ linux-2.6.23-rc3-mm1/include/linux/slub_def.h 2007-08-22 17:23:50.000000000 -0700
@@ -16,6 +16,7 @@ struct kmem_cache_cpu {
struct page *page;
int node;
unsigned int offset;
+ unsigned int objsize;
};
struct kmem_cache_node {
Index: linux-2.6.23-rc3-mm1/mm/slub.c
===================================================================
--- linux-2.6.23-rc3-mm1.orig/mm/slub.c 2007-08-22 17:23:47.000000000 -0700
+++ linux-2.6.23-rc3-mm1/mm/slub.c 2007-08-22 17:23:50.000000000 -0700
@@ -1556,7 +1556,7 @@ static void __always_inline *slab_alloc(
local_irq_restore(flags);
if (unlikely((gfpflags & __GFP_ZERO) && object))
- memset(object, 0, s->objsize);
+ memset(object, 0, c->objsize);
return object;
}
@@ -1843,8 +1843,9 @@ static void init_kmem_cache_cpu(struct k
{
c->page = NULL;
c->freelist = NULL;
- c->offset = s->offset / sizeof(void *);
c->node = 0;
+ c->offset = s->offset / sizeof(void *);
+ c->objsize = s->objsize;
}
static void init_kmem_cache_node(struct kmem_cache_node *n)
@@ -2842,12 +2843,21 @@ struct kmem_cache *kmem_cache_create(con
down_write(&slub_lock);
s = find_mergeable(size, align, flags, ctor);
if (s) {
+ int cpu;
+
s->refcount++;
/*
* Adjust the object sizes so that we clear
* the complete object on kzalloc.
*/
s->objsize = max(s->objsize, (int)size);
+
+ /*
+ * And then we need to update the object size in the
+ * per cpu structures
+ */
+ for_each_online_cpu(cpu)
+ get_cpu_slab(s, cpu)->objsize = s->objsize;
s->inuse = max_t(int, s->inuse, ALIGN(size, sizeof(void *)));
up_write(&slub_lock);
if (sysfs_slab_alias(s, name))
--
* Re: [patch 0/6] Per cpu structures for SLUB
2007-08-23 6:46 [patch 0/6] Per cpu structures for SLUB Christoph Lameter
` (5 preceding siblings ...)
2007-08-23 6:46 ` [patch 6/6] SLUB: Optimize cacheline use for zeroing Christoph Lameter
@ 2007-08-24 21:38 ` Andrew Morton
2007-08-27 18:50 ` Christoph Lameter
6 siblings, 1 reply; 10+ messages in thread
From: Andrew Morton @ 2007-08-24 21:38 UTC (permalink / raw)
To: Christoph Lameter; +Cc: linux-kernel, linux-mm, Pekka Enberg
On Wed, 22 Aug 2007 23:46:53 -0700
Christoph Lameter <clameter@sgi.com> wrote:
> The following patchset introduces per cpu structures for SLUB. These
> are very small (and multiples of these may fit into one cacheline)
> and (apart from performance improvements) allow the addressing of
> several isues in SLUB:
>
> 1. The number of objects per slab is no longer limited to a 16 bit
> number.
>
> 2. Room is freed up in the page struct. We can avoid using the
> mapping field which allows to get rid of the #ifdef CONFIG_SLUB
> in page_mapping().
>
> 3. We will have an easier time adding new things like Peter Z.s reserve
> management.
>
> The RFC for this patchset was discussed on lkml a while ago:
>
> http://marc.info/?l=linux-kernel&m=118386677704534&w=2
>
> (And no this patchset does not include the use of cmpxchg_local that
> we discussed recently on lkml nor the cmpxchg implementation
> mentioned in the RFC)
>
> Performance
> -----------
>
>
> Norm = 2.6.23-rc3
> PCPU = Adds page allocator pass through plus per cpu structure patches
>
>
> IA64 8p 4n NUMA Altix
>
>              Single threaded                Concurrent Alloc
>
>           Kmalloc      Alloc/Free        Kmalloc      Alloc/Free
>   Size   Norm   PCPU   Norm   PCPU      Norm   PCPU   Norm   PCPU
> -------------------------------------------------------------------
>      8    132     84     93    104        98     90     95    106
>     16     98     92     93    104       115     98     95    106
>     32    112    105     93    104       146    111     95    106
>     64    119    112     93    104       214    133     95    106
>    128    132    119     94    104       321    163     95    106
>   256+  83255    176    106    115       415    224    108    117
>    512    191    176    106    115       487    341    108    117
>   1024    252    246    106    115       937    609    108    117
>   2048    308    292    107    115      2494   1207    108    117
>   4096    341    319    107    115      2497   1217    108    117
>   8192    402    380    107    115      2367   1188    108    117
> 16384*    560    474    106    434      4464   1904    108    478
>
> X86_64 2p SMP (Dual Core Pentium 940)
>
>              Single threaded                Concurrent Alloc
>
>           Kmalloc      Alloc/Free        Kmalloc      Alloc/Free
>   Size   Norm   PCPU   Norm   PCPU      Norm   PCPU   Norm   PCPU
> --------------------------------------------------------------------
>      8    313    227    314    324       207    208    314    323
>     16    202    203    315    324       209    211    312    321
>     32    212    207    314    324       251    243    312    321
>     64    240    237    314    326       329    306    312    321
>    128    301    302    314    324       511    416    313    324
>    256    498    554    327    332       970    837    326    332
>    512    532    553    324    332      1025    932    326    335
>   1024    705    718    325    333      1489   1231    324    330
>   2048    764    767    324    334      2708   2175    324    332
>  4096*   1033    476    325    674      4727    782    324    678
I'm struggling a bit to understand these numbers. Bigger is better, I
assume? In what units are these numbers?
> Notes:
>
> Worst case:
> -----------
> We generally loose in the alloc free test (x86_64 3%, IA64 5-10%)
> since the processing overhead increases because we need to lookup
> the per cpu structure. Alloc/Free is simply kfree(kmalloc(size, mask)).
> So objects with the shortest lifetime possible. We would never use
> objects in that way but the measurement is important to show the worst
> case overhead created.
>
> Single Threaded:
> ----------------
> The single threaded kmalloc test shows behavior of a continual stream
> of allocation without contention. In the SMP case the losses are minimal.
> In the NUMA case we already have a winner there because the per cpu structure
> is placed local to the processor. So in the single threaded case we already
> win around 5% just by placing things better.
>
> Concurrent Alloc:
> -----------------
> We have varying gains up to a 50% on NUMA because we are now never updating
> a cacheline used by the other processor and the data structures are local
> to the processor.
>
> The SMP case shows gains but they are smaller (especially since
> this is the smallest SMP system possible.... 2 CPUs). So only up
> to 25%.
>
> Page allocator pass through
> ---------------------------
> There is a significant difference in the columns marked with a * because
> of the way that allocations for page sized objects are handled.
OK, but what happened to the third pair of columns (Concurrent Alloc,
Kmalloc) for 1024 and 2048-byte allocations? They seem to have become
significantly slower?
Thanks for running the numbers, but it's still a bit hard to work out
whether these changes are an aggregate benefit?
> If we handle
> the allocations in the slab allocator (Norm) then the alloc free tests
> results are superb since we can use the per cpu slab to just pass a pointer
> back and forth. The page allocator pass through (PCPU) shows that the page
> allocator may have problems with giving back the same page after a free.
> Or there something else in the page allocator that creates significant
> overhead compared to slab. Needs to be checked out I guess.
>
> However, the page allocator pass through is a win in the other cases
> since we can cut out the page allocator overhead. That is the more typical
> load of allocating a sequence of objects and we should optimize for that.
>
> (+ = Must be some cache artifact here or code crossing a TLB boundary.
> The result is reproducable)
>
Most Linux machines are uniprocessor. We should keep an eye on what effect
a change like this has on code size and performance for CONFIG_SMP=n
builds.
--
* Re: [patch 0/6] Per cpu structures for SLUB
2007-08-24 21:38 ` [patch 0/6] Per cpu structures for SLUB Andrew Morton
@ 2007-08-27 18:50 ` Christoph Lameter
2007-08-27 23:51 ` Andrew Morton
0 siblings, 1 reply; 10+ messages in thread
From: Christoph Lameter @ 2007-08-27 18:50 UTC (permalink / raw)
To: Andrew Morton; +Cc: linux-kernel, linux-mm, Pekka Enberg
On Fri, 24 Aug 2007, Andrew Morton wrote:
> I'm struggling a bit to understand these numbers. Bigger is better, I
> assume? In what units are these numbers?
No, less is better. These are cycle counts. Hmmm... We discussed these
cycle counts so much in the last week that I forgot to mention that.
> > Page allocator pass through
> > ---------------------------
> > There is a significant difference in the columns marked with a * because
> > of the way that allocations for page sized objects are handled.
>
> OK, but what happened to the third pair of columns (Concurrent Alloc,
> Kmalloc) for 1024 and 2048-byte allocations? They seem to have become
> significantly slower?
There is a significant performance increase there. That is the main point
of the patch.
> Thanks for running the numbers, but it's still a bit hard to work out
> whether these changes are an aggregate benefit?
There is a drawback because of the additional code introduced in the fast
path. However, the regular kmalloc case shows improvements throughout.
This is of particular importance for SMP systems. We see an improvement
even for 2 processors.
> > If we handle
> > the allocations in the slab allocator (Norm) then the alloc free tests
> > results are superb since we can use the per cpu slab to just pass a pointer
> > back and forth. The page allocator pass through (PCPU) shows that the page
> > allocator may have problems with giving back the same page after a free.
> > Or there something else in the page allocator that creates significant
> > overhead compared to slab. Needs to be checked out I guess.
> >
> > However, the page allocator pass through is a win in the other cases
> > since we can cut out the page allocator overhead. That is the more typical
> > load of allocating a sequence of objects and we should optimize for that.
> >
> > (+ = Must be some cache artifact here or code crossing a TLB boundary.
> > The result is reproducable)
> >
>
> Most Linux machines are uniprocessor. We should keep an eye on what effect
> a change like this has on code size and performance for CONFIG_SMP=n
> builds..
There is an #ifdef around the per cpu structure management code. All of
this will vanish (including the lookup of the per cpu address from the
fast path) if SMP is off.
--
* Re: [patch 0/6] Per cpu structures for SLUB
2007-08-27 18:50 ` Christoph Lameter
@ 2007-08-27 23:51 ` Andrew Morton
0 siblings, 0 replies; 10+ messages in thread
From: Andrew Morton @ 2007-08-27 23:51 UTC (permalink / raw)
To: Christoph Lameter; +Cc: linux-kernel, linux-mm, Pekka Enberg
On Mon, 27 Aug 2007 11:50:10 -0700 (PDT)
Christoph Lameter <clameter@sgi.com> wrote:
> On Fri, 24 Aug 2007, Andrew Morton wrote:
>
> > I'm struggling a bit to understand these numbers. Bigger is better, I
> > assume? In what units are these numbers?
>
> No less is better. These are cycle counts. Hmmm... We discussed these
> cycle counts so much in the last week that I forgot to mention that.
>
> > > Page allocator pass through
> > > ---------------------------
> > > There is a significant difference in the columns marked with a * because
> > > of the way that allocations for page sized objects are handled.
> >
> > OK, but what happened to the third pair of columns (Concurrent Alloc,
> > Kmalloc) for 1024 and 2048-byte allocations? They seem to have become
> > significantly slower?
>
> There is a significant performance increase there. That is the main point
> of the patch.
>
> > Thanks for running the numbers, but it's still a bit hard to work out
> > whether these changes are an aggregate benefit?
>
> There is a drawback because of the additional code introduced in the fast
> path. However, the regular kmalloc case shows improvements throughout.
> This is in particular of importance for SMP systems. We see an improvement
> even for 2 processors.
umm, OK. When you have time, could you please whizz up a clearer
changelog for this one?
--