* [PATCH 1/3] slub: set a criteria for slub node partial adding
@ 2011-12-02 8:23 Alex Shi
From: Alex Shi @ 2011-12-02 8:23 UTC (permalink / raw)
To: cl, penberg; +Cc: linux-kernel, linux-mm
From: Alex Shi <alexs@intel.com>
Several performance regressions were traced to whether slub adds a slab
to the head or the tail of the node partial list. That inspired me to
tune node partial adding and set a criterion for choosing the head or
tail position when a partial slab is added.
My experiments show that adding a slab to the head when its used objects
are less than 1/4 of its total objects gives about a 1.5% improvement on
netperf loopback testing with 2048 clients, on both our 4-socket and
2-socket platforms, including Sandy Bridge and Core2.
Signed-off-by: Alex Shi <alex.shi@intel.com>
---
mm/slub.c | 18 ++++++++----------
1 files changed, 8 insertions(+), 10 deletions(-)
diff --git a/mm/slub.c b/mm/slub.c
index ed3334d..c419e80 100644
--- a/mm/slub.c
+++ b/mm/slub.c
@@ -1461,14 +1461,13 @@ static void discard_slab(struct kmem_cache *s, struct page *page)
*
* list_lock must be held.
*/
-static inline void add_partial(struct kmem_cache_node *n,
- struct page *page, int tail)
+static inline void add_partial(struct kmem_cache_node *n, struct page *page)
{
n->nr_partial++;
- if (tail == DEACTIVATE_TO_TAIL)
- list_add_tail(&page->lru, &n->partial);
- else
+ if (page->inuse <= page->objects / 4)
list_add(&page->lru, &n->partial);
+ else
+ list_add_tail(&page->lru, &n->partial);
}
/*
@@ -1829,7 +1828,7 @@ redo:
if (m == M_PARTIAL) {
- add_partial(n, page, tail);
+ add_partial(n, page);
stat(s, tail);
} else if (m == M_FULL) {
@@ -1904,8 +1903,7 @@ static void unfreeze_partials(struct kmem_cache *s)
if (l == M_PARTIAL)
remove_partial(n, page);
else
- add_partial(n, page,
- DEACTIVATE_TO_TAIL);
+ add_partial(n, page);
l = m;
}
@@ -2476,7 +2474,7 @@ static void __slab_free(struct kmem_cache *s, struct page *page,
*/
if (unlikely(!prior)) {
remove_full(s, page);
- add_partial(n, page, DEACTIVATE_TO_TAIL);
+ add_partial(n, page);
stat(s, FREE_ADD_PARTIAL);
}
}
@@ -2793,7 +2791,7 @@ static void early_kmem_cache_node_alloc(int node)
init_kmem_cache_node(n, kmem_cache_node);
inc_slabs_node(kmem_cache_node, node, page->objects);
- add_partial(n, page, DEACTIVATE_TO_HEAD);
+ add_partial(n, page);
}
static void free_kmem_cache_nodes(struct kmem_cache *s)
--
1.7.0.1
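Read in isolation, the policy the add_partial() hunk above hard-codes is a
simple predicate on how full the slab is. The following restatement is
illustrative only; it is not part of the patch, the helper name is made up,
and the fields are the 3.2-era struct page fields used in the hunk:

	/*
	 * Illustrative only, not part of the patch: the head/tail choice
	 * made inside add_partial() above, factored out as a hypothetical
	 * helper.
	 */
	static inline bool slab_mostly_free(struct page *page)
	{
		/* at most a quarter of the objects are in use */
		return page->inuse <= page->objects / 4;
	}

When this predicate is true the slab is queued with list_add() (head of the
node partial list), otherwise with list_add_tail(), so allocations tend to
pick up slabs that still have plenty of free objects first.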
* [PATCH 2/3] slub: remove unnecessary statistics, deactivate_to_head/tail
From: Alex Shi @ 2011-12-02 8:23 UTC (permalink / raw)
To: cl, penberg; +Cc: linux-kernel, linux-mm
From: Alex Shi <alexs@intel.com>
Since the head or tail position is now decided automatically in
add_partial(), these statistics are no longer needed.
Signed-off-by: Alex Shi <alex.shi@intel.com>
---
include/linux/slub_def.h | 2 --
mm/slub.c | 11 ++---------
2 files changed, 2 insertions(+), 11 deletions(-)
diff --git a/include/linux/slub_def.h b/include/linux/slub_def.h
index a32bcfd..509841a 100644
--- a/include/linux/slub_def.h
+++ b/include/linux/slub_def.h
@@ -29,8 +29,6 @@ enum stat_item {
CPUSLAB_FLUSH, /* Abandoning of the cpu slab */
DEACTIVATE_FULL, /* Cpu slab was full when deactivated */
DEACTIVATE_EMPTY, /* Cpu slab was empty when deactivated */
- DEACTIVATE_TO_HEAD, /* Cpu slab was moved to the head of partials */
- DEACTIVATE_TO_TAIL, /* Cpu slab was moved to the tail of partials */
DEACTIVATE_REMOTE_FREES,/* Slab contained remotely freed objects */
DEACTIVATE_BYPASS, /* Implicit deactivation */
ORDER_FALLBACK, /* Number of times fallback was necessary */
diff --git a/mm/slub.c b/mm/slub.c
index c419e80..65d901f 100644
--- a/mm/slub.c
+++ b/mm/slub.c
@@ -1717,13 +1717,11 @@ static void deactivate_slab(struct kmem_cache *s, struct kmem_cache_cpu *c)
enum slab_modes l = M_NONE, m = M_NONE;
void *freelist;
void *nextfree;
- int tail = DEACTIVATE_TO_HEAD;
struct page new;
struct page old;
if (page->freelist) {
stat(s, DEACTIVATE_REMOTE_FREES);
- tail = DEACTIVATE_TO_TAIL;
}
c->tid = next_tid(c->tid);
@@ -1826,12 +1824,11 @@ redo:
remove_full(s, page);
- if (m == M_PARTIAL) {
+ if (m == M_PARTIAL)
add_partial(n, page);
- stat(s, tail);
- } else if (m == M_FULL) {
+ else if (m == M_FULL) {
stat(s, DEACTIVATE_FULL);
add_full(s, n, page);
@@ -5023,8 +5020,6 @@ STAT_ATTR(FREE_SLAB, free_slab);
STAT_ATTR(CPUSLAB_FLUSH, cpuslab_flush);
STAT_ATTR(DEACTIVATE_FULL, deactivate_full);
STAT_ATTR(DEACTIVATE_EMPTY, deactivate_empty);
-STAT_ATTR(DEACTIVATE_TO_HEAD, deactivate_to_head);
-STAT_ATTR(DEACTIVATE_TO_TAIL, deactivate_to_tail);
STAT_ATTR(DEACTIVATE_REMOTE_FREES, deactivate_remote_frees);
STAT_ATTR(DEACTIVATE_BYPASS, deactivate_bypass);
STAT_ATTR(ORDER_FALLBACK, order_fallback);
@@ -5088,8 +5083,6 @@ static struct attribute *slab_attrs[] = {
&cpuslab_flush_attr.attr,
&deactivate_full_attr.attr,
&deactivate_empty_attr.attr,
- &deactivate_to_head_attr.attr,
- &deactivate_to_tail_attr.attr,
&deactivate_remote_frees_attr.attr,
&deactivate_bypass_attr.attr,
&order_fallback_attr.attr,
--
1.7.0.1
* [PATCH 3/3] slub: fill per cpu partial only when free objects larger than one quarter
From: Alex Shi @ 2011-12-02 8:23 UTC (permalink / raw)
To: cl, penberg; +Cc: linux-kernel, linux-mm
From: Alex Shi <alexs@intel.com>
Set a selection criterion for filling the per cpu partial list in the
slow allocation path, and check how much room is left on the PCP before
filling it, even though that reads data that may belong to another CPU.
The patch brings another 1.5% performance increase on netperf loopback
testing on our 4- and 2-socket machines, including Sandy Bridge and Core2.
Signed-off-by: Alex Shi <alex.shi@intel.com>
---
mm/slub.c | 43 +++++++++++++++++++++++++++++++------------
1 files changed, 31 insertions(+), 12 deletions(-)
diff --git a/mm/slub.c b/mm/slub.c
index 65d901f..72df387 100644
--- a/mm/slub.c
+++ b/mm/slub.c
@@ -1542,25 +1542,44 @@ static void *get_partial_node(struct kmem_cache *s,
spin_lock(&n->list_lock);
list_for_each_entry_safe(page, page2, &n->partial, lru) {
- void *t = acquire_slab(s, n, page, object == NULL);
int available;
+ void *t;
+ struct page *oldpage;
+ int pobjects;
- if (!t)
- break;
if (!object) {
- c->page = page;
- c->node = page_to_nid(page);
- stat(s, ALLOC_FROM_PARTIAL);
- object = t;
- available = page->objects - page->inuse;
+ t = acquire_slab(s, n, page, object == NULL);
+ if (!t)
+ break;
+ else {
+ c->page = page;
+ c->node = page_to_nid(page);
+ stat(s, ALLOC_FROM_PARTIAL);
+ object = t;
+ }
} else {
- page->freelist = t;
- available = put_cpu_partial(s, page, 0);
+ oldpage = this_cpu_read(s->cpu_slab->partial);
+ pobjects = oldpage ? oldpage->pobjects : 0;
+
+ if (pobjects > s->cpu_partial / 2)
+ break;
+
+ available = page->objects - page->inuse;
+ if (available >= s->cpu_partial / 4) {
+ t = acquire_slab(s, n, page, object == NULL);
+ if (!t)
+ break;
+ else {
+ page->freelist = t;
+ if (put_cpu_partial(s, page, 0) >
+ s->cpu_partial / 2)
+ break;
+ }
+ }
}
- if (kmem_cache_debug(s) || available > s->cpu_partial / 2)
+ if (kmem_cache_debug(s))
break;
-
}
spin_unlock(&n->list_lock);
return object;
--
1.7.0.1
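For readers skimming the hunk above, here is a condensed, illustrative
restatement of the refill policy it adds to get_partial_node(). It is not
the literal patch text; names are as used in the hunk. Refilling stops once
the per-cpu partial list holds more than half of s->cpu_partial spare
objects, and a node partial slab is only frozen onto the PCP if it
contributes at least a quarter of s->cpu_partial free objects:

	/* illustrative condensation of the patch 3/3 policy, not the hunk */
	oldpage  = this_cpu_read(s->cpu_slab->partial);
	pobjects = oldpage ? oldpage->pobjects : 0;

	if (pobjects > s->cpu_partial / 2)
		break;			/* PCP already holds enough spare objects */

	available = page->objects - page->inuse;
	if (available >= s->cpu_partial / 4) {	/* worth freezing this slab */
		t = acquire_slab(s, n, page, object == NULL);
		if (!t)
			break;
		page->freelist = t;
		if (put_cpu_partial(s, page, 0) > s->cpu_partial / 2)
			break;		/* this refill filled the PCP past half */
	}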
* Re: [PATCH 2/3] slub: remove unnecessary statistics, deactivate_to_head/tail
From: Christoph Lameter @ 2011-12-02 14:44 UTC (permalink / raw)
To: Alex Shi; +Cc: penberg, linux-kernel, linux-mm

On Fri, 2 Dec 2011, Alex Shi wrote:

> Since the head or tail position is now decided automatically in
> add_partial(), these statistics are no longer needed.

You need to update tools/slub/slabinfo.c as well.
* Re: [PATCH 2/3] slub: remove unnecessary statistics, deactivate_to_head/tail
From: David Rientjes @ 2011-12-06 21:08 UTC (permalink / raw)
To: Alex Shi; +Cc: cl, penberg, linux-kernel, linux-mm

On Fri, 2 Dec 2011, Alex Shi wrote:

> From: Alex Shi <alexs@intel.com>
>
> Since the head or tail position is now decided automatically in
> add_partial(), these statistics are no longer needed.

Umm, we shouldn't need to remove these statistics at all: if there is
logic in add_partial() to determine whether to add it to the head or
tail, the stats can still be incremented there appropriately.

It would actually be helpful to cite those stats for your netperf
benchmarking when determining whether patches should be merged or not.
* Re: [PATCH 1/3] slub: set a criteria for slub node partial adding
From: Eric Dumazet @ 2011-12-02 11:36 UTC (permalink / raw)
To: Alex Shi; +Cc: cl, penberg, linux-kernel, linux-mm

On Friday, 2 December 2011 at 16:23 +0800, Alex Shi wrote:

> From: Alex Shi <alexs@intel.com>
>
> Several performance regressions were traced to whether slub adds a slab
> to the head or the tail of the node partial list. That inspired me to
> tune node partial adding and set a criterion for choosing the head or
> tail position when a partial slab is added.
> My experiments show that adding a slab to the head when its used objects
> are less than 1/4 of its total objects gives about a 1.5% improvement on
> netperf loopback testing with 2048 clients, on both our 4-socket and
> 2-socket platforms, including Sandy Bridge and Core2.
>
> Signed-off-by: Alex Shi <alex.shi@intel.com>
> ---
> mm/slub.c | 18 ++++++++----------
> 1 files changed, 8 insertions(+), 10 deletions(-)
>

netperf (loopback or ethernet) is a known stress test for slub, and your
patch removes code that might hurt netperf, but benefit real workload.

Have you tried instead this far less intrusive solution ?

	if (tail == DEACTIVATE_TO_TAIL ||
	    page->inuse > page->objects / 4)
		list_add_tail(&page->lru, &n->partial);
	else
		list_add(&page->lru, &n->partial);
* Re: [PATCH 1/3] slub: set a criteria for slub node partial adding
From: Christoph Lameter @ 2011-12-02 20:02 UTC (permalink / raw)
To: Eric Dumazet; +Cc: Alex Shi, penberg, linux-kernel, linux-mm

On Fri, 2 Dec 2011, Eric Dumazet wrote:

> netperf (loopback or ethernet) is a known stress test for slub, and your
> patch removes code that might hurt netperf, but benefit real workload.
>
> Have you tried instead this far less intrusive solution ?
>
> 	if (tail == DEACTIVATE_TO_TAIL ||
> 	    page->inuse > page->objects / 4)
> 		list_add_tail(&page->lru, &n->partial);
> 	else
> 		list_add(&page->lru, &n->partial);

One could also move this logic to reside outside of the call to
add_partial(). This is called mostly from __slab_free() so the logic could
be put in there.
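An untested sketch of what Christoph's suggestion might look like, assuming
add_partial() keeps its tail argument as before patch 1/3 and the caller
(__slab_free(), as in the hunk of patch 1/3) picks head or tail from the
slab's fill level. This was not posted in the thread; it only illustrates
the idea:

	/*
	 * Untested sketch: decide head vs tail in the caller instead of
	 * inside add_partial(), using the same 1/4 threshold.
	 */
	if (unlikely(!prior)) {
		int tail = (page->inuse <= page->objects / 4) ?
				DEACTIVATE_TO_HEAD : DEACTIVATE_TO_TAIL;

		remove_full(s, page);
		add_partial(n, page, tail);
		stat(s, FREE_ADD_PARTIAL);
	}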
* Re: [PATCH 1/3] slub: set a criteria for slub node partial adding
From: Shaohua Li @ 2011-12-05 2:21 UTC (permalink / raw)
To: Christoph Lameter
Cc: Eric Dumazet, Shi, Alex, penberg@kernel.org, linux-kernel@vger.kernel.org, linux-mm@kvack.org

On Sat, 2011-12-03 at 04:02 +0800, Christoph Lameter wrote:
> On Fri, 2 Dec 2011, Eric Dumazet wrote:
>
> > netperf (loopback or ethernet) is a known stress test for slub, and your
> > patch removes code that might hurt netperf, but benefit real workload.
> >
> > Have you tried instead this far less intrusive solution ?
> >
> > 	if (tail == DEACTIVATE_TO_TAIL ||
> > 	    page->inuse > page->objects / 4)
> > 		list_add_tail(&page->lru, &n->partial);
> > 	else
> > 		list_add(&page->lru, &n->partial);
>
> One could also move this logic to reside outside of the call to
> add_partial(). This is called mostly from __slab_free() so the logic could
> be put in there.

I'm wondering where the improvement comes from. The newly added partial
page almost always has few free objects (the inuse < objects/4 case isn't
common, I thought); that's why we add it to the list tail.
* Re: [PATCH 1/3] slub: set a criteria for slub node partial adding
From: Alex,Shi @ 2011-12-05 10:01 UTC (permalink / raw)
To: Christoph Lameter
Cc: Eric Dumazet, penberg@kernel.org, linux-kernel@vger.kernel.org, linux-mm@kvack.org

On Sat, 2011-12-03 at 04:02 +0800, Christoph Lameter wrote:
> On Fri, 2 Dec 2011, Eric Dumazet wrote:
>
> > netperf (loopback or ethernet) is a known stress test for slub, and your
> > patch removes code that might hurt netperf, but benefit real workload.
> >
> > Have you tried instead this far less intrusive solution ?
> >
> > 	if (tail == DEACTIVATE_TO_TAIL ||
> > 	    page->inuse > page->objects / 4)
> > 		list_add_tail(&page->lru, &n->partial);
> > 	else
> > 		list_add(&page->lru, &n->partial);
>
> One could also move this logic to reside outside of the call to
> add_partial(). This is called mostly from __slab_free() so the logic could
> be put in there.

After the pcp addition, add_partial() is only used via put_cpu_partial()
-> unfreeze_partials() when debugging is not set. If we need to make a
change, I guess that function is the better place.

BTW, I collected some data with my PCP statistics patch. I will be very
glad if you like it.

[alexs@lkp-ne04 ~]$ sudo grep . /sys/kernel/slab/kmalloc-256/*
/sys/kernel/slab/kmalloc-256/alloc_from_partial:4955645
/sys/kernel/slab/kmalloc-256/alloc_from_pcp:6753981
...
/sys/kernel/slab/kmalloc-256/pcp_from_free:11743977
/sys/kernel/slab/kmalloc-256/pcp_from_node:5948883
...
/sys/kernel/slab/kmalloc-256/unfreeze_pcp:834262

--------------
From aa754e20b81cb9f5ab63800a084858d25c18db31 Mon Sep 17 00:00:00 2001
From: Alex shi <alex.shi@intel.com>
Date: Tue, 6 Dec 2011 01:49:16 +0800
Subject: [PATCH] slub: per cpu partial statistics collection

PCP statistics are currently not collected in detail. Add and change
some stat items for this.

changed:
cpu_partial_alloc --> alloc_from_pcp,
cpu_partial_free --> pcp_from_free, /* pcp refilled from slab free */
added:
pcp_from_node, /* pcp refilled from node partial */
unfreeze_pcp, /* unfreeze pcp */

Signed-off-by: Alex Shi <alex.shi@intel.com>
---
include/linux/slub_def.h | 8 +++++---
mm/slub.c | 22 ++++++++++++++--------
tools/slub/slabinfo.c | 12 ++++++------
3 files changed, 25 insertions(+), 17 deletions(-)
diff --git a/include/linux/slub_def.h b/include/linux/slub_def.h
index a32bcfd..1c2669b 100644
--- a/include/linux/slub_def.h
+++ b/include/linux/slub_def.h
@@ -21,7 +21,7 @@ enum stat_item {
FREE_FROZEN, /* Freeing to frozen slab */
FREE_ADD_PARTIAL, /* Freeing moves slab to partial list */
FREE_REMOVE_PARTIAL, /* Freeing removes last object */
- ALLOC_FROM_PARTIAL, /* Cpu slab acquired from partial list */
+ ALLOC_FROM_PARTIAL, /* Cpu slab acquired from node partial list */
ALLOC_SLAB, /* Cpu slab acquired from page allocator */
ALLOC_REFILL, /* Refill cpu slab from slab freelist */
ALLOC_NODE_MISMATCH, /* Switching cpu slab */
@@ -36,8 +36,10 @@ enum stat_item {
ORDER_FALLBACK, /* Number of times fallback was necessary */
CMPXCHG_DOUBLE_CPU_FAIL,/* Failure of this_cpu_cmpxchg_double */
CMPXCHG_DOUBLE_FAIL, /* Number of times that cmpxchg double did not match */
- CPU_PARTIAL_ALLOC, /* Used cpu partial on alloc */
- CPU_PARTIAL_FREE, /* USed cpu partial on free */
+ ALLOC_FROM_PCP, /* Used cpu partial on alloc */
+ PCP_FROM_FREE, /* Fill cpu partial from free */
+ PCP_FROM_NODE, /* Fill cpu partial from node partial */
+ UNFREEZE_PCP, /* Unfreeze per cpu partial */
NR_SLUB_STAT_ITEMS };
struct kmem_cache_cpu {
diff --git a/mm/slub.c b/mm/slub.c
index ed3334d..5843846 100644
--- a/mm/slub.c
+++ b/mm/slub.c
@@ -1558,6 +1558,7 @@ static void *get_partial_node(struct kmem_cache *s,
} else {
page->freelist = t;
available = put_cpu_partial(s, page, 0);
+ stat(s, PCP_FROM_NODE);
}
if (kmem_cache_debug(s) || available > s->cpu_partial / 2)
break;
@@ -1968,6 +1969,7 @@ int put_cpu_partial(struct kmem_cache *s, struct page *page, int drain)
local_irq_restore(flags);
pobjects = 0;
pages = 0;
+ stat(s, UNFREEZE_PCP);
}
}
@@ -1979,7 +1981,6 @@ int put_cpu_partial(struct kmem_cache *s, struct page *page, int drain)
page->next = oldpage;
} while (irqsafe_cpu_cmpxchg(s->cpu_slab->partial, oldpage, page) != oldpage);
- stat(s, CPU_PARTIAL_FREE);
return pobjects;
}
@@ -2212,7 +2213,7 @@ new_slab:
c->page = c->partial;
c->partial = c->page->next;
c->node = page_to_nid(c->page);
- stat(s, CPU_PARTIAL_ALLOC);
+ stat(s, ALLOC_FROM_PCP);
c->freelist = NULL;
goto redo;
}
@@ -2448,9 +2449,10 @@ static void __slab_free(struct kmem_cache *s, struct page *page,
* If we just froze the page then put it onto the
* per cpu partial list.
*/
- if (new.frozen && !was_frozen)
+ if (new.frozen && !was_frozen) {
put_cpu_partial(s, page, 1);
-
+ stat(s, PCP_FROM_FREE);
+ }
/*
* The list lock was not taken therefore no list
* activity can be necessary.
@@ -5032,8 +5034,10 @@ STAT_ATTR(DEACTIVATE_BYPASS, deactivate_bypass);
STAT_ATTR(ORDER_FALLBACK, order_fallback);
STAT_ATTR(CMPXCHG_DOUBLE_CPU_FAIL, cmpxchg_double_cpu_fail);
STAT_ATTR(CMPXCHG_DOUBLE_FAIL, cmpxchg_double_fail);
-STAT_ATTR(CPU_PARTIAL_ALLOC, cpu_partial_alloc);
-STAT_ATTR(CPU_PARTIAL_FREE, cpu_partial_free);
+STAT_ATTR(ALLOC_FROM_PCP, alloc_from_pcp);
+STAT_ATTR(PCP_FROM_FREE, pcp_from_free);
+STAT_ATTR(PCP_FROM_NODE, pcp_from_node);
+STAT_ATTR(UNFREEZE_PCP, unfreeze_pcp);
#endif
static struct attribute *slab_attrs[] = {
@@ -5097,8 +5101,10 @@ static struct attribute *slab_attrs[] = {
&order_fallback_attr.attr,
&cmpxchg_double_fail_attr.attr,
&cmpxchg_double_cpu_fail_attr.attr,
- &cpu_partial_alloc_attr.attr,
- &cpu_partial_free_attr.attr,
+ &alloc_from_pcp_attr.attr,
+ &pcp_from_free_attr.attr,
+ &pcp_from_node_attr.attr,
+ &unfreeze_pcp_attr.attr,
#endif
#ifdef CONFIG_FAILSLAB
&failslab_attr.attr,
diff --git a/tools/slub/slabinfo.c b/tools/slub/slabinfo.c
index 164cbcf..d8f67f0 100644
--- a/tools/slub/slabinfo.c
+++ b/tools/slub/slabinfo.c
@@ -42,7 +42,7 @@ struct slabinfo {
unsigned long deactivate_remote_frees, order_fallback;
unsigned long cmpxchg_double_cpu_fail, cmpxchg_double_fail;
unsigned long alloc_node_mismatch, deactivate_bypass;
- unsigned long cpu_partial_alloc, cpu_partial_free;
+ unsigned long alloc_from_pcp, pcp_from_free;
int numa[MAX_NODES];
int numa_partial[MAX_NODES];
} slabinfo[MAX_SLABS];
@@ -457,9 +457,9 @@ static void slab_stats(struct slabinfo *s)
s->free_remove_partial * 100 / total_free);
printf("Cpu partial list %8lu %8lu %3lu %3lu\n",
- s->cpu_partial_alloc, s->cpu_partial_free,
- s->cpu_partial_alloc * 100 / total_alloc,
- s->cpu_partial_free * 100 / total_free);
+ s->alloc_from_pcp, s->pcp_from_free,
+ s->alloc_from_pcp * 100 / total_alloc,
+ s->pcp_from_free * 100 / total_free);
printf("RemoteObj/SlabFrozen %8lu %8lu %3lu %3lu\n",
s->deactivate_remote_frees, s->free_frozen,
@@ -1215,8 +1215,8 @@ static void read_slab_dir(void)
slab->order_fallback = get_obj("order_fallback");
slab->cmpxchg_double_cpu_fail = get_obj("cmpxchg_double_cpu_fail");
slab->cmpxchg_double_fail = get_obj("cmpxchg_double_fail");
- slab->cpu_partial_alloc = get_obj("cpu_partial_alloc");
- slab->cpu_partial_free = get_obj("cpu_partial_free");
+ slab->alloc_from_pcp = get_obj("alloc_from_pcp");
+ slab->pcp_from_free = get_obj("pcp_from_free");
slab->alloc_node_mismatch = get_obj("alloc_node_mismatch");
slab->deactivate_bypass = get_obj("deactivate_bypass");
chdir("..");
--
1.7.0.1
* Re: [PATCH 1/3] slub: set a criteria for slub node partial adding
From: Alex,Shi @ 2011-12-05 3:28 UTC (permalink / raw)
To: Eric Dumazet
Cc: cl@linux.com, penberg@kernel.org, linux-kernel@vger.kernel.org, linux-mm@kvack.org

On Fri, 2011-12-02 at 19:36 +0800, Eric Dumazet wrote:
> On Friday, 2 December 2011 at 16:23 +0800, Alex Shi wrote:
> > From: Alex Shi <alexs@intel.com>
> >
> > Several performance regressions were traced to whether slub adds a slab
> > to the head or the tail of the node partial list. That inspired me to
> > tune node partial adding and set a criterion for choosing the head or
> > tail position when a partial slab is added.
> > My experiments show that adding a slab to the head when its used objects
> > are less than 1/4 of its total objects gives about a 1.5% improvement on
> > netperf loopback testing with 2048 clients, on both our 4-socket and
> > 2-socket platforms, including Sandy Bridge and Core2.
> >
> > Signed-off-by: Alex Shi <alex.shi@intel.com>
> > ---
> > mm/slub.c | 18 ++++++++----------
> > 1 files changed, 8 insertions(+), 10 deletions(-)
> >
>
> netperf (loopback or ethernet) is a known stress test for slub, and your
> patch removes code that might hurt netperf, but benefit real workload.
>
> Have you tried instead this far less intrusive solution ?
>
> 	if (tail == DEACTIVATE_TO_TAIL ||
> 	    page->inuse > page->objects / 4)
> 		list_add_tail(&page->lru, &n->partial);
> 	else
> 		list_add(&page->lru, &n->partial);

For loopback netperf, it shows no clear performance change on any
platform. For hackbench testing, it is a bit worse (0.5~1%) on the 2P
NHM, but it helps by about 2% on the 4P (8 cores * 2 SMT) NHM machine.

I thought there would not be much hot/cold cache effect after the per
cpu partial adding, but it seems that for hackbench the node partial
list still matters a lot.
* Re: [PATCH 1/3] slub: set a criteria for slub node partial adding
From: Christoph Lameter @ 2011-12-02 14:43 UTC (permalink / raw)
To: Alex Shi; +Cc: penberg, linux-kernel, linux-mm

On Fri, 2 Dec 2011, Alex Shi wrote:

> From: Alex Shi <alexs@intel.com>
>
> Several performance regressions were traced to whether slub adds a slab
> to the head or the tail of the node partial list. That inspired me to
> tune node partial adding and set a criterion for choosing the head or
> tail position when a partial slab is added.
> My experiments show that adding a slab to the head when its used objects
> are less than 1/4 of its total objects gives about a 1.5% improvement on
> netperf loopback testing with 2048 clients, on both our 4-socket and
> 2-socket platforms, including Sandy Bridge and Core2.

The number of free objects in a slab may have nothing to do with cache
hotness of all objects in the slab. You can only be sure that one object
(the one that was freed) is cache hot. Netperf may use them in sequence
and therefore you are likely to get a series of frees on the same slab
page. How are other benchmarks affected by this change?
* Re: [PATCH 1/3] slub: set a criteria for slub node partial adding
From: Alex,Shi @ 2011-12-05 9:22 UTC (permalink / raw)
To: Christoph Lameter
Cc: penberg@kernel.org, linux-kernel@vger.kernel.org, linux-mm@kvack.org, Andi Kleen

On Fri, 2011-12-02 at 22:43 +0800, Christoph Lameter wrote:
> On Fri, 2 Dec 2011, Alex Shi wrote:
>
> > From: Alex Shi <alexs@intel.com>
> >
> > Several performance regressions were traced to whether slub adds a slab
> > to the head or the tail of the node partial list. That inspired me to
> > tune node partial adding and set a criterion for choosing the head or
> > tail position when a partial slab is added.
> > My experiments show that adding a slab to the head when its used objects
> > are less than 1/4 of its total objects gives about a 1.5% improvement on
> > netperf loopback testing with 2048 clients, on both our 4-socket and
> > 2-socket platforms, including Sandy Bridge and Core2.
>
> The number of free objects in a slab may have nothing to do with cache
> hotness of all objects in the slab. You can only be sure that one object
> (the one that was freed) is cache hot. Netperf may use them in sequence
> and therefore you are likely to get a series of frees on the same slab
> page. How are other benchmarks affected by this change?

The previous testing was based on 3.2-rc1; it showed no clear change in
hackbench performance, while netperf got some benefit. But it seems that
after the irqsafe_cpu_cmpxchg patch the results have changed; I am
collecting these results.

As to the cache-hot benefit, my understanding is that if the same object
is reused, its contents will be refilled from memory anyway, but a CPU
cache line replacement is saved. On the other hand, considering the lock
contention on node->list_lock, as explained in commit 130655ef0979, more
free objects will reduce contention on this lock. It is tricky to
balance the two. :(
* Re: [PATCH 1/3] slub: set a criteria for slub node partial adding
From: David Rientjes @ 2011-12-06 21:06 UTC (permalink / raw)
To: Alex,Shi
Cc: Christoph Lameter, penberg@kernel.org, linux-kernel@vger.kernel.org, linux-mm@kvack.org, Andi Kleen

On Mon, 5 Dec 2011, Alex,Shi wrote:

> The previous testing was based on 3.2-rc1; it showed no clear change in
> hackbench performance, while netperf got some benefit. But it seems that
> after the irqsafe_cpu_cmpxchg patch the results have changed; I am
> collecting these results.
>

netperf will also degrade with this change on some machines; there's no
clear heuristic that can be used to benefit all workloads when deciding
where to add a partial slab into the list. Cache hotness is great, but
your patch doesn't address situations where frees happen to a partial
slab such that it may be entirely free (or at least below your 1:4 inuse
to nr_objs threshold) at the time you want to deactivate the cpu slab.

I had a patchset that iterated the partial list and found the "most free"
partial slab (and terminated prematurely if a threshold had been reached,
much like yours) and selected that one, and it helped netperf 2-3% in my
testing. So I disagree with determining where to add a partial slab to
the list at the time of free, because it doesn't infer its state at the
time of cpu slab deactivation.
* Re: [PATCH 1/3] slub: set a criteria for slub node partial adding
From: Shaohua Li @ 2011-12-07 5:11 UTC (permalink / raw)
To: David Rientjes
Cc: Shi, Alex, Christoph Lameter, penberg@kernel.org, linux-kernel@vger.kernel.org, linux-mm@kvack.org, Andi Kleen

On Wed, 2011-12-07 at 05:06 +0800, David Rientjes wrote:
> On Mon, 5 Dec 2011, Alex,Shi wrote:
>
> > The previous testing was based on 3.2-rc1; it showed no clear change in
> > hackbench performance, while netperf got some benefit. But it seems that
> > after the irqsafe_cpu_cmpxchg patch the results have changed; I am
> > collecting these results.
> >
>
> netperf will also degrade with this change on some machines; there's no
> clear heuristic that can be used to benefit all workloads when deciding
> where to add a partial slab into the list. Cache hotness is great, but
> your patch doesn't address situations where frees happen to a partial
> slab such that it may be entirely free (or at least below your 1:4 inuse
> to nr_objs threshold) at the time you want to deactivate the cpu slab.
>
> I had a patchset that iterated the partial list and found the "most free"
> partial slab (and terminated prematurely if a threshold had been reached,
> much like yours) and selected that one, and it helped netperf 2-3% in my
> testing. So I disagree with determining where to add a partial slab to
> the list at the time of free, because it doesn't infer its state at the
> time of cpu slab deactivation.

Interesting. I did a similar experiment before (trying to sort the pages
according to free object count), but it appears quite hard. The free
count of a page is dynamic, e.g. more objects can be freed while the page
is on the partial list. And in the netperf test, the partial list can
become very, very long. Can you post your patch? I definitely want to
look at it.

What I have found about the partial list is that it wastes a lot of
memory; my test shows about 50% of the memory is wasted. I'm thinking of
not always fetching the oldest page from the partial list, because the
chance that all objects of the oldest page can be freed is high. I
haven't done any test yet, so I wonder whether it could be helpful.
* Re: [PATCH 1/3] slub: set a criteria for slub node partial adding
From: David Rientjes @ 2011-12-07 7:28 UTC (permalink / raw)
To: Shaohua Li
Cc: Shi, Alex, Christoph Lameter, penberg@kernel.org, linux-kernel@vger.kernel.org, linux-mm@kvack.org, Andi Kleen

On Wed, 7 Dec 2011, Shaohua Li wrote:

> Interesting. I did a similar experiment before (trying to sort the pages
> according to free object count), but it appears quite hard. The free
> count of a page is dynamic, e.g. more objects can be freed while the page
> is on the partial list. And in the netperf test, the partial list can
> become very, very long. Can you post your patch? I definitely want to
> look at it.

It was over a couple of years ago and the slub code has changed
significantly since then, but you can see the general concept of the "slab
thrashing" problem with netperf and my solution back then:

http://marc.info/?l=linux-kernel&m=123839191416478
http://marc.info/?l=linux-kernel&m=123839203016592
http://marc.info/?l=linux-kernel&m=123839202916583

I also had a separate patchset that, instead of this approach, would just
iterate through the partial list in get_partial_node() looking for
anything where the number of free objects met a certain threshold, which
still defaulted to 25%, and instantly picked it. The overhead was taking
slab_lock() for each page, but that was nullified by the performance
speedup of using the alloc fastpath a majority of the time for both
kmalloc-256 and kmalloc-2k, when in the past it had only been able to
serve one or two allocs. If no partial slab met the threshold, the
slab_lock() of the partial slab with the most free objects is held and
that slab is returned instead.

> What I have found about the partial list is that it wastes a lot of
> memory.

That's not going to be helped with the above approach since we typically
try to fill a partial slab with many free objects, but it also won't be
severely impacted because if the threshold is kept small enough, then we
simply return the first partial slab that meets the criteria. That allows
the partial slabs at the end of the list to hopefully become mostly free.

And, for completeness, there's also a possibility that you have some
completely free slabs on the partial list that could be freed back to the
buddy allocator by decreasing min_partial by way of
/sys/kernel/slab/cache/min_partial at the risk of performance and then
invoke /sys/kernel/slab/cache/shrink to free the unused slabs.
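For readers who do not want to dig through the marc.info links, the scan
David describes could look roughly like the sketch below. This is a
hypothetical reconstruction from his description, not his actual patch,
and it omits the per-page slab_lock() handling he mentions:

	/*
	 * Hypothetical reconstruction of the scan described above: take the
	 * first partial slab with at least 25% of its objects free,
	 * otherwise fall back to the slab with the most free objects.
	 * Locking elided.
	 */
	static struct page *pick_partial(struct kmem_cache_node *n)
	{
		struct page *page, *best = NULL;
		int best_free = 0;

		list_for_each_entry(page, &n->partial, lru) {
			int free = page->objects - page->inuse;

			if (free * 4 >= page->objects)
				return page;	/* meets the 25% threshold */
			if (free > best_free) {
				best_free = free;
				best = page;
			}
		}
		return best;	/* most free objects seen, or NULL */
	}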
* Re: [PATCH 1/3] slub: set a criteria for slub node partial adding
From: Shaohua Li @ 2011-12-12 2:43 UTC (permalink / raw)
To: David Rientjes
Cc: Shi, Alex, Christoph Lameter, penberg@kernel.org, linux-kernel@vger.kernel.org, linux-mm@kvack.org, Andi Kleen

On Wed, 2011-12-07 at 15:28 +0800, David Rientjes wrote:
> On Wed, 7 Dec 2011, Shaohua Li wrote:
>
> > Interesting. I did a similar experiment before (trying to sort the pages
> > according to free object count), but it appears quite hard. The free
> > count of a page is dynamic, e.g. more objects can be freed while the page
> > is on the partial list. And in the netperf test, the partial list can
> > become very, very long. Can you post your patch? I definitely want to
> > look at it.
>
> It was over a couple of years ago and the slub code has changed
> significantly since then, but you can see the general concept of the "slab
> thrashing" problem with netperf and my solution back then:
>
> http://marc.info/?l=linux-kernel&m=123839191416478
> http://marc.info/?l=linux-kernel&m=123839203016592
> http://marc.info/?l=linux-kernel&m=123839202916583
>
> I also had a separate patchset that, instead of this approach, would just
> iterate through the partial list in get_partial_node() looking for
> anything where the number of free objects met a certain threshold, which
> still defaulted to 25%, and instantly picked it. The overhead was taking
> slab_lock() for each page, but that was nullified by the performance
> speedup of using the alloc fastpath a majority of the time for both
> kmalloc-256 and kmalloc-2k, when in the past it had only been able to
> serve one or two allocs. If no partial slab met the threshold, the
> slab_lock() of the partial slab with the most free objects is held and
> that slab is returned instead.

With the per-cpu partial list, I didn't see any workload that still
suffers from the list lock, so I suppose neither the thrashing approach
nor the pick-a-25%-used-slab approach helps. The per-cpu partial list is
flushed as a whole after s->cpu_partial objects are freed; this is a
little aggressive, because the per-cpu partial list needs to be refilled
again soon after an allocation. I experimented with separate per-cpu
alloc/free partial lists, which can avoid this. But again, I didn't see
any workload still suffering from the list lock issue, even with netperf,
which stresses slub a lot. Did you see such a workload?
* Re: [PATCH 1/3] slub: set a criteria for slub node partial adding
From: Alex,Shi @ 2011-12-12 4:14 UTC (permalink / raw)
To: Li, Shaohua
Cc: David Rientjes, Christoph Lameter, penberg@kernel.org, linux-kernel@vger.kernel.org, linux-mm@kvack.org, Andi Kleen

On Mon, 2011-12-12 at 10:43 +0800, Li, Shaohua wrote:
> On Wed, 2011-12-07 at 15:28 +0800, David Rientjes wrote:
> > On Wed, 7 Dec 2011, Shaohua Li wrote:
> >
> > > Interesting. I did a similar experiment before (trying to sort the
> > > pages according to free object count), but it appears quite hard.
> > > The free count of a page is dynamic, e.g. more objects can be freed
> > > while the page is on the partial list. And in the netperf test, the
> > > partial list can become very, very long. Can you post your patch?
> > > I definitely want to look at it.
> >
> > It was over a couple of years ago and the slub code has changed
> > significantly since then, but you can see the general concept of the
> > "slab thrashing" problem with netperf and my solution back then:
> >
> > http://marc.info/?l=linux-kernel&m=123839191416478
> > http://marc.info/?l=linux-kernel&m=123839203016592
> > http://marc.info/?l=linux-kernel&m=123839202916583
> >
> > I also had a separate patchset that, instead of this approach, would
> > just iterate through the partial list in get_partial_node() looking
> > for anything where the number of free objects met a certain threshold,
> > which still defaulted to 25%, and instantly picked it. The overhead
> > was taking slab_lock() for each page, but that was nullified by the
> > performance speedup of using the alloc fastpath a majority of the time
> > for both kmalloc-256 and kmalloc-2k, when in the past it had only been
> > able to serve one or two allocs. If no partial slab met the threshold,
> > the slab_lock() of the partial slab with the most free objects is held
> > and that slab is returned instead.
>
> With the per-cpu partial list, I didn't see any workload that still
> suffers from the list lock,

The merge error that you fixed in 3.2-rc1 for the hackbench regression
was due to adding the slab to the node partial head. And the hackbench
data show that the node partial list is still heavily used for
allocation.

/sys/kernel/slab/kmalloc-256/alloc_fastpath:225208640
/sys/kernel/slab/kmalloc-256/alloc_from_partial:5276300
/sys/kernel/slab/kmalloc-256/alloc_from_pcp:8326041
* Re: [PATCH 1/3] slub: set a criteria for slub node partial adding
From: Shaohua Li @ 2011-12-12 4:35 UTC (permalink / raw)
To: Shi, Alex
Cc: David Rientjes, Christoph Lameter, penberg@kernel.org, linux-kernel@vger.kernel.org, linux-mm@kvack.org, Andi Kleen

On Mon, 2011-12-12 at 12:14 +0800, Shi, Alex wrote:
> On Mon, 2011-12-12 at 10:43 +0800, Li, Shaohua wrote:
> > On Wed, 2011-12-07 at 15:28 +0800, David Rientjes wrote:
> > > On Wed, 7 Dec 2011, Shaohua Li wrote:
> > >
> > > > Interesting. I did a similar experiment before (trying to sort the
> > > > pages according to free object count), but it appears quite hard.
> > > > The free count of a page is dynamic, e.g. more objects can be
> > > > freed while the page is on the partial list. And in the netperf
> > > > test, the partial list can become very, very long. Can you post
> > > > your patch? I definitely want to look at it.
> > >
> > > It was over a couple of years ago and the slub code has changed
> > > significantly since then, but you can see the general concept of the
> > > "slab thrashing" problem with netperf and my solution back then:
> > >
> > > http://marc.info/?l=linux-kernel&m=123839191416478
> > > http://marc.info/?l=linux-kernel&m=123839203016592
> > > http://marc.info/?l=linux-kernel&m=123839202916583
> > >
> > > I also had a separate patchset that, instead of this approach, would
> > > just iterate through the partial list in get_partial_node() looking
> > > for anything where the number of free objects met a certain
> > > threshold, which still defaulted to 25%, and instantly picked it.
> > > The overhead was taking slab_lock() for each page, but that was
> > > nullified by the performance speedup of using the alloc fastpath a
> > > majority of the time for both kmalloc-256 and kmalloc-2k, when in
> > > the past it had only been able to serve one or two allocs. If no
> > > partial slab met the threshold, the slab_lock() of the partial slab
> > > with the most free objects is held and that slab is returned instead.
> >
> > With the per-cpu partial list, I didn't see any workload that still
> > suffers from the list lock,
>
> The merge error that you fixed in 3.2-rc1 for the hackbench regression
> was due to adding the slab to the node partial head. And the hackbench
> data show that the node partial list is still heavily used for
> allocation.

The patch is already in the base kernel; did you mean that even with it
you still saw the list locking issue with the latest kernel?
* Re: [PATCH 1/3] slub: set a criteria for slub node partial adding
From: Alex,Shi @ 2011-12-12 4:25 UTC (permalink / raw)
To: Li, Shaohua
Cc: David Rientjes, Christoph Lameter, penberg@kernel.org, linux-kernel@vger.kernel.org, linux-mm@kvack.org, Andi Kleen

On Mon, 2011-12-12 at 12:35 +0800, Li, Shaohua wrote:
> On Mon, 2011-12-12 at 12:14 +0800, Shi, Alex wrote:
> > On Mon, 2011-12-12 at 10:43 +0800, Li, Shaohua wrote:
> > > On Wed, 2011-12-07 at 15:28 +0800, David Rientjes wrote:
> > > > On Wed, 7 Dec 2011, Shaohua Li wrote:
> > > >
> > > > > Interesting. I did a similar experiment before (trying to sort
> > > > > the pages according to free object count), but it appears quite
> > > > > hard. The free count of a page is dynamic, e.g. more objects can
> > > > > be freed while the page is on the partial list. And in the
> > > > > netperf test, the partial list can become very, very long. Can
> > > > > you post your patch? I definitely want to look at it.
> > > >
> > > > It was over a couple of years ago and the slub code has changed
> > > > significantly since then, but you can see the general concept of
> > > > the "slab thrashing" problem with netperf and my solution back
> > > > then:
> > > >
> > > > http://marc.info/?l=linux-kernel&m=123839191416478
> > > > http://marc.info/?l=linux-kernel&m=123839203016592
> > > > http://marc.info/?l=linux-kernel&m=123839202916583
> > > >
> > > > I also had a separate patchset that, instead of this approach,
> > > > would just iterate through the partial list in get_partial_node()
> > > > looking for anything where the number of free objects met a
> > > > certain threshold, which still defaulted to 25%, and instantly
> > > > picked it. The overhead was taking slab_lock() for each page, but
> > > > that was nullified by the performance speedup of using the alloc
> > > > fastpath a majority of the time for both kmalloc-256 and
> > > > kmalloc-2k, when in the past it had only been able to serve one or
> > > > two allocs. If no partial slab met the threshold, the slab_lock()
> > > > of the partial slab with the most free objects is held and that
> > > > slab is returned instead.
> > >
> > > With the per-cpu partial list, I didn't see any workload that still
> > > suffers from the list lock,
> >
> > The merge error that you fixed in 3.2-rc1 for the hackbench regression
> > was due to adding the slab to the node partial head. And the hackbench
> > data show that the node partial list is still heavily used for
> > allocation.
> The patch is already in the base kernel; did you mean that even with it
> you still saw the list locking issue with the latest kernel?

Yes, list_lock still hurts performance. It would be helpful if you could
do some optimization for it.
* Re: [PATCH 1/3] slub: set a criteria for slub node partial adding
From: Shaohua Li @ 2011-12-12 4:48 UTC (permalink / raw)
To: Alex,Shi
Cc: David Rientjes, Christoph Lameter, penberg@kernel.org, linux-kernel@vger.kernel.org, linux-mm@kvack.org, Andi Kleen

On Mon, 2011-12-12 at 12:25 +0800, Alex,Shi wrote:
> On Mon, 2011-12-12 at 12:35 +0800, Li, Shaohua wrote:
> > On Mon, 2011-12-12 at 12:14 +0800, Shi, Alex wrote:
> > > On Mon, 2011-12-12 at 10:43 +0800, Li, Shaohua wrote:
> > > > On Wed, 2011-12-07 at 15:28 +0800, David Rientjes wrote:
> > > > > On Wed, 7 Dec 2011, Shaohua Li wrote:
> > > > >
> > > > > > Interesting. I did a similar experiment before (trying to sort
> > > > > > the pages according to free object count), but it appears
> > > > > > quite hard. The free count of a page is dynamic, e.g. more
> > > > > > objects can be freed while the page is on the partial list.
> > > > > > And in the netperf test, the partial list can become very,
> > > > > > very long. Can you post your patch? I definitely want to look
> > > > > > at it.
> > > > >
> > > > > It was over a couple of years ago and the slub code has changed
> > > > > significantly since then, but you can see the general concept of
> > > > > the "slab thrashing" problem with netperf and my solution back
> > > > > then:
> > > > >
> > > > > http://marc.info/?l=linux-kernel&m=123839191416478
> > > > > http://marc.info/?l=linux-kernel&m=123839203016592
> > > > > http://marc.info/?l=linux-kernel&m=123839202916583
> > > > >
> > > > > I also had a separate patchset that, instead of this approach,
> > > > > would just iterate through the partial list in
> > > > > get_partial_node() looking for anything where the number of free
> > > > > objects met a certain threshold, which still defaulted to 25%,
> > > > > and instantly picked it. The overhead was taking slab_lock() for
> > > > > each page, but that was nullified by the performance speedup of
> > > > > using the alloc fastpath a majority of the time for both
> > > > > kmalloc-256 and kmalloc-2k, when in the past it had only been
> > > > > able to serve one or two allocs. If no partial slab met the
> > > > > threshold, the slab_lock() of the partial slab with the most
> > > > > free objects is held and that slab is returned instead.
> > > >
> > > > With the per-cpu partial list, I didn't see any workload that
> > > > still suffers from the list lock,
> > >
> > > The merge error that you fixed in 3.2-rc1 for the hackbench
> > > regression was due to adding the slab to the node partial head. And
> > > the hackbench data show that the node partial list is still heavily
> > > used for allocation.
> > The patch is already in the base kernel; did you mean that even with
> > it you still saw the list locking issue with the latest kernel?
>
> Yes, list_lock still hurts performance. It would be helpful if you
> could do some optimization for it.

Please post the data and the workload. In my test, I didn't see the
locking take significant time with perf. The slub stats you posted in
your last mail show that most allocations go through the fast path.
* Re: [PATCH 1/3] slub: set a criteria for slub node partial adding
From: Alex,Shi @ 2011-12-12 6:17 UTC (permalink / raw)
To: Li, Shaohua
Cc: David Rientjes, Christoph Lameter, penberg@kernel.org, linux-kernel@vger.kernel.org, linux-mm@kvack.org, Andi Kleen

> > > > > With the per-cpu partial list, I didn't see any workload that
> > > > > still suffers from the list lock,
> > > >
> > > > The merge error that you fixed in 3.2-rc1 for the hackbench
> > > > regression was due to adding the slab to the node partial head.
> > > > And the hackbench data show that the node partial list is still
> > > > heavily used for allocation.
> > > The patch is already in the base kernel; did you mean that even with
> > > it you still saw the list locking issue with the latest kernel?
> >
> > Yes, list_lock still hurts performance. It would be helpful if you
> > could do some optimization for it.
> Please post the data and the workload. In my test, I didn't see the
> locking take significant time with perf. The slub stats you posted in
> your last mail show that most allocations go through the fast path.

It is 'hackbench 100 process 2000'. How can it be proved that there is
no locking time spent on list_lock? Does the following profile say
anything about slub allocation, together with my previous slub stat
data?

113517 total                          0.0164
 11065 unix_stream_recvmsg            5.9425
 10862 copy_user_generic_string     169.7188
  6011 __alloc_skb                   18.7844
  5857 __kmalloc_node_track_caller   14.3203
  5748 unix_stream_sendmsg            5.4124
  4841 kmem_cache_alloc_node         17.3513
  4494 skb_release_head_state        34.0455
  3697 __slab_free                    5.3194

Actually, we know the lock contention is there, because any change to
the node partial list causes a clear performance change in the hackbench
process benchmark. But as for direct evidence, any debugging may heavily
disturb its use and make the data unbelievable. So I'd like to hear any
suggestions on how to figure this out clearly.
* Re: [PATCH 1/3] slub: set a criteria for slub node partial adding
  2011-12-12  2:43 ` Shaohua Li
  2011-12-12  4:14   ` Alex,Shi
@ 2011-12-12  6:09   ` Eric Dumazet
  2011-12-14  1:29   ` David Rientjes
  2 siblings, 0 replies; 39+ messages in thread
From: Eric Dumazet @ 2011-12-12 6:09 UTC (permalink / raw)
  To: Shaohua Li
  Cc: David Rientjes, Shi, Alex, Christoph Lameter, penberg@kernel.org,
      linux-kernel@vger.kernel.org, linux-mm@kvack.org, Andi Kleen

On Monday, 12 December 2011 at 10:43 +0800, Shaohua Li wrote:
> With the per-cpu partial list, I didn't see any workload which is still
> suffering from the list lock, so I suppose both the thrashing approach
> and the pick-25%-used-slab approach don't help. The per-cpu partial list
> flushes the whole per-cpu partial list after s->cpu_partial objects are
> freed; this is a little aggressive, because the per-cpu partial list
> needs to be refilled again soon after an allocation. I did an experiment
> with separate per-cpu alloc/free partial lists, which can avoid this. But
> again, I didn't see any workload still suffering from the list lock issue,
> even with netperf, which stresses slub a lot. Did you see such a workload?

You can have such a workload using pktgen, to force stress on the receiver
and packet losses because the socket backlog limit is reached.

Sender machine : pktgen (sending UDP packets to target)

Receiver :
- SMP machine,
- mono-threaded application reading packets as fast as possible,
- but softirq handler running on a separate cpu.

Result : only a part of the allocated skbs (two slub allocations per skb)
are delivered to the socket; the others are freed by the softirq handler.

Using the net-next tree (where some drivers use the new build_skb() service
right before the packet is delivered to the network stack):

CPU0: softirq handler
CPU4: application cpu

$ cd /sys/kernel/slab/:t-0002048 ; grep . *
aliases:0
align:8
grep: alloc_calls: Function not implemented
alloc_fastpath:8199220 C0=8196915 C1=306 C2=63 C3=297 C4=319 C5=550 C6=722 C7=48
alloc_from_partial:13931406 C0=13931401 C3=1 C5=4
alloc_node_mismatch:0
alloc_refill:70871657 C0=70871629 C1=2 C3=3 C4=9 C5=11 C6=3
alloc_slab:1335 C0=1216 C1=17 C2=2 C3=15 C4=17 C5=22 C6=44 C7=2
alloc_slowpath:155455299 C0=155455144 C1=18 C2=1 C3=21 C4=27 C5=40 C6=47 C7=1
cache_dma:0
cmpxchg_double_cpu_fail:0
cmpxchg_double_fail:27341 C0=12769 C4=14572
cpu_partial:6
cpu_partial_alloc:70650909 C0=70650899 C3=3 C4=2 C5=4 C6=1
cpu_partial_free:136279924 C0=71504388 C1=13 C2=1 C3=52 C4=64775461 C5=6 C6=2 C7=1
cpuslab_flush:0
cpu_slabs:29
deactivate_bypass:84583642 C0=84583515 C1=16 C2=1 C3=18 C4=18 C5=29 C6=44 C7=1
deactivate_empty:570 C0=80 C3=34 C4=456
deactivate_full:0
deactivate_remote_frees:0
deactivate_to_head:0
deactivate_to_tail:0
destroy_by_rcu:0
free_add_partial:0
grep: free_calls: Function not implemented
free_fastpath:89153 C0=88972 C1=34 C2=35 C3=27 C4=12 C5=23 C6=30 C7=20
free_frozen:5971363 C0=554097 C1=196 C2=14 C3=730 C4=5416278 C5=16 C6=19 C7=13
free_remove_partial:401 C1=1 C4=400
free_slab:971 C0=80 C1=1 C3=34 C4=856
free_slowpath:92913113 C0=21090357 C1=212 C2=15 C3=784 C4=71821691 C5=19 C6=21 C7=14
hwcache_align:0
min_partial:5
objects:1873
object_size:2048
objects_partial:945
objs_per_slab:16
order:3
order_fallback:0
partial:306
poison:0
reclaim_account:0
red_zone:0
reserved:0
sanity_checks:0
slabs:364
slabs_cpu_partial:21(21) C0=3(3) C1=6(6) C2=1(1) C3=7(7) C5=2(2) C6=1(1) C7=1(1)
slab_size:2048
store_user:0
total_objects:5824
trace:0

$ cd /sys/kernel/slab/skbuff_head_cache ; grep . *
aliases:6
align:8
grep: alloc_calls: Function not implemented
alloc_fastpath:89181782 C0=89173048 C1=1599 C2=1357 C3=2140 C4=802 C5=675 C6=638 C7=1523
alloc_from_partial:412658 C0=412658
alloc_node_mismatch:0
alloc_refill:593417 C0=593189 C1=19 C2=15 C3=24 C4=51 C5=18 C6=17 C7=84
alloc_slab:2831313 C0=2831285 C1=2 C2=2 C3=2 C4=2 C5=12 C6=4 C7=4
alloc_slowpath:4430371 C0=4430112 C1=20 C2=17 C3=25 C4=57 C5=31 C6=21 C7=88
cache_dma:0
cmpxchg_double_cpu_fail:0
cmpxchg_double_fail:1 C0=1
cpu_partial:30
cpu_partial_alloc:592991 C0=592981 C2=1 C4=5 C5=2 C6=1 C7=1
cpu_partial_free:4429836 C0=592981 C1=25 C2=19 C3=23 C4=3836767 C5=6 C6=8 C7=7
cpuslab_flush:0
cpu_slabs:107
deactivate_bypass:3836954 C0=3836923 C1=1 C2=2 C3=1 C4=6 C5=13 C6=4 C7=4
deactivate_empty:2831168 C4=2831168
deactivate_full:0
deactivate_remote_frees:0
deactivate_to_head:0
deactivate_to_tail:0
destroy_by_rcu:0
free_add_partial:0
grep: free_calls: Function not implemented
free_fastpath:21192924 C0=21186268 C1=1420 C2=1204 C3=1966 C4=572 C5=349 C6=380 C7=765
free_frozen:67988498 C0=516 C1=121 C2=85 C3=841 C4=67986468 C5=215 C6=76 C7=176
free_remove_partial:18 C4=18
free_slab:2831186 C4=2831186
free_slowpath:71825749 C0=609 C1=146 C2=104 C3=864 C4=71823538 C5=221 C6=84 C7=183
hwcache_align:0
min_partial:5
objects:2494
object_size:192
objects_partial:121
objs_per_slab:21
order:0
order_fallback:0
partial:14
poison:0
reclaim_account:0
red_zone:0
reserved:0
sanity_checks:0
slabs:127
slabs_cpu_partial:99(99) C1=25(25) C2=18(18) C3=23(23) C4=16(16) C5=4(4) C6=7(7) C7=6(6)
slab_size:192
store_user:0
total_objects:2667
trace:0

^ permalink raw reply	[flat|nested] 39+ messages in thread
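The asymmetry Eric describes is visible directly in those per-cpu columns:
nearly all allocations land on C0 (the softirq cpu) while nearly all slow-path
frees land on C4 (the application cpu). The little parser below is only an
illustration for re-reading dumps in the format pasted above (one
"name:total C0=... C1=..." counter per line on stdin); it prints each cpu's
share of a counter so the producer/consumer split stands out.

#include <stdio.h>
#include <string.h>

int main(void)
{
	char line[4096];

	while (fgets(line, sizeof(line), stdin)) {
		char name[128];
		unsigned long long total;
		const char *p = line;

		/* skip lines that are not "name:number ..." counters */
		if (sscanf(line, "%127[^:]:%llu", name, &total) != 2 || !total)
			continue;

		printf("%-24s total=%llu\n", name, total);
		while ((p = strstr(p, " C")) != NULL) {
			int cpu;
			unsigned long long val;

			if (sscanf(p, " C%d=%llu", &cpu, &val) == 2)
				printf("    C%d: %5.1f%%\n",
				       cpu, 100.0 * val / total);
			p += 2;
		}
	}
	return 0;
}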
* Re: [PATCH 1/3] slub: set a criteria for slub node partial adding
  2011-12-12  2:43             ` Shaohua Li
  2011-12-12  4:14               ` Alex,Shi
  2011-12-12  6:09               ` Eric Dumazet
@ 2011-12-14  1:29               ` David Rientjes
  2011-12-14  2:43                 ` Shaohua Li
  2 siblings, 1 reply; 39+ messages in thread
From: David Rientjes @ 2011-12-14 1:29 UTC (permalink / raw)
  To: Shaohua Li
  Cc: Shi, Alex, Christoph Lameter, penberg@kernel.org,
      linux-kernel@vger.kernel.org, linux-mm@kvack.org, Andi Kleen

On Mon, 12 Dec 2011, Shaohua Li wrote:

> With the per-cpu partial list, I didn't see any workload which is still
> suffering from the list lock, so I suppose both the thrashing approach
> and the pick-25%-used-slab approach don't help.

This doesn't necessarily have anything to do with contention on list_lock,
it has to do with the fact that ~99% of allocations come from the slowpath
since the cpu slab only has one free object when it is activated; that's
what the statistics indicated for kmalloc-256 and kmalloc-2k. That's what
I called "slab thrashing": the continual deactivation of the cpu slab and
picking from the partial list a slab that has only one or two free objects,
causing the vast majority of allocations to require the slowpath.

^ permalink raw reply	[flat|nested] 39+ messages in thread
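To put a rough number on that (this model is an illustration, not something
from the thread): if each activated slab only yields k usable objects before
it has to be deactivated again, roughly one allocation in k pays for a
slow-path refill, so the slow-path share is about 1/k.

#include <stdio.h>

int main(void)
{
	/* assumed values for free objects found per activated slab */
	int k_values[] = { 1, 2, 4, 8, 16 };

	for (unsigned int i = 0; i < sizeof(k_values) / sizeof(k_values[0]); i++)
		printf("k = %2d free objects per activation -> ~%5.1f%% slow-path allocations\n",
		       k_values[i], 100.0 / k_values[i]);
	return 0;
}

With k = 1, which is what the kmalloc-256 and kmalloc-2k statistics suggested,
essentially every allocation takes the slow path; a 25%-free threshold raises
k and shrinks that share accordingly.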
* Re: [PATCH 1/3] slub: set a criteria for slub node partial adding 2011-12-14 1:29 ` David Rientjes @ 2011-12-14 2:43 ` Shaohua Li 2011-12-14 2:38 ` David Rientjes 0 siblings, 1 reply; 39+ messages in thread From: Shaohua Li @ 2011-12-14 2:43 UTC (permalink / raw) To: David Rientjes Cc: Shi, Alex, Christoph Lameter, penberg@kernel.org, linux-kernel@vger.kernel.org, linux-mm@kvack.org, Andi Kleen On Wed, 2011-12-14 at 09:29 +0800, David Rientjes wrote: > On Mon, 12 Dec 2011, Shaohua Li wrote: > > > With the per-cpu partial list, I didn't see any workload which is still > > suffering from the list lock, so I suppose both the trashing approach > > and pick 25% used slab approach don't help. > > This doesn't necessarily have anything to do with contention on list_lock, > it has to do with the fact that ~99% of allocations come from the slowpath > since the cpu slab only has one free object when it is activated, that's > what the statistics indicated for kmalloc-256 and kmalloc-2k. That's what > I called "slab thrashing": the continual deactivation of the cpu slab and > picking from the partial list that would only have one or two free objects > causing the vast majority of allocations to require the slowpath. if vast majority of allocation needs picking from partial list of node, the list_lock will have contention too. But I'd say avoiding the slab thrashing does increase fastpath. How much it can improve performance I don't know. The slowpath (not involving list_lock case, so picking per-cpu partial list) is already _very_ fast these days. -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Fight unfair telecom internet charges in Canada: sign http://stopthemeter.ca/ Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a> ^ permalink raw reply [flat|nested] 39+ messages in thread
* Re: [PATCH 1/3] slub: set a criteria for slub node partial adding 2011-12-14 2:43 ` Shaohua Li @ 2011-12-14 2:38 ` David Rientjes 0 siblings, 0 replies; 39+ messages in thread From: David Rientjes @ 2011-12-14 2:38 UTC (permalink / raw) To: Shaohua Li Cc: Shi, Alex, Christoph Lameter, penberg@kernel.org, linux-kernel@vger.kernel.org, linux-mm@kvack.org, Andi Kleen On Wed, 14 Dec 2011, Shaohua Li wrote: > if vast majority of allocation needs picking from partial list of node, > the list_lock will have contention too. But I'd say avoiding the slab > thrashing does increase fastpath. Right, that's why my 2009 patchset would attempt to grab the partial slab with the highest number of free objects to a certain threshold before falling back to others and it improved performance somewhat. This was with the per-node partial lists, however, and the slowpath has been significantly rewritten since then. -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Fight unfair telecom internet charges in Canada: sign http://stopthemeter.ca/ Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a> ^ permalink raw reply [flat|nested] 39+ messages in thread
* Re: [PATCH 1/3] slub: set a criteria for slub node partial adding
  2011-12-02 14:43 ` Christoph Lameter
  2011-12-05  9:22   ` Alex,Shi
@ 2011-12-09  8:30   ` Alex,Shi
  2011-12-09 10:10     ` David Rientjes
  1 sibling, 1 reply; 39+ messages in thread
From: Alex,Shi @ 2011-12-09 8:30 UTC (permalink / raw)
  To: Christoph Lameter, David Rientjes
  Cc: penberg@kernel.org, linux-kernel@vger.kernel.org,
      linux-mm@kvack.org, Eric Dumazet

On Fri, 2011-12-02 at 22:43 +0800, Christoph Lameter wrote:
> On Fri, 2 Dec 2011, Alex Shi wrote:
>
> > From: Alex Shi <alexs@intel.com>
> >
> > Times performance regression were due to slub add to node partial head
> > or tail. That inspired me to do tunning on the node partial adding, to
> > set a criteria for head or tail position selection when do partial
> > adding.
> > My experiment show, when used objects is less than 1/4 total objects
> > of slub performance will get about 1.5% improvement on netperf loopback
> > testing with 2048 clients, wherever on our 4 or 2 sockets platforms,
> > includes sandbridge or core2.
>
> The number of free objects in a slab may have nothing to do with cache
> hotness of all objects in the slab. You can only be sure that one object
> (the one that was freed) is cache hot. Netperf may use them in sequence
> and therefore you are likely to get series of frees on the same slab
> page. How are other benchmarks affected by this change?

I did some experiments on the add_partial decision against rc4: putting the
slab at the head or the tail of the node partial list according to its free
objects, or, as Eric suggested, combining that with the existing parameter,
like below:

 	n->nr_partial++;
-	if (tail == DEACTIVATE_TO_TAIL)
+	if (tail == DEACTIVATE_TO_TAIL ||
+			page->inuse > page->objects / 2)
 		list_add_tail(&page->lru, &n->partial);
 	else
 		list_add(&page->lru, &n->partial);

But the results were not what I expected. When we put every slab at the tail
of the node partial list, we get the best performance, even if it is only a
slight improvement:

 {
 	n->nr_partial++;
-	if (tail == DEACTIVATE_TO_TAIL)
-		list_add_tail(&page->lru, &n->partial);
-	else
-		list_add(&page->lru, &n->partial);
+	list_add_tail(&page->lru, &n->partial);
 }

On the hackbench process benchmark, this change brings about a 2% improvement
on our WSM-EP machine and about 1% on our SNB-EP and NHM-EX machines, with no
clear effect on the Core2 machine:

	./hackbench 100 process 2000

For multi-client loopback netperf, there is only a doubtful 1% improvement on
our 2-socket machine; the other machines show no clear effect.

But when I checked the deactivate_to_head/to_tail statistics on the original
code, to_head is hit only hundreds or thousands of times, while to_tail is
hit tens of millions of times.

David, would you like to try the change above, i.e. moving every slab to the
tail of the node partial list?

The add_partial statistics collection patch:
---
commit 1ff731282acb521f3a7c2e3fb94d35ec4d0ff07e
Author: Alex Shi <alex.shi@intel.com>
Date:   Fri Dec 9 18:12:14 2011 +0800

    slub: statistics collection for add_partial

diff --git a/mm/slub.c b/mm/slub.c
index 5843846..a2b1143 100644
--- a/mm/slub.c
+++ b/mm/slub.c
@@ -1904,10 +1904,11 @@ static void unfreeze_partials(struct kmem_cache *s)
 		if (l != m) {
 			if (l == M_PARTIAL)
 				remove_partial(n, page);
-			else
+			else {
 				add_partial(n, page,
 					DEACTIVATE_TO_TAIL);
-
+				stat(s, DEACTIVATE_TO_TAIL);
+			}
 			l = m;
 		}
@@ -2480,6 +2481,7 @@ static void __slab_free(struct kmem_cache *s, struct page *page,
 			remove_full(s, page);
 			add_partial(n, page, DEACTIVATE_TO_TAIL);
 			stat(s, FREE_ADD_PARTIAL);
+			stat(s, DEACTIVATE_TO_TAIL);
 		}
 	}
 	spin_unlock_irqrestore(&n->list_lock, flags);

^ permalink raw reply related	[flat|nested] 39+ messages in thread
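For anyone repeating this comparison, the counters Alex is reading live under
/sys/kernel/slab/<cache>/ and only exist on kernels built with
CONFIG_SLUB_STATS=y (and only while the to_head/to_tail counters themselves
still exist). The reader below is only a sketch; the cache name is an example
argument, and, as discussed in this thread, performance runs should be done
on a separate kernel with the statistics disabled.

#include <stdio.h>

static unsigned long long read_stat(const char *cache, const char *name)
{
	char path[256];
	unsigned long long val = 0;
	FILE *f;

	snprintf(path, sizeof(path), "/sys/kernel/slab/%s/%s", cache, name);
	f = fopen(path, "r");
	if (!f)
		return 0;
	/* each stat file starts with the total, then the per-cpu counts */
	if (fscanf(f, "%llu", &val) != 1)
		val = 0;
	fclose(f);
	return val;
}

static double pct(unsigned long long part, unsigned long long whole)
{
	return whole ? 100.0 * part / whole : 0.0;
}

int main(int argc, char **argv)
{
	const char *cache = argc > 1 ? argv[1] : "kmalloc-256";
	unsigned long long a_fast = read_stat(cache, "alloc_fastpath");
	unsigned long long a_slow = read_stat(cache, "alloc_slowpath");
	unsigned long long f_fast = read_stat(cache, "free_fastpath");
	unsigned long long f_slow = read_stat(cache, "free_slowpath");
	unsigned long long head = read_stat(cache, "deactivate_to_head");
	unsigned long long tail = read_stat(cache, "deactivate_to_tail");

	printf("%s: alloc slowpath %.1f%%, free slowpath %.1f%%\n", cache,
	       pct(a_slow, a_slow + a_fast), pct(f_slow, f_slow + f_fast));
	printf("deactivate_to_head=%llu deactivate_to_tail=%llu\n", head, tail);
	return 0;
}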
* Re: [PATCH 1/3] slub: set a criteria for slub node partial adding 2011-12-09 8:30 ` Alex,Shi @ 2011-12-09 10:10 ` David Rientjes 2011-12-09 13:40 ` Shi, Alex 2011-12-13 13:01 ` Shi, Alex 0 siblings, 2 replies; 39+ messages in thread From: David Rientjes @ 2011-12-09 10:10 UTC (permalink / raw) To: Alex,Shi Cc: Christoph Lameter, penberg@kernel.org, linux-kernel@vger.kernel.org, linux-mm@kvack.org, Eric Dumazet On Fri, 9 Dec 2011, Alex,Shi wrote: > I did some experiments on add_partial judgment against rc4, like to put > the slub into node partial head or tail according to free objects, or > like Eric's suggest to combine the external parameter, like below: > > n->nr_partial++; > - if (tail == DEACTIVATE_TO_TAIL) > + if (tail == DEACTIVATE_TO_TAIL || > + page->inuse > page->objects /2) > list_add_tail(&page->lru, &n->partial); > else > list_add(&page->lru, &n->partial); > > But the result is out of my expectation before. I don't think you'll get consistent results for all workloads with something like this, some things may appear better and other things may appear worse. That's why I've always disagreed with determining whether it should be added to the head or to the tail at the time of deactivation: you know nothing about frees happening to that slab subsequent to the decision you've made. The only thing that's guaranteed is that you've through cache hot objects out the window and potentially increased the amount of internally fragmented slabs and/or unnecessarily long partial lists. > Now we set all of slub > into the tail of node partial, we get the best performance, even it is > just a slight improvement. > > { > n->nr_partial++; > - if (tail == DEACTIVATE_TO_TAIL) > - list_add_tail(&page->lru, &n->partial); > - else > - list_add(&page->lru, &n->partial); > + list_add_tail(&page->lru, &n->partial); > } > > This change can bring about 2% improvement on our WSM-ep machine, and 1% > improvement on our SNB-ep and NHM-ex machine. and no clear effect for > core2 machine. on hackbench process benchmark. > > ./hackbench 100 process 2000 > > For multiple clients loopback netperf, only a suspicious 1% improvement > on our 2 sockets machine. and others have no clear effect. > > But, when I check the deactivate_to_head/to_tail statistics on original > code, the to_head is just hundreds or thousands times, while to_tail is > called about teens millions times. > > David, could you like to try above change? move all slub to partial > tail. 
> > add_partial statistics collection patch: > --- > commit 1ff731282acb521f3a7c2e3fb94d35ec4d0ff07e > Author: Alex Shi <alex.shi@intel.com> > Date: Fri Dec 9 18:12:14 2011 +0800 > > slub: statistics collection for add_partial > > diff --git a/mm/slub.c b/mm/slub.c > index 5843846..a2b1143 100644 > --- a/mm/slub.c > +++ b/mm/slub.c > @@ -1904,10 +1904,11 @@ static void unfreeze_partials(struct kmem_cache *s) > if (l != m) { > if (l == M_PARTIAL) > remove_partial(n, page); > - else > + else { > add_partial(n, page, > DEACTIVATE_TO_TAIL); > - > + stat(s, DEACTIVATE_TO_TAIL); > + } > l = m; > } > > @@ -2480,6 +2481,7 @@ static void __slab_free(struct kmem_cache *s, struct page *page, > remove_full(s, page); > add_partial(n, page, DEACTIVATE_TO_TAIL); > stat(s, FREE_ADD_PARTIAL); > + stat(s, DEACTIVATE_TO_TAIL); > } > } > spin_unlock_irqrestore(&n->list_lock, flags); Not sure what you're asking me to test, you would like this: { n->nr_partial++; - if (tail == DEACTIVATE_TO_TAIL) - list_add_tail(&page->lru, &n->partial); - else - list_add(&page->lru, &n->partial); + list_add_tail(&page->lru, &n->partial); } with the statistics patch above? I typically run with CONFIG_SLUB_STATS disabled since it impacts performance so heavily and I'm not sure what information you're looking for with regards to those stats. -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Fight unfair telecom internet charges in Canada: sign http://stopthemeter.ca/ Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a> ^ permalink raw reply [flat|nested] 39+ messages in thread
* RE: [PATCH 1/3] slub: set a criteria for slub node partial adding
  2011-12-09 10:10 ` David Rientjes
@ 2011-12-09 13:40   ` Shi, Alex
  2011-12-14  1:38     ` David Rientjes
  2011-12-13 13:01   ` Shi, Alex
  1 sibling, 1 reply; 39+ messages in thread
From: Shi, Alex @ 2011-12-09 13:40 UTC (permalink / raw)
  To: David Rientjes
  Cc: Christoph Lameter, penberg@kernel.org, linux-kernel@vger.kernel.org,
      linux-mm@kvack.org, Eric Dumazet

> > I did some experiments on the add_partial decision against rc4: putting
> > the slab at the head or the tail of the node partial list according to
> > its free objects, or, as Eric suggested, combining that with the existing
> > parameter, like below:
> >
> >  	n->nr_partial++;
> > -	if (tail == DEACTIVATE_TO_TAIL)
> > +	if (tail == DEACTIVATE_TO_TAIL ||
> > +			page->inuse > page->objects / 2)
> >  		list_add_tail(&page->lru, &n->partial);
> >  	else
> >  		list_add(&page->lru, &n->partial);
> >
> > But the results were not what I expected.
>
> I don't think you'll get consistent results for all workloads with
> something like this, some things may appear better and other things may
> appear worse. That's why I've always disagreed with determining whether
> it should be added to the head or to the tail at the time of deactivation:
> you know nothing about frees happening to that slab subsequent to the
> decision you've made. The only thing that's guaranteed is that you've
> thrown cache hot objects out the window and potentially increased the
> amount of internally fragmented slabs and/or unnecessarily long partial
> lists.

Saying the result was not what I originally expected does not mean my data
has a problem. :)

Of course any test shows some run-to-run variation. But this is benchmarking,
and there are plenty of techniques for keeping the standard deviation
acceptable: bring the system to a clean state, stop unnecessary services, use
separate working disks for the test, and so on. As for this data, on my
SNB-EP machine (the following numbers do not speak for Intel, they are just
my personal data), four hackbench runs with this patch give 5.59, 5.475,
5.47833 and 5.504, while results on the original rc4 range from 5.54 to 5.61.
The standard deviation shows the results are stable and believable on my
side.

But since, among our hundreds of benchmarks, only hackbench and loopback
netperf are sensitive to slub changes, and since it seems you have done some
testing in this area, I thought you might like to double-check with real
data. In fact, I also collected 'perf stat' cache miss/reference data, but it
fluctuates too much and is not as stable as hackbench itself.

> Not sure what you're asking me to test, you would like this:
>
> {
> 	n->nr_partial++;
> -	if (tail == DEACTIVATE_TO_TAIL)
> -		list_add_tail(&page->lru, &n->partial);
> -	else
> -		list_add(&page->lru, &n->partial);
> +	list_add_tail(&page->lru, &n->partial);
> }
>
> with the statistics patch above? I typically run with CONFIG_SLUB_STATS
> disabled since it impacts performance so heavily and I'm not sure what
> information you're looking for with regards to those stats.

No: when you collect performance data, please keep SLUB_STATS disabled in the
kernel config. The add_partial statistics collection patch was only meant to
show that the statistics I quoted do not include the add_partial() call in
early_kmem_cache_node_alloc(), while all the other add_partial() call sites
are covered. Of course, a kernel with statistics enabled cannot be used to
measure performance.

^ permalink raw reply	[flat|nested] 39+ messages in thread
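For readers wanting to sanity-check that claim, the spread of the runs is
easy to quantify. The snippet below is only an illustration using the four
patched hackbench timings quoted above; a 1-2% delta is convincing only if it
clearly exceeds the standard deviation of both the patched and the baseline
runs (build with -lm).

#include <math.h>
#include <stdio.h>

static void mean_stddev(const double *v, int n, double *mean, double *sd)
{
	double sum = 0.0, var = 0.0;

	for (int i = 0; i < n; i++)
		sum += v[i];
	*mean = sum / n;

	for (int i = 0; i < n; i++)
		var += (v[i] - *mean) * (v[i] - *mean);
	*sd = n > 1 ? sqrt(var / (n - 1)) : 0.0;
}

int main(void)
{
	/* the four patched hackbench runs reported above, in seconds */
	const double patched[] = { 5.59, 5.475, 5.47833, 5.504 };
	double mean, sd;

	mean_stddev(patched, 4, &mean, &sd);
	printf("patched: mean %.4f s, stddev %.4f s (%.2f%% of mean)\n",
	       mean, sd, 100.0 * sd / mean);
	return 0;
}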
* RE: [PATCH 1/3] slub: set a criteria for slub node partial adding 2011-12-09 13:40 ` Shi, Alex @ 2011-12-14 1:38 ` David Rientjes 2011-12-14 2:36 ` David Rientjes 0 siblings, 1 reply; 39+ messages in thread From: David Rientjes @ 2011-12-14 1:38 UTC (permalink / raw) To: Shi, Alex Cc: Christoph Lameter, penberg@kernel.org, linux-kernel@vger.kernel.org, linux-mm@kvack.org, Eric Dumazet On Fri, 9 Dec 2011, Shi, Alex wrote: > Of course any testing may have result variation. But it is benchmark > accordingly, and there are lot technical to tuning your testing to make > its stand division acceptable, like to sync your system in a clear > status, to close unnecessary services, to use separate working disks for > your testing etc. etc. For this data, like on my SNB-EP machine, (the > following data is not stands for Intel, it is just my personal data). I always run benchmarks with freshly booted machines and disabling all but the most basic and required userspace for my testing environment, I can assure you that my comparison of slab and slub on netperf TCP_RR isn't because of any noise from userspace. > 4 times result of hackbench on this patch are 5.59, 5.475, 5.47833, > 5.504 I haven't been running hackbench benchmarks, sorry. I was always under the assumption that slub still was slightly better than slab with hackbench since that was used as justification for it becoming the default allocator and also because Christoph had patches merged recently which improved its performance on slub. I've been speaking only about my history with netperf TCP_RR when using slub. > > Not sure what you're asking me to test, you would like this: > > > > { > > n->nr_partial++; > > - if (tail == DEACTIVATE_TO_TAIL) > > - list_add_tail(&page->lru, &n->partial); > > - else > > - list_add(&page->lru, &n->partial); > > + list_add_tail(&page->lru, &n->partial); > > } > > > > with the statistics patch above? I typically run with CONFIG_SLUB_STATS > > disabled since it impacts performance so heavily and I'm not sure what > > information you're looking for with regards to those stats. > > NO, when you collect data, please close SLUB_STAT in kernel config. > _to_head statistics collection patch just tell you, I collected the > statistics not include add_partial in early_kmem_cache_node_alloc(). And > other places of add_partial were covered. Of course, the kernel with > statistic can not be used to measure performance. > Ok, I'll benchmark netperf TCP_RR comparing Linus' latest -git both with and without the above change. It was confusing because you had three diffs in your email, I wasn't sure which or combination of which you wanted me to try :) -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Fight unfair telecom internet charges in Canada: sign http://stopthemeter.ca/ Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a> ^ permalink raw reply [flat|nested] 39+ messages in thread
* RE: [PATCH 1/3] slub: set a criteria for slub node partial adding
  2011-12-14  1:38 ` David Rientjes
@ 2011-12-14  2:36   ` David Rientjes
  2011-12-14  6:06     ` Alex,Shi
  0 siblings, 1 reply; 39+ messages in thread
From: David Rientjes @ 2011-12-14 2:36 UTC (permalink / raw)
  To: Shi, Alex
  Cc: Christoph Lameter, penberg@kernel.org, linux-kernel@vger.kernel.org,
      linux-mm@kvack.org, Eric Dumazet

On Tue, 13 Dec 2011, David Rientjes wrote:

> > > {
> > > 	n->nr_partial++;
> > > -	if (tail == DEACTIVATE_TO_TAIL)
> > > -		list_add_tail(&page->lru, &n->partial);
> > > -	else
> > > -		list_add(&page->lru, &n->partial);
> > > +	list_add_tail(&page->lru, &n->partial);
> > > }
> > >

2 machines (one netserver, one netperf), both with 16 cores and 64GB memory,
with netperf-2.4.5, comparing Linus' -git with and without this patch:

	threads		SLUB		SLUB+patch
	 16		116614		117213	(+0.5%)
	 32		216436		215065	(-0.6%)
	 48		299991		299399	(-0.2%)
	 64		373753		374617	(+0.2%)
	 80		435688		435765	(UNCH)
	 96		494630		496590	(+0.4%)
	112		546766		546259	(-0.1%)

This suggests the difference is within the noise, so this patch neither
helps nor hurts netperf on my setup, as expected.

^ permalink raw reply	[flat|nested] 39+ messages in thread
* RE: [PATCH 1/3] slub: set a criteria for slub node partial adding 2011-12-14 2:36 ` David Rientjes @ 2011-12-14 6:06 ` Alex,Shi 2011-12-14 6:44 ` Eric Dumazet 0 siblings, 1 reply; 39+ messages in thread From: Alex,Shi @ 2011-12-14 6:06 UTC (permalink / raw) To: David Rientjes Cc: Christoph Lameter, penberg@kernel.org, linux-kernel@vger.kernel.org, linux-mm@kvack.org, Eric Dumazet On Wed, 2011-12-14 at 10:36 +0800, David Rientjes wrote: > On Tue, 13 Dec 2011, David Rientjes wrote: > > > > > { > > > > n->nr_partial++; > > > > - if (tail == DEACTIVATE_TO_TAIL) > > > > - list_add_tail(&page->lru, &n->partial); > > > > - else > > > > - list_add(&page->lru, &n->partial); > > > > + list_add_tail(&page->lru, &n->partial); > > > > } > > > > > > 2 machines (one netserver, one netperf) both with 16 cores, 64GB memory > with netperf-2.4.5 comparing Linus' -git with and without this patch: > > threads SLUB SLUB+patch > 16 116614 117213 (+0.5%) > 32 216436 215065 (-0.6%) > 48 299991 299399 (-0.2%) > 64 373753 374617 (+0.2%) > 80 435688 435765 (UNCH) > 96 494630 496590 (+0.4%) > 112 546766 546259 (-0.1%) > > This suggests the difference is within the noise, so this patch neither > helps nor hurts netperf on my setup, as expected. Thanks for the data. Real netperf is hard to give enough press on SLUB. but as I mentioned before, I also didn't find real performance change on my loopback netperf testing. I retested hackbench again. about 1% performance increase still exists on my 2 sockets SNB/WSM and 4 sockets NHM. and no performance drop for other machines. Christoph, what's comments you like to offer for the results or for this code change? -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Fight unfair telecom internet charges in Canada: sign http://stopthemeter.ca/ Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a> ^ permalink raw reply [flat|nested] 39+ messages in thread
* RE: [PATCH 1/3] slub: set a criteria for slub node partial adding 2011-12-14 6:06 ` Alex,Shi @ 2011-12-14 6:44 ` Eric Dumazet 2011-12-14 6:47 ` Pekka Enberg 2011-12-14 6:56 ` Alex,Shi 0 siblings, 2 replies; 39+ messages in thread From: Eric Dumazet @ 2011-12-14 6:44 UTC (permalink / raw) To: Alex,Shi Cc: David Rientjes, Christoph Lameter, penberg@kernel.org, linux-kernel@vger.kernel.org, linux-mm@kvack.org Le mercredi 14 dA(C)cembre 2011 A 14:06 +0800, Alex,Shi a A(C)crit : > On Wed, 2011-12-14 at 10:36 +0800, David Rientjes wrote: > > On Tue, 13 Dec 2011, David Rientjes wrote: > > > > > > > { > > > > > n->nr_partial++; > > > > > - if (tail == DEACTIVATE_TO_TAIL) > > > > > - list_add_tail(&page->lru, &n->partial); > > > > > - else > > > > > - list_add(&page->lru, &n->partial); > > > > > + list_add_tail(&page->lru, &n->partial); > > > > > } > > > > > > > > > 2 machines (one netserver, one netperf) both with 16 cores, 64GB memory > > with netperf-2.4.5 comparing Linus' -git with and without this patch: > > > > threads SLUB SLUB+patch > > 16 116614 117213 (+0.5%) > > 32 216436 215065 (-0.6%) > > 48 299991 299399 (-0.2%) > > 64 373753 374617 (+0.2%) > > 80 435688 435765 (UNCH) > > 96 494630 496590 (+0.4%) > > 112 546766 546259 (-0.1%) > > > > This suggests the difference is within the noise, so this patch neither > > helps nor hurts netperf on my setup, as expected. > > Thanks for the data. Real netperf is hard to give enough press on SLUB. > but as I mentioned before, I also didn't find real performance change on > my loopback netperf testing. > > I retested hackbench again. about 1% performance increase still exists > on my 2 sockets SNB/WSM and 4 sockets NHM. and no performance drop for > other machines. > > Christoph, what's comments you like to offer for the results or for this > code change? I believe far more aggressive mechanism is needed to help these workloads. Please note that the COLD/HOT page concept is not very well used in kernel, because its not really obvious that some decisions are always good (or maybe this is not well known) We should try to batch things a bit, instead of doing a very small unit of work in slow path. We now have a very fast fastpath, but inefficient slow path. SLAB has a litle cache per cpu, we could add one to SLUB for freed objects, not belonging to current slab. This could avoid all these activate/deactivate overhead. -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Fight unfair telecom internet charges in Canada: sign http://stopthemeter.ca/ Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a> ^ permalink raw reply [flat|nested] 39+ messages in thread
* RE: [PATCH 1/3] slub: set a criteria for slub node partial adding 2011-12-14 6:44 ` Eric Dumazet @ 2011-12-14 6:47 ` Pekka Enberg 2011-12-14 14:53 ` Christoph Lameter 2011-12-14 6:56 ` Alex,Shi 1 sibling, 1 reply; 39+ messages in thread From: Pekka Enberg @ 2011-12-14 6:47 UTC (permalink / raw) To: Eric Dumazet Cc: Alex,Shi, David Rientjes, Christoph Lameter, linux-kernel@vger.kernel.org, linux-mm@kvack.org On Wed, 14 Dec 2011, Eric Dumazet wrote: > We should try to batch things a bit, instead of doing a very small unit > of work in slow path. > > We now have a very fast fastpath, but inefficient slow path. > > SLAB has a litle cache per cpu, we could add one to SLUB for freed > objects, not belonging to current slab. This could avoid all these > activate/deactivate overhead. Yeah, this is definitely worth looking at. Pekka -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Fight unfair telecom internet charges in Canada: sign http://stopthemeter.ca/ Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a> ^ permalink raw reply [flat|nested] 39+ messages in thread
* RE: [PATCH 1/3] slub: set a criteria for slub node partial adding 2011-12-14 6:47 ` Pekka Enberg @ 2011-12-14 14:53 ` Christoph Lameter 0 siblings, 0 replies; 39+ messages in thread From: Christoph Lameter @ 2011-12-14 14:53 UTC (permalink / raw) To: Pekka Enberg Cc: Eric Dumazet, Alex,Shi, David Rientjes, linux-kernel@vger.kernel.org, linux-mm@kvack.org On Wed, 14 Dec 2011, Pekka Enberg wrote: > On Wed, 14 Dec 2011, Eric Dumazet wrote: > > We should try to batch things a bit, instead of doing a very small unit > > of work in slow path. > > > > We now have a very fast fastpath, but inefficient slow path. > > > > SLAB has a litle cache per cpu, we could add one to SLUB for freed > > objects, not belonging to current slab. This could avoid all these > > activate/deactivate overhead. > > Yeah, this is definitely worth looking at. We have been down this road repeatedly. Nick tried it, I tried it and neither got us to something we liked. Please consult the archives. There was a whole patch series last year that I did introducing per cpu caches which ended up in the "unified" patches. See the archives for the various attempts please. -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Fight unfair telecom internet charges in Canada: sign http://stopthemeter.ca/ Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a> ^ permalink raw reply [flat|nested] 39+ messages in thread
* RE: [PATCH 1/3] slub: set a criteria for slub node partial adding 2011-12-14 6:44 ` Eric Dumazet 2011-12-14 6:47 ` Pekka Enberg @ 2011-12-14 6:56 ` Alex,Shi 2011-12-14 14:59 ` Christoph Lameter 1 sibling, 1 reply; 39+ messages in thread From: Alex,Shi @ 2011-12-14 6:56 UTC (permalink / raw) To: Eric Dumazet Cc: David Rientjes, Christoph Lameter, penberg@kernel.org, linux-kernel@vger.kernel.org, linux-mm@kvack.org > > Thanks for the data. Real netperf is hard to give enough press on SLUB. > > but as I mentioned before, I also didn't find real performance change on > > my loopback netperf testing. > > > > I retested hackbench again. about 1% performance increase still exists > > on my 2 sockets SNB/WSM and 4 sockets NHM. and no performance drop for > > other machines. > > > > Christoph, what's comments you like to offer for the results or for this > > code change? > > I believe far more aggressive mechanism is needed to help these > workloads. > > Please note that the COLD/HOT page concept is not very well used in > kernel, because its not really obvious that some decisions are always > good (or maybe this is not well known) Hope Christoph know everything of SLUB. :) > > We should try to batch things a bit, instead of doing a very small unit > of work in slow path. > > We now have a very fast fastpath, but inefficient slow path. > > SLAB has a litle cache per cpu, we could add one to SLUB for freed > objects, not belonging to current slab. This could avoid all these > activate/deactivate overhead. Maybe worth to try or maybe Christoph had studied this? -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Fight unfair telecom internet charges in Canada: sign http://stopthemeter.ca/ Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a> ^ permalink raw reply [flat|nested] 39+ messages in thread
* RE: [PATCH 1/3] slub: set a criteria for slub node partial adding 2011-12-14 6:56 ` Alex,Shi @ 2011-12-14 14:59 ` Christoph Lameter 2011-12-14 17:33 ` Eric Dumazet 0 siblings, 1 reply; 39+ messages in thread From: Christoph Lameter @ 2011-12-14 14:59 UTC (permalink / raw) To: Alex,Shi Cc: Eric Dumazet, David Rientjes, penberg@kernel.org, linux-kernel@vger.kernel.org, linux-mm@kvack.org On Wed, 14 Dec 2011, Alex,Shi wrote: > > Please note that the COLD/HOT page concept is not very well used in > > kernel, because its not really obvious that some decisions are always > > good (or maybe this is not well known) > > Hope Christoph know everything of SLUB. :) Well yes we have been back and forth on hot/cold page things repeatedly in the page allocator as well. Caching is not always good. There are particular loads that usually do very well with caching. Others do not. Caching can cause useless processing and pollute caching. It is also a cause for OS noise due to cache maintenance at random (for the app guys) times where they do not want that to happen. > > We should try to batch things a bit, instead of doing a very small unit > > of work in slow path. > > > > We now have a very fast fastpath, but inefficient slow path. > > > > SLAB has a litle cache per cpu, we could add one to SLUB for freed > > objects, not belonging to current slab. This could avoid all these > > activate/deactivate overhead. > > Maybe worth to try or maybe Christoph had studied this? Many people have done patchsets like this. There are various permutations on SL?B (I dont remember them all SLEB, SLXB, SLQB etc) that have been proposed over the years. Caches tend to grow and get rather numerous (see SLAB) and the design of SLUB was to counter that. There is a reason it was called SLUB. The U stands for Unqueued and was intended to avoid the excessive caching problems that I ended up when reworking SLAB for NUMA support. -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Fight unfair telecom internet charges in Canada: sign http://stopthemeter.ca/ Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a> ^ permalink raw reply [flat|nested] 39+ messages in thread
* RE: [PATCH 1/3] slub: set a criteria for slub node partial adding
  2011-12-14 14:59 ` Christoph Lameter
@ 2011-12-14 17:33   ` Eric Dumazet
  2011-12-14 18:26     ` Christoph Lameter
  0 siblings, 1 reply; 39+ messages in thread
From: Eric Dumazet @ 2011-12-14 17:33 UTC (permalink / raw)
  To: Christoph Lameter
  Cc: Alex,Shi, David Rientjes, penberg@kernel.org,
      linux-kernel@vger.kernel.org, linux-mm@kvack.org

On Wednesday, 14 December 2011 at 08:59 -0600, Christoph Lameter wrote:

> Many people have done patchsets like this.

Things changed a lot recently. There is room for improvements.

At least we can exchange ideas _before_ coding a new patchset?

> There are various permutations
> on SL?B (I dont remember them all SLEB, SLXB, SLQB etc) that have been
> proposed over the years. Caches tend to grow and get rather numerous (see
> SLAB) and the design of SLUB was to counter that. There is a reason it was
> called SLUB. The U stands for Unqueued and was intended to avoid the
> excessive caching problems that I ended up with when reworking SLAB for
> NUMA support.

The current 'one active slab' per cpu is a one-level cache. It really is a
_queue_ containing a fair amount of objects. 'Unqueued' in SLUB is marketing
hype :=)

When we have one producer (say a network interrupt handler) feeding millions
of network packets to N consumers (other cpus), each free is slowpath. They
all want to touch page->freelist, and they slow the producer down as well
because of false sharing.

Furthermore, when the producer hits socket queue limits, it mostly frees skbs
that were allocated in the 'not very recent past', and its own freeing also
hits the slow path (because the memory blocks of the skb are no longer in the
current active slab). It competes with the frees done by consumers as well.

Adding a second _small_ cache to queue X objects per cpu would help to keep
the active slab longer and more 'private' (its 'struct page' not touched too
often by other cpus) for a given cpu. It would limit the number of cache line
misses we currently take because of conflicting accesses to page->freelist
just to push one _single_ object (and, to a lesser extent, to n->list_lock).

My initial idea would be to use a cache of 4 slots per cpu, but be able to
queue many objects per slot, if they all belong to the same slab/page.

In case we must make room in the cache (all slots occupied), we take one slot
and dequeue all its objects in one round. No extra latency compared to the
current scheme.

^ permalink raw reply	[flat|nested] 39+ messages in thread
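To make the idea concrete, here is a stand-alone toy model of such a cache
(my sketch, not kernel code and not any of the earlier SL?B patchsets): four
slots per cpu, each slot batching freed objects that belong to the same slab,
and a whole slot flushed in one round when room is needed. Real kernel code
would have to deal with struct page, cmpxchg-based freelists, NUMA and
preemption, none of which is modeled here.

#include <stdio.h>
#include <stddef.h>

#define SLOTS		4
#define SLOT_OBJS	64

struct slot {
	void *slab;			/* stand-in for the object's struct page */
	void *objs[SLOT_OBJS];		/* batched objects belonging to that slab */
	int nr;
};

struct free_cache {
	struct slot slot[SLOTS];
};

static int flushes;			/* how many "expensive" slab touches we did */

/* stand-in for pushing a whole batch back onto page->freelist in one round */
static void flush_slot(struct slot *s)
{
	printf("flush: %d objects back to slab %p in one round\n", s->nr, s->slab);
	flushes++;
	s->slab = NULL;
	s->nr = 0;
}

static void cache_free(struct free_cache *c, void *slab, void *obj)
{
	struct slot *s = NULL, *victim = &c->slot[0];

	/* prefer a slot already batching this slab, else an empty slot */
	for (int i = 0; i < SLOTS && !s; i++)
		if (c->slot[i].slab == slab)
			s = &c->slot[i];
	for (int i = 0; i < SLOTS && !s; i++)
		if (c->slot[i].slab == NULL)
			s = &c->slot[i];

	if (!s) {
		/* all slots busy with other slabs: flush the smallest batch */
		for (int i = 1; i < SLOTS; i++)
			if (c->slot[i].nr < victim->nr)
				victim = &c->slot[i];
		flush_slot(victim);
		s = victim;
	}

	s->slab = slab;
	s->objs[s->nr++] = obj;
	if (s->nr == SLOT_OBJS)		/* slot full: return the whole batch */
		flush_slot(s);
}

int main(void)
{
	struct free_cache c = { 0 };
	long objs[1024];

	/* frees bouncing between two slabs turn into batches, not 1-by-1 pushes */
	for (int i = 0; i < 1024; i++)
		cache_free(&c, (void *)(long)(1 + i % 2), &objs[i]);

	printf("%d frees -> %d batched slab touches\n", 1024, flushes);
	return 0;
}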
* RE: [PATCH 1/3] slub: set a criteria for slub node partial adding 2011-12-14 17:33 ` Eric Dumazet @ 2011-12-14 18:26 ` Christoph Lameter 0 siblings, 0 replies; 39+ messages in thread From: Christoph Lameter @ 2011-12-14 18:26 UTC (permalink / raw) To: Eric Dumazet Cc: Alex,Shi, David Rientjes, penberg@kernel.org, linux-kernel@vger.kernel.org, linux-mm@kvack.org On Wed, 14 Dec 2011, Eric Dumazet wrote: > > Many people have done patchsets like this. > > Things changed a lot recently. There is room for improvements. > > At least we can exchange ideas _before_ coding a new patchset ? Sure but I hope we do not simply rehash what has been said before and recode what others have code in years past. > > There are various permutations > > on SL?B (I dont remember them all SLEB, SLXB, SLQB etc) that have been > > proposed over the years. Caches tend to grow and get rather numerous (see > > SLAB) and the design of SLUB was to counter that. There is a reason it was > > called SLUB. The U stands for Unqueued and was intended to avoid the > > excessive caching problems that I ended up when reworking SLAB for NUMA > > support. > > Current 'one active slab' per cpu is a one level cache. > > It really is a _queue_ containing a fair amount of objects. In some sense you are right. It is a set of objects linked together. That can be called a queue and it has certain cache hot effects. It is not a qeueue in the SLAB sense meaning a configurable array of pointers. > My initial idea would be to use a cache of 4 slots per cpu, but be able > to queue many objects per slot, if they all belong to same slab/page. Nick did that. Please read up on his work. I think it was named SLQB. -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Fight unfair telecom internet charges in Canada: sign http://stopthemeter.ca/ Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a> ^ permalink raw reply [flat|nested] 39+ messages in thread
* RE: [PATCH 1/3] slub: set a criteria for slub node partial adding 2011-12-09 10:10 ` David Rientjes 2011-12-09 13:40 ` Shi, Alex @ 2011-12-13 13:01 ` Shi, Alex 1 sibling, 0 replies; 39+ messages in thread From: Shi, Alex @ 2011-12-13 13:01 UTC (permalink / raw) To: Shi, Alex, David Rientjes Cc: Christoph Lameter, penberg@kernel.org, linux-kernel@vger.kernel.org, linux-mm@kvack.org, Eric Dumazet > > > - if (tail == DEACTIVATE_TO_TAIL) > > > + if (tail == DEACTIVATE_TO_TAIL || > > > + page->inuse > page->objects /2) > > > list_add_tail(&page->lru, &n->partial); > > > else > > > list_add(&page->lru, &n->partial); > > > > > with the statistics patch above? I typically run with CONFIG_SLUB_STATS > > disabled since it impacts performance so heavily and I'm not sure what > > information you're looking for with regards to those stats. > > NO, when you collect data, please close SLUB_STAT in kernel config. _to_head > statistics collection patch just tell you, I collected the statistics not include > add_partial in early_kmem_cache_node_alloc(). And other places of > add_partial were covered. Of course, the kernel with statistic can not be used > to measure performance. David, Did you have time to give a try? -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Fight unfair telecom internet charges in Canada: sign http://stopthemeter.ca/ Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a> ^ permalink raw reply [flat|nested] 39+ messages in thread