From: Pekka Enberg <penberg@cs.helsinki.fi>
To: "Zhang, Yanmin" <yanmin_zhang@linux.intel.com>
Cc: Christoph Lameter <cl@linux-foundation.org>,
Andi Kleen <andi@firstfloor.org>, Matthew Wilcox <matthew@wil.cx>,
Nick Piggin <nickpiggin@yahoo.com.au>,
Andrew Morton <akpm@linux-foundation.org>,
netdev@vger.kernel.org, sfr@canb.auug.org.au,
matthew.r.wilcox@intel.com, chinang.ma@intel.com,
linux-kernel@vger.kernel.org, sharad.c.tripathi@intel.com,
arjan@linux.intel.com, suresh.b.siddha@intel.com,
harita.chilukuri@intel.com, douglas.w.styner@intel.com,
peter.xihong.wang@intel.com, hubert.nueckel@intel.com,
chris.mason@oracle.com, srostedt@redhat.com,
linux-scsi@vger.kernel.org, andrew.vasquez@qlogic.com,
anirban.chakraborty@qlogic.com, mingo@elte.hu
Subject: Re: Mainline kernel OLTP performance update
Date: Fri, 23 Jan 2009 11:46:29 +0200 [thread overview]
Message-ID: <1232703989.6094.29.camel@penberg-laptop> (raw)
In-Reply-To: <1232699401.11429.163.camel@ymzhang>
On Fri, 2009-01-23 at 16:30 +0800, Zhang, Yanmin wrote:
> On Fri, 2009-01-23 at 10:06 +0200, Pekka Enberg wrote:
> > On Fri, 2009-01-23 at 08:52 +0200, Pekka Enberg wrote:
> > > > 1) If I start CPU_NUM clients and servers, SLUB's result is about 2% better than SLQB's;
> > > > 2) If I start 1 clinet and 1 server, and bind them to different physical cpu, SLQB's result
> > > > is about 10% better than SLUB's.
> > > >
> > > > I don't know why there is still 10% difference with item 2). Maybe cachemiss causes it?
> > >
> > > Maybe we can use the perfstat and/or kerneltop utilities of the new perf
> > > counters patch to diagnose this:
> > >
> > > http://lkml.org/lkml/2009/1/21/273
> > >
> > > And do oprofile, of course. Thanks!
> >
> > I assume binding the client and the server to different physical CPUs
> > also means that the SKB is always allocated on CPU 1 and freed on CPU
> > 2? If so, we will be taking the __slab_free() slow path all the time on
> > kfree() which will cause cache effects, no doubt.
> >
> > But there's another potential performance hit we're taking because the
> > object size of the cache is so big. As allocations from CPU 1 keep
> > coming in, we need to allocate new pages and unfreeze the per-cpu page.
> > That in turn causes __slab_free() to be more eager to discard the slab
> > (see the PageSlubFrozen check there).
> >
> > So before going for cache profiling, I'd really like to see an oprofile
> > report. I suspect we're still going to see much more page allocator
> > activity
> Theoretically, it should, but oprofile doesn't show that.
That's bit surprising, actually. FWIW, I've included a patch for empty
slab lists. But it's probably not going to help here.
> > there than with SLAB or SLQB which is why we're still behaving
> > so badly here.
>
> oprofile output with 2.6.29-rc2-slubrevertlarge:
> CPU: Core 2, speed 2666.71 MHz (estimated)
> Counted CPU_CLK_UNHALTED events (Clock cycles when not halted) with a unit mask of 0x00 (Unhalted core cycles) count 100000
> samples % app name symbol name
> 132779 32.9951 vmlinux copy_user_generic_string
> 25334 6.2954 vmlinux schedule
> 21032 5.2264 vmlinux tg_shares_up
> 17175 4.2679 vmlinux __skb_recv_datagram
> 9091 2.2591 vmlinux sock_def_readable
> 8934 2.2201 vmlinux mwait_idle
> 8796 2.1858 vmlinux try_to_wake_up
> 6940 1.7246 vmlinux __slab_free
>
> #slaninfo -AD
> Name Objects Alloc Free %Fast
> :0000256 1643 5215544 5214027 94 0
> kmalloc-8192 28 5189576 5189560 0 0
> :0000168 2631 141466 138976 92 28
> :0004096 1452 88697 87269 99 96
> :0000192 3402 63050 59732 89 11
> :0000064 6265 46611 40721 98 82
> :0000128 1895 30429 28654 93 32
Looking at __slab_free(), unless page->inuse is constantly zero and we
discard the slab, it really is just cache effects (10% sounds like a
lot, though!). AFAICT, the only way to optimize that is with Christoph's
unfinished pointer freelists patches or with a remote free list like in
SLQB.
Pekka
diff --git a/include/linux/slub_def.h b/include/linux/slub_def.h
index 3bd3662..41a4c1a 100644
--- a/include/linux/slub_def.h
+++ b/include/linux/slub_def.h
@@ -48,6 +48,9 @@ struct kmem_cache_node {
unsigned long nr_partial;
unsigned long min_partial;
struct list_head partial;
+ unsigned long nr_empty;
+ unsigned long max_empty;
+ struct list_head empty;
#ifdef CONFIG_SLUB_DEBUG
atomic_long_t nr_slabs;
atomic_long_t total_objects;
diff --git a/mm/slub.c b/mm/slub.c
index 8fad23f..5a12597 100644
--- a/mm/slub.c
+++ b/mm/slub.c
@@ -134,6 +134,11 @@
*/
#define MAX_PARTIAL 10
+/*
+ * Maximum number of empty slabs.
+ */
+#define MAX_EMPTY 1
+
#define DEBUG_DEFAULT_FLAGS (SLAB_DEBUG_FREE | SLAB_RED_ZONE | \
SLAB_POISON | SLAB_STORE_USER)
@@ -1205,6 +1210,24 @@ static void discard_slab(struct kmem_cache *s, struct page *page)
free_slab(s, page);
}
+static void discard_or_cache_slab(struct kmem_cache *s, struct page *page)
+{
+ struct kmem_cache_node *n;
+ int node;
+
+ node = page_to_nid(page);
+ n = get_node(s, node);
+
+ dec_slabs_node(s, node, page->objects);
+
+ if (likely(n->nr_empty >= n->max_empty)) {
+ free_slab(s, page);
+ } else {
+ n->nr_empty++;
+ list_add(&page->lru, &n->partial);
+ }
+}
+
/*
* Per slab locking using the pagelock
*/
@@ -1252,7 +1275,7 @@ static void remove_partial(struct kmem_cache *s, struct page *page)
}
/*
- * Lock slab and remove from the partial list.
+ * Lock slab and remove from the partial or empty list.
*
* Must hold list_lock.
*/
@@ -1261,7 +1284,6 @@ static inline int lock_and_freeze_slab(struct kmem_cache_node *n,
{
if (slab_trylock(page)) {
list_del(&page->lru);
- n->nr_partial--;
__SetPageSlubFrozen(page);
return 1;
}
@@ -1271,7 +1293,7 @@ static inline int lock_and_freeze_slab(struct kmem_cache_node *n,
/*
* Try to allocate a partial slab from a specific node.
*/
-static struct page *get_partial_node(struct kmem_cache_node *n)
+static struct page *get_partial_or_empty_node(struct kmem_cache_node *n)
{
struct page *page;
@@ -1281,13 +1303,22 @@ static struct page *get_partial_node(struct kmem_cache_node *n)
* partial slab and there is none available then get_partials()
* will return NULL.
*/
- if (!n || !n->nr_partial)
+ if (!n || (!n->nr_partial && !n->nr_empty))
return NULL;
spin_lock(&n->list_lock);
+
list_for_each_entry(page, &n->partial, lru)
- if (lock_and_freeze_slab(n, page))
+ if (lock_and_freeze_slab(n, page)) {
+ n->nr_partial--;
+ goto out;
+ }
+
+ list_for_each_entry(page, &n->empty, lru)
+ if (lock_and_freeze_slab(n, page)) {
+ n->nr_empty--;
goto out;
+ }
page = NULL;
out:
spin_unlock(&n->list_lock);
@@ -1297,7 +1328,7 @@ out:
/*
* Get a page from somewhere. Search in increasing NUMA distances.
*/
-static struct page *get_any_partial(struct kmem_cache *s, gfp_t flags)
+static struct page *get_any_partial_or_empty(struct kmem_cache *s, gfp_t flags)
{
#ifdef CONFIG_NUMA
struct zonelist *zonelist;
@@ -1336,7 +1367,7 @@ static struct page *get_any_partial(struct kmem_cache *s, gfp_t flags)
if (n && cpuset_zone_allowed_hardwall(zone, flags) &&
n->nr_partial > n->min_partial) {
- page = get_partial_node(n);
+ page = get_partial_or_empty_node(n);
if (page)
return page;
}
@@ -1346,18 +1377,19 @@ static struct page *get_any_partial(struct kmem_cache *s, gfp_t flags)
}
/*
- * Get a partial page, lock it and return it.
+ * Get a partial or empty page, lock it and return it.
*/
-static struct page *get_partial(struct kmem_cache *s, gfp_t flags, int node)
+static struct page *
+get_partial_or_empty(struct kmem_cache *s, gfp_t flags, int node)
{
struct page *page;
int searchnode = (node == -1) ? numa_node_id() : node;
- page = get_partial_node(get_node(s, searchnode));
+ page = get_partial_or_empty_node(get_node(s, searchnode));
if (page || (flags & __GFP_THISNODE))
return page;
- return get_any_partial(s, flags);
+ return get_any_partial_or_empty(s, flags);
}
/*
@@ -1403,7 +1435,7 @@ static void unfreeze_slab(struct kmem_cache *s, struct page *page, int tail)
} else {
slab_unlock(page);
stat(get_cpu_slab(s, raw_smp_processor_id()), FREE_SLAB);
- discard_slab(s, page);
+ discard_or_cache_slab(s, page);
}
}
}
@@ -1542,7 +1574,7 @@ another_slab:
deactivate_slab(s, c);
new_slab:
- new = get_partial(s, gfpflags, node);
+ new = get_partial_or_empty(s, gfpflags, node);
if (new) {
c->page = new;
stat(c, ALLOC_FROM_PARTIAL);
@@ -1693,7 +1725,7 @@ slab_empty:
}
slab_unlock(page);
stat(c, FREE_SLAB);
- discard_slab(s, page);
+ discard_or_cache_slab(s, page);
return;
debug:
@@ -1927,6 +1959,8 @@ static void init_kmem_cache_cpu(struct kmem_cache *s,
static void
init_kmem_cache_node(struct kmem_cache_node *n, struct kmem_cache *s)
{
+ spin_lock_init(&n->list_lock);
+
n->nr_partial = 0;
/*
@@ -1939,8 +1973,18 @@ init_kmem_cache_node(struct kmem_cache_node *n, struct kmem_cache *s)
else if (n->min_partial > MAX_PARTIAL)
n->min_partial = MAX_PARTIAL;
- spin_lock_init(&n->list_lock);
INIT_LIST_HEAD(&n->partial);
+
+ n->nr_empty = 0;
+ /*
+ * XXX: This needs to take object size into account. We don't need
+ * empty slabs for caches which will have plenty of partial slabs
+ * available. Only caches that have either full or empty slabs need
+ * this kind of optimization.
+ */
+ n->max_empty = MAX_EMPTY;
+ INIT_LIST_HEAD(&n->empty);
+
#ifdef CONFIG_SLUB_DEBUG
atomic_long_set(&n->nr_slabs, 0);
atomic_long_set(&n->total_objects, 0);
@@ -2427,6 +2471,32 @@ static void free_partial(struct kmem_cache *s, struct kmem_cache_node *n)
spin_unlock_irqrestore(&n->list_lock, flags);
}
+static void free_empty_slabs(struct kmem_cache *s)
+{
+ int node;
+
+ for_each_node_state(node, N_NORMAL_MEMORY) {
+ struct kmem_cache_node *n;
+ struct page *page, *t;
+ unsigned long flags;
+
+ n = get_node(s, node);
+
+ if (!n->nr_empty)
+ continue;
+
+ spin_lock_irqsave(&n->list_lock, flags);
+
+ list_for_each_entry_safe(page, t, &n->empty, lru) {
+ list_del(&page->lru);
+ n->nr_empty--;
+
+ free_slab(s, page);
+ }
+ spin_unlock_irqrestore(&n->list_lock, flags);
+ }
+}
+
/*
* Release all resources used by a slab cache.
*/
@@ -2436,6 +2506,8 @@ static inline int kmem_cache_close(struct kmem_cache *s)
flush_all(s);
+ free_empty_slabs(s);
+
/* Attempt to free all objects */
free_kmem_cache_cpus(s);
for_each_node_state(node, N_NORMAL_MEMORY) {
@@ -2765,6 +2837,7 @@ int kmem_cache_shrink(struct kmem_cache *s)
return -ENOMEM;
flush_all(s);
+ free_empty_slabs(s);
for_each_node_state(node, N_NORMAL_MEMORY) {
n = get_node(s, node);
next prev parent reply other threads:[~2009-01-23 9:46 UTC|newest]
Thread overview: 95+ messages / expand[flat|nested] mbox.gz Atom feed top
[not found] <BC02C49EEB98354DBA7F5DD76F2A9E800317003CB0@azsmsx501.amr.corp.intel.com>
[not found] ` <CAC4B8726E86A142B27A9E9A2F2F12479D3E945F@rrsmsx505.amr.corp.intel.com>
2009-01-15 0:35 ` Mainline kernel OLTP performance update Andrew Morton
2009-01-15 1:21 ` Matthew Wilcox
2009-01-15 2:04 ` Andrew Morton
2009-01-15 2:27 ` Steven Rostedt
2009-01-15 7:11 ` Ma, Chinang
2009-01-19 18:04 ` Chris Mason
2009-01-19 18:37 ` Steven Rostedt
[not found] ` <1232390259.25783.5.camel@localhost.localdomain>
2009-01-19 18:55 ` Chris Mason
2009-01-19 19:07 ` Steven Rostedt
2009-01-19 23:40 ` Ingo Molnar
2009-01-15 2:39 ` Andi Kleen
2009-01-15 2:47 ` Matthew Wilcox
2009-01-15 3:36 ` Andi Kleen
2009-01-20 13:27 ` Jens Axboe
[not found] ` <588992150B702C48B3312184F1B810AD03A497632C@azsmsx501.amr.corp.intel.com>
2009-01-22 11:29 ` Jens Axboe
[not found] ` <588992150B702C48B3312184F1B810AD03A4F59632@azsmsx501.amr.corp.intel.com>
2009-01-27 8:28 ` Jens Axboe
2009-01-15 7:24 ` Nick Piggin
2009-01-15 9:46 ` Pekka Enberg
2009-01-15 13:52 ` Matthew Wilcox
2009-01-15 14:42 ` Pekka Enberg
2009-01-16 10:16 ` Pekka Enberg
2009-01-16 10:21 ` Nick Piggin
2009-01-16 10:31 ` Pekka Enberg
2009-01-16 10:42 ` Nick Piggin
2009-01-16 10:55 ` Pekka Enberg
2009-01-19 7:13 ` Nick Piggin
2009-01-19 8:05 ` Pekka Enberg
2009-01-19 8:33 ` Nick Piggin
2009-01-19 8:42 ` Nick Piggin
2009-01-19 8:47 ` Pekka Enberg
2009-01-19 8:57 ` Nick Piggin
2009-01-19 9:48 ` Pekka Enberg
2009-01-19 10:03 ` Nick Piggin
2009-01-16 20:59 ` Christoph Lameter
2009-01-16 0:27 ` Andrew Morton
2009-01-16 4:03 ` Nick Piggin
2009-01-16 4:12 ` Andrew Morton
2009-01-16 6:46 ` Nick Piggin
2009-01-16 6:55 ` Matthew Wilcox
2009-01-16 7:06 ` Nick Piggin
2009-01-16 7:53 ` Zhang, Yanmin
2009-01-16 10:20 ` Andi Kleen
2009-01-20 5:16 ` Zhang, Yanmin
2009-01-21 23:58 ` Christoph Lameter
2009-01-22 8:36 ` Zhang, Yanmin
2009-01-22 9:15 ` Pekka Enberg
2009-01-22 9:28 ` Zhang, Yanmin
2009-01-22 9:47 ` Pekka Enberg
2009-01-23 3:02 ` Zhang, Yanmin
2009-01-23 6:52 ` Pekka Enberg
2009-01-23 8:06 ` Pekka Enberg
2009-01-23 8:30 ` Zhang, Yanmin
2009-01-23 8:40 ` Pekka Enberg
2009-01-23 9:46 ` Pekka Enberg [this message]
2009-01-23 15:22 ` Christoph Lameter
2009-01-23 15:31 ` Pekka Enberg
2009-01-23 15:55 ` Christoph Lameter
2009-01-23 16:01 ` Pekka Enberg
2009-01-24 2:55 ` Zhang, Yanmin
2009-01-24 7:36 ` Pekka Enberg
2009-02-12 5:22 ` Zhang, Yanmin
2009-02-12 5:47 ` Zhang, Yanmin
2009-02-12 15:25 ` Christoph Lameter
2009-02-12 16:07 ` Pekka Enberg
2009-02-12 16:03 ` Pekka Enberg
2009-01-26 17:36 ` Christoph Lameter
2009-02-01 2:52 ` Zhang, Yanmin
2009-01-23 8:33 ` Nick Piggin
2009-01-23 9:02 ` Zhang, Yanmin
2009-01-23 18:40 ` care and feeding of netperf (Re: Mainline kernel OLTP performance update) Rick Jones
2009-01-23 18:51 ` Grant Grundler
2009-01-24 3:03 ` Zhang, Yanmin
2009-01-26 18:26 ` Rick Jones
2009-01-16 7:00 ` Mainline kernel OLTP performance update Andrew Morton
2009-01-16 7:25 ` Nick Piggin
2009-01-16 8:59 ` Nick Piggin
2009-01-16 18:11 ` Rick Jones
2009-01-19 7:43 ` Nick Piggin
2009-01-19 22:19 ` Rick Jones
2009-01-15 14:12 ` James Bottomley
2009-01-15 17:44 ` Andrew Morton
2009-01-15 18:00 ` Matthew Wilcox
2009-01-15 18:14 ` Steven Rostedt
2009-01-15 18:44 ` Gregory Haskins
2009-01-15 18:46 ` Wilcox, Matthew R
2009-01-15 19:44 ` Ma, Chinang
2009-01-16 18:14 ` Gregory Haskins
2009-01-16 19:09 ` Steven Rostedt
2009-01-20 12:45 ` Gregory Haskins
2009-01-15 19:28 ` Ma, Chinang
2009-01-15 16:48 ` Ma, Chinang
[not found] <D7C42C27E6CB1E4D8CBDF2F81EA92A260345836A3A@azsmsx501.amr.corp.intel.com>
2009-04-27 7:02 ` Andi Kleen
2009-04-28 16:57 ` Chuck Ebbert
2009-04-28 17:15 ` James Bottomley
2009-04-28 17:17 ` Styner, Douglas W
Reply instructions:
You may reply publicly to this message via plain-text email
using any one of the following methods:
* Save the following mbox file, import it into your mail client,
and reply-to-all from there: mbox
Avoid top-posting and favor interleaved quoting:
https://en.wikipedia.org/wiki/Posting_style#Interleaved_style
* Reply using the --to, --cc, and --in-reply-to
switches of git-send-email(1):
git send-email \
--in-reply-to=1232703989.6094.29.camel@penberg-laptop \
--to=penberg@cs.helsinki.fi \
--cc=akpm@linux-foundation.org \
--cc=andi@firstfloor.org \
--cc=andrew.vasquez@qlogic.com \
--cc=anirban.chakraborty@qlogic.com \
--cc=arjan@linux.intel.com \
--cc=chinang.ma@intel.com \
--cc=chris.mason@oracle.com \
--cc=cl@linux-foundation.org \
--cc=douglas.w.styner@intel.com \
--cc=harita.chilukuri@intel.com \
--cc=hubert.nueckel@intel.com \
--cc=linux-kernel@vger.kernel.org \
--cc=linux-scsi@vger.kernel.org \
--cc=matthew.r.wilcox@intel.com \
--cc=matthew@wil.cx \
--cc=mingo@elte.hu \
--cc=netdev@vger.kernel.org \
--cc=nickpiggin@yahoo.com.au \
--cc=peter.xihong.wang@intel.com \
--cc=sfr@canb.auug.org.au \
--cc=sharad.c.tripathi@intel.com \
--cc=srostedt@redhat.com \
--cc=suresh.b.siddha@intel.com \
--cc=yanmin_zhang@linux.intel.com \
/path/to/YOUR_REPLY
https://kernel.org/pub/software/scm/git/docs/git-send-email.html
* If your mail client supports setting the In-Reply-To header
via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line
before the message body.
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox