Linux SCSI subsystem development
From: Pekka Enberg <penberg@cs.helsinki.fi>
To: "Zhang, Yanmin" <yanmin_zhang@linux.intel.com>
Cc: Christoph Lameter <cl@linux-foundation.org>,
	Andi Kleen <andi@firstfloor.org>, Matthew Wilcox <matthew@wil.cx>,
	Nick Piggin <nickpiggin@yahoo.com.au>,
	Andrew Morton <akpm@linux-foundation.org>,
	netdev@vger.kernel.org, sfr@canb.auug.org.au,
	matthew.r.wilcox@intel.com, chinang.ma@intel.com,
	linux-kernel@vger.kernel.org, sharad.c.tripathi@intel.com,
	arjan@linux.intel.com, suresh.b.siddha@intel.com,
	harita.chilukuri@intel.com, douglas.w.styner@intel.com,
	peter.xihong.wang@intel.com, hubert.nueckel@intel.com,
	chris.mason@oracle.com, srostedt@redhat.com,
	linux-scsi@vger.kernel.org, andrew.vasquez@qlogic.com,
	anirban.chakraborty@qlogic.com, mingo@elte.hu
Subject: Re: Mainline kernel OLTP performance update
Date: Fri, 23 Jan 2009 11:46:29 +0200
Message-ID: <1232703989.6094.29.camel@penberg-laptop>
In-Reply-To: <1232699401.11429.163.camel@ymzhang>

On Fri, 2009-01-23 at 16:30 +0800, Zhang, Yanmin wrote:
> On Fri, 2009-01-23 at 10:06 +0200, Pekka Enberg wrote:
> > On Fri, 2009-01-23 at 08:52 +0200, Pekka Enberg wrote:
> > > > 1) If I start CPU_NUM clients and servers, SLUB's result is about 2% better than SLQB's;
> > > > 2) If I start 1 client and 1 server, and bind them to different physical CPUs, SLQB's result
> > > > is about 10% better than SLUB's.
> > > > 
> > > > I don't know why there is still a 10% difference with item 2). Maybe cache misses cause it?
> > > 
> > > Maybe we can use the perfstat and/or kerneltop utilities of the new perf 
> > > counters patch to diagnose this:
> > > 
> > > http://lkml.org/lkml/2009/1/21/273
> > > 
> > > And do oprofile, of course. Thanks!
> > 
> > I assume binding the client and the server to different physical CPUs
> > also means that the SKB is always allocated on CPU 1 and freed on CPU
> > 2? If so, we will be taking the __slab_free() slow path all the time on
> > kfree() which will cause cache effects, no doubt.
> > 
> > But there's another potential performance hit we're taking because the
> > object size of the cache is so big. As allocations from CPU 1 keep
> > coming in, we need to allocate new pages and unfreeze the per-cpu page.
> > That in turn causes __slab_free() to be more eager to discard the slab
> > (see the PageSlubFrozen check there).
> > 
> > So before going for cache profiling, I'd really like to see an oprofile
> > report. I suspect we're still going to see much more page allocator
> > activity
> Theoretically, it should, but oprofile doesn't show that.

That's a bit surprising, actually. FWIW, I've included a patch for empty
slab lists. But it's probably not going to help here.
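
To make the idea behind the patch below easier to follow, here is a
minimal user-space sketch of the policy: when a slab becomes empty, keep
up to MAX_EMPTY of them on a per-node list instead of handing every one
back to the page allocator. Everything prefixed with toy_ is hypothetical
and only models the bookkeeping; malloc()/free() stand in for the page
allocator, and there is no locking at all.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>

#define MAX_EMPTY 1	/* same cap as the patch; the rest is a toy model */

/* Hypothetical stand-in for a slab page. */
struct toy_slab {
	struct toy_slab *next;
};

/* Hypothetical stand-in for the per-node empty-slab bookkeeping. */
struct toy_node {
	unsigned long nr_empty;
	unsigned long max_empty;
	struct toy_slab *empty;		/* cached empty slabs */
	unsigned long page_allocs;	/* trips to the "page allocator" */
	unsigned long page_frees;	/* slabs handed back to it */
};

static void toy_node_init(struct toy_node *n)
{
	n->nr_empty = 0;
	n->max_empty = MAX_EMPTY;
	n->empty = NULL;
	n->page_allocs = 0;
	n->page_frees = 0;
}

/* Called when the last object of a slab has been freed. */
static void toy_discard_or_cache(struct toy_node *n, struct toy_slab *slab)
{
	assert(slab != NULL);
	if (n->nr_empty >= n->max_empty) {
		free(slab);
		n->page_frees++;
	} else {
		slab->next = n->empty;
		n->empty = slab;
		n->nr_empty++;
	}
}

/* Called when a fresh slab is needed: try the empty list first. */
static struct toy_slab *toy_get_slab(struct toy_node *n)
{
	struct toy_slab *slab = n->empty;

	if (slab) {
		n->empty = slab->next;
		n->nr_empty--;
		return slab;
	}
	slab = malloc(sizeof(*slab));
	if (slab)
		n->page_allocs++;
	return slab;
}

/* Release every cached slab, e.g. on cache shrink or destroy. */
static void toy_free_empty(struct toy_node *n)
{
	while (n->empty) {
		struct toy_slab *slab = n->empty;

		n->empty = slab->next;
		n->nr_empty--;
		free(slab);
		n->page_frees++;
	}
}
```

In this model, a loop that repeatedly calls toy_get_slab() followed by
toy_discard_or_cache() on the same node goes to malloc() only once; with
max_empty set to 0 it would go there on every iteration.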

> >  there than with SLAB or SLQB which is why we're still behaving
> > so badly here.
> 
> oprofile output with 2.6.29-rc2-slubrevertlarge:
> CPU: Core 2, speed 2666.71 MHz (estimated)
> Counted CPU_CLK_UNHALTED events (Clock cycles when not halted) with a unit mask of 0x00 (Unhalted core cycles) count 100000
> samples  %        app name                 symbol name
> 132779   32.9951  vmlinux                  copy_user_generic_string
> 25334     6.2954  vmlinux                  schedule
> 21032     5.2264  vmlinux                  tg_shares_up
> 17175     4.2679  vmlinux                  __skb_recv_datagram
> 9091      2.2591  vmlinux                  sock_def_readable
> 8934      2.2201  vmlinux                  mwait_idle
> 8796      2.1858  vmlinux                  try_to_wake_up
> 6940      1.7246  vmlinux                  __slab_free
> 
> #slabinfo -AD
> Name                   Objects    Alloc     Free   %Fast
> :0000256                  1643  5215544  5214027  94   0 
> kmalloc-8192                28  5189576  5189560   0   0 
> :0000168                  2631   141466   138976  92  28 
> :0004096                  1452    88697    87269  99  96 
> :0000192                  3402    63050    59732  89  11 
> :0000064                  6265    46611    40721  98  82 
> :0000128                  1895    30429    28654  93  32 

Looking at __slab_free(), unless page->inuse is constantly zero and we
discard the slab, it really is just cache effects (10% sounds like a
lot, though!). AFAICT, the only way to optimize that is with Christoph's
unfinished pointer freelists patches or with a remote free list like in
SLQB.
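
As a rough illustration of why a remote free list can help, here is a
toy, single-threaded model of the general idea: the freeing CPU queues
objects it does not own on a private list and hands them to the owner in
one splice once a batch is full, so the owner's state is written once
per batch rather than once per object. This is not SLQB's code; all
toy_ names and the TOY_BATCH threshold are made up for the example.

```c
#include <assert.h>
#include <stddef.h>

#define TOY_BATCH 16	/* hypothetical flush threshold, not from SLQB */

struct toy_obj {
	struct toy_obj *next;
};

/* Owner-side state: objects handed back by other CPUs. */
struct toy_owner {
	struct toy_obj *returned;
	unsigned long nr_returned;
	unsigned long shared_touches;	/* remote writes to owner state */
};

/* Freeing-side state: a private batch of objects owned elsewhere. */
struct toy_remote {
	struct toy_obj *head;
	struct toy_obj *tail;
	unsigned long nr;
};

/* Direct remote free: every free writes to the owner's state. */
static void toy_free_direct(struct toy_owner *o, struct toy_obj *obj)
{
	assert(obj != NULL);
	obj->next = o->returned;
	o->returned = obj;
	o->nr_returned++;
	o->shared_touches++;
}

/* Splice the whole private batch onto the owner's list in one step. */
static void toy_flush(struct toy_owner *o, struct toy_remote *r)
{
	if (!r->nr)
		return;
	r->tail->next = o->returned;
	o->returned = r->head;
	o->nr_returned += r->nr;
	o->shared_touches++;
	r->head = NULL;
	r->tail = NULL;
	r->nr = 0;
}

/* Batched remote free: queue locally, touch the owner once per batch. */
static void toy_free_batched(struct toy_owner *o, struct toy_remote *r,
			     struct toy_obj *obj)
{
	assert(obj != NULL);
	obj->next = r->head;
	if (!r->head)
		r->tail = obj;
	r->head = obj;
	if (++r->nr >= TOY_BATCH)
		toy_flush(o, r);
}
```

Freeing 64 objects through toy_free_direct() writes to the owner's state
64 times, while toy_free_batched() with a batch of 16 does so 4 times.
The model only counts writes; it says nothing about the actual cache
behaviour on real hardware.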

		Pekka

diff --git a/include/linux/slub_def.h b/include/linux/slub_def.h
index 3bd3662..41a4c1a 100644
--- a/include/linux/slub_def.h
+++ b/include/linux/slub_def.h
@@ -48,6 +48,9 @@ struct kmem_cache_node {
 	unsigned long nr_partial;
 	unsigned long min_partial;
 	struct list_head partial;
+	unsigned long nr_empty;
+	unsigned long max_empty;
+	struct list_head empty;
 #ifdef CONFIG_SLUB_DEBUG
 	atomic_long_t nr_slabs;
 	atomic_long_t total_objects;
diff --git a/mm/slub.c b/mm/slub.c
index 8fad23f..5a12597 100644
--- a/mm/slub.c
+++ b/mm/slub.c
@@ -134,6 +134,11 @@
  */
 #define MAX_PARTIAL 10
 
+/*
+ * Maximum number of empty slabs.
+ */
+#define MAX_EMPTY 1
+
 #define DEBUG_DEFAULT_FLAGS (SLAB_DEBUG_FREE | SLAB_RED_ZONE | \
 				SLAB_POISON | SLAB_STORE_USER)
 
@@ -1205,6 +1210,24 @@ static void discard_slab(struct kmem_cache *s, struct page *page)
 	free_slab(s, page);
 }
 
+static void discard_or_cache_slab(struct kmem_cache *s, struct page *page)
+{
+	struct kmem_cache_node *n;
+	int node;
+
+	node = page_to_nid(page);
+	n = get_node(s, node);
+
+	dec_slabs_node(s, node, page->objects);
+
+	if (likely(n->nr_empty >= n->max_empty)) {
+		free_slab(s, page);
+	} else {
+		n->nr_empty++;
+		list_add(&page->lru, &n->empty);
+	}
+}
+
 /*
  * Per slab locking using the pagelock
  */
@@ -1252,7 +1275,7 @@ static void remove_partial(struct kmem_cache *s, struct page *page)
 }
 
 /*
- * Lock slab and remove from the partial list.
+ * Lock slab and remove from the partial or empty list.
  *
  * Must hold list_lock.
  */
@@ -1261,7 +1284,6 @@ static inline int lock_and_freeze_slab(struct kmem_cache_node *n,
 {
 	if (slab_trylock(page)) {
 		list_del(&page->lru);
-		n->nr_partial--;
 		__SetPageSlubFrozen(page);
 		return 1;
 	}
@@ -1271,7 +1293,7 @@ static inline int lock_and_freeze_slab(struct kmem_cache_node *n,
 /*
  * Try to allocate a partial slab from a specific node.
  */
-static struct page *get_partial_node(struct kmem_cache_node *n)
+static struct page *get_partial_or_empty_node(struct kmem_cache_node *n)
 {
 	struct page *page;
 
@@ -1281,13 +1303,22 @@ static struct page *get_partial_node(struct kmem_cache_node *n)
 	 * partial slab and there is none available then get_partials()
 	 * will return NULL.
 	 */
-	if (!n || !n->nr_partial)
+	if (!n || (!n->nr_partial && !n->nr_empty))
 		return NULL;
 
 	spin_lock(&n->list_lock);
+
 	list_for_each_entry(page, &n->partial, lru)
-		if (lock_and_freeze_slab(n, page))
+		if (lock_and_freeze_slab(n, page)) {
+			n->nr_partial--;
+			goto out;
+		}
+
+	list_for_each_entry(page, &n->empty, lru)
+		if (lock_and_freeze_slab(n, page)) {
+			n->nr_empty--;
 			goto out;
+		}
 	page = NULL;
 out:
 	spin_unlock(&n->list_lock);
@@ -1297,7 +1328,7 @@ out:
 /*
  * Get a page from somewhere. Search in increasing NUMA distances.
  */
-static struct page *get_any_partial(struct kmem_cache *s, gfp_t flags)
+static struct page *get_any_partial_or_empty(struct kmem_cache *s, gfp_t flags)
 {
 #ifdef CONFIG_NUMA
 	struct zonelist *zonelist;
@@ -1336,7 +1367,7 @@ static struct page *get_any_partial(struct kmem_cache *s, gfp_t flags)
 
 		if (n && cpuset_zone_allowed_hardwall(zone, flags) &&
 				n->nr_partial > n->min_partial) {
-			page = get_partial_node(n);
+			page = get_partial_or_empty_node(n);
 			if (page)
 				return page;
 		}
@@ -1346,18 +1377,19 @@ static struct page *get_any_partial(struct kmem_cache *s, gfp_t flags)
 }
 
 /*
- * Get a partial page, lock it and return it.
+ * Get a partial or empty page, lock it and return it.
  */
-static struct page *get_partial(struct kmem_cache *s, gfp_t flags, int node)
+static struct page *
+get_partial_or_empty(struct kmem_cache *s, gfp_t flags, int node)
 {
 	struct page *page;
 	int searchnode = (node == -1) ? numa_node_id() : node;
 
-	page = get_partial_node(get_node(s, searchnode));
+	page = get_partial_or_empty_node(get_node(s, searchnode));
 	if (page || (flags & __GFP_THISNODE))
 		return page;
 
-	return get_any_partial(s, flags);
+	return get_any_partial_or_empty(s, flags);
 }
 
 /*
@@ -1403,7 +1435,7 @@ static void unfreeze_slab(struct kmem_cache *s, struct page *page, int tail)
 		} else {
 			slab_unlock(page);
 			stat(get_cpu_slab(s, raw_smp_processor_id()), FREE_SLAB);
-			discard_slab(s, page);
+			discard_or_cache_slab(s, page);
 		}
 	}
 }
@@ -1542,7 +1574,7 @@ another_slab:
 	deactivate_slab(s, c);
 
 new_slab:
-	new = get_partial(s, gfpflags, node);
+	new = get_partial_or_empty(s, gfpflags, node);
 	if (new) {
 		c->page = new;
 		stat(c, ALLOC_FROM_PARTIAL);
@@ -1693,7 +1725,7 @@ slab_empty:
 	}
 	slab_unlock(page);
 	stat(c, FREE_SLAB);
-	discard_slab(s, page);
+	discard_or_cache_slab(s, page);
 	return;
 
 debug:
@@ -1927,6 +1959,8 @@ static void init_kmem_cache_cpu(struct kmem_cache *s,
 static void
 init_kmem_cache_node(struct kmem_cache_node *n, struct kmem_cache *s)
 {
+	spin_lock_init(&n->list_lock);
+
 	n->nr_partial = 0;
 
 	/*
@@ -1939,8 +1973,18 @@ init_kmem_cache_node(struct kmem_cache_node *n, struct kmem_cache *s)
 	else if (n->min_partial > MAX_PARTIAL)
 		n->min_partial = MAX_PARTIAL;
 
-	spin_lock_init(&n->list_lock);
 	INIT_LIST_HEAD(&n->partial);
+
+	n->nr_empty = 0;
+	/*
+	 * XXX: This needs to take object size into account. We don't need
+	 * empty slabs for caches which will have plenty of partial slabs
+	 * available. Only caches that have either full or empty slabs need
+	 * this kind of optimization.
+	 */
+	n->max_empty = MAX_EMPTY;
+	INIT_LIST_HEAD(&n->empty);
+
 #ifdef CONFIG_SLUB_DEBUG
 	atomic_long_set(&n->nr_slabs, 0);
 	atomic_long_set(&n->total_objects, 0);
@@ -2427,6 +2471,32 @@ static void free_partial(struct kmem_cache *s, struct kmem_cache_node *n)
 	spin_unlock_irqrestore(&n->list_lock, flags);
 }
 
+static void free_empty_slabs(struct kmem_cache *s)
+{
+	int node;
+
+	for_each_node_state(node, N_NORMAL_MEMORY) {
+		struct kmem_cache_node *n;
+		struct page *page, *t;
+		unsigned long flags;
+
+		n = get_node(s, node);
+
+		if (!n->nr_empty)
+			continue;
+
+		spin_lock_irqsave(&n->list_lock, flags);
+
+		list_for_each_entry_safe(page, t, &n->empty, lru) {
+			list_del(&page->lru);
+			n->nr_empty--;
+
+			free_slab(s, page);
+		}
+		spin_unlock_irqrestore(&n->list_lock, flags);
+	}
+}
+
 /*
  * Release all resources used by a slab cache.
  */
@@ -2436,6 +2506,8 @@ static inline int kmem_cache_close(struct kmem_cache *s)
 
 	flush_all(s);
 
+	free_empty_slabs(s);
+
 	/* Attempt to free all objects */
 	free_kmem_cache_cpus(s);
 	for_each_node_state(node, N_NORMAL_MEMORY) {
@@ -2765,6 +2837,7 @@ int kmem_cache_shrink(struct kmem_cache *s)
 		return -ENOMEM;
 
 	flush_all(s);
+	free_empty_slabs(s);
 	for_each_node_state(node, N_NORMAL_MEMORY) {
 		n = get_node(s, node);
 
