* [patch 00/19] Slab Fragmentation Reduction V13
@ 2008-05-10 2:21 Christoph Lameter
2008-05-10 2:21 ` [patch 01/19] slub: Add defrag_ratio field and sysfs support Christoph Lameter
` (19 more replies)
0 siblings, 20 replies; 64+ messages in thread
From: Christoph Lameter @ 2008-05-10 2:21 UTC (permalink / raw)
To: Pekka Enberg
Cc: akpm, linux-kernel, linux-fsdevel, Mel Gorman, andi, Rik van Riel,
mpm, Dave Chinner
V12->v13:
- Rebase onto Linux 2.6.27-rc1 (deal with page flags conversion, ctor parameters etc)
- Fix uninitialized variable issue
Slab fragmentation is mainly an issue if Linux is used as a fileserver
and large numbers of dentries, inodes and buffer heads accumulate. In some
load situations the slabs become very sparsely populated so that a lot of
memory is wasted by slabs that contain only one or a few objects. In
extreme cases the performance of a machine becomes sluggish since
we are continually running reclaim without much success.
Slab defragmentation adds the capability to recover the memory that
is wasted.
Memory reclaim for the following slab caches is possible:
1. dentry cache
2. inode cache (with a generic interface to allow easy setup of more
filesystems than the currently supported ext2/3/4 reiserfs, XFS
and proc)
3. buffer_heads
One typical mechanism that triggers slab defragmentation on my systems
is the daily run of
updatedb
Updatedb scans all files on the system, which causes heavy inode and dentry
use. After updatedb completes we go back to the regular use
patterns (typical on my machine: kernel compiles), which need the memory
for different purposes. The inodes and dentries used by updatedb are
gradually aged by the dentry/inode reclaim algorithm, which frees
dentries and inodes randomly throughout the slabs that were
allocated. As a result the slabs become sparsely populated. If they
become empty they can be freed, but a lot of them remain sparsely
populated. That is where slab defrag comes in: it removes the objects from
the slabs with just a few entries, reclaiming more memory for other uses.
In the simplest case (as provided here) this is done by simply reclaiming
the objects.
However, if the logic in the kick() function is made more
sophisticated then we will be able to move objects out of the slabs.
If a slab cache is fragmented, such allocations can be done without
involving the page allocator because a large number of free slots is
already available in partially populated slabs. Moving an object into
such a slab reduces the fragmentation of the slab it is moved to.
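For instance, a minimal kick() callback for a hypothetical cache (using the
callback signature introduced later in this series) that only implements the
simple reclaim case could look roughly like this; a real callback would go
through the object's own teardown path, which also drops the reference taken
by the companion get() callback:

static void my_kick(struct kmem_cache *s, int nr, void **v, void *private)
{
	int i;

	for (i = 0; i < nr; i++)
		if (v[i])		/* entries voided by get() are NULL */
			kmem_cache_free(s, v[i]);
}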
V11->V12:
- Pekka and I fixed various minor issues pointed out by Andrew.
- Split ext2/3/4 defrag support patches.
- Add more documentation
- Revise the way that slab defrag is triggered from reclaim. No longer
use a timeout but track the amount of slab reclaim done by the shrinkers.
Add a field in /proc/sys/vm/slab_defrag_limit to control the threshold.
- Display current slab_defrag_counters in /proc/zoneinfo (for a zone) and
/proc/sys/vm/slab_defrag_count (for global reclaim).
- Add new config value slab_defrag_limit to /proc/sys/vm/slab_defrag_limit
- Add a patch that obsoletes SLAB and explains why SLOB does not support
defrag (Either of those could be theoretically equipped to support
slab defrag in some way but it seems that Andrew/Linus want to reduce
the number of slab allocators).
V10->V11
- Simplify determination of when to reclaim: Just scan over all partial slabs
and check if they are sparsely populated.
- Add support for performance counters
- Rediff on top of current slab-mm.
- Reduce frequency of scanning. A look at the stats showed that we
were calling into reclaim very frequently when the system was under
memory pressure which slowed things down. Various measures to
avoid scanning the partial list too frequently were added and the
earlier (expensive) method of determining the defrag ratio of the slab
cache as a whole was dropped. I think this addresses the issues that
Mel saw with V10.
V9->V10
- Rediff against upstream
V8->V9
- Rediff against 2.6.24-rc6-mm1
V7->V8
- Rediff against 2.6.24-rc3-mm2
V6->V7
- Rediff against 2.6.24-rc2-mm1
- Remove lumpy reclaim support. No point anymore given that the antifrag
handling in 2.6.24-rc2 puts reclaimable slabs into different sections.
Targeted reclaim never triggers. This has to wait until we make
slabs movable or we need to perform a special version of lumpy reclaim
in SLUB while we scan the partial lists for slabs to kick out.
Removal simplifies handling significantly since we
get to slabs in a more controlled way via the partial lists.
The patchset now provides pure reduction of fragmentation levels.
- SLAB/SLOB: Provide inlines that do nothing
- Fix various smaller issues that were brought up during review of V6.
V5->V6
- Rediff against 2.6.24-rc2 + mm slub patches.
- Add reviewed by lines.
- Take out the experimental code to make slab pages movable. That
has to wait until this has been considered by Mel.
V4->V5:
- Support lumpy reclaim for slabs
- Support reclaim via slab_shrink()
- Add constructors to ensure a consistent object state at all times.
V3->V4:
- Optimize scan for slabs that need defragmentation
- Add /sys/slab/*/defrag_ratio to allow setting defrag limits
per slab.
- Add support for buffer heads.
- Describe how the cleanup after the daily updatedb can be
improved by slab defragmentation.
V2->V3
- Support directory reclaim
- Add infrastructure to trigger defragmentation after slab shrinking if we
have slabs with a high degree of fragmentation.
V1->V2
- Clean up control flow using a state variable. Simplify API. Back to 2
functions that now take arrays of objects.
- Inode defrag support for a set of filesystems
- Fix up dentry defrag support to work on negative dentries by adding
a new dentry flag that indicates that a dentry is not in the process
of being freed or allocated.
--
* [patch 01/19] slub: Add defrag_ratio field and sysfs support.
2008-05-10 2:21 [patch 00/19] Slab Fragmentation Reduction V13 Christoph Lameter
@ 2008-05-10 2:21 ` Christoph Lameter
2008-05-10 2:21 ` [patch 02/19] slub: Replace ctor field with ops field in /sys/slab/* Christoph Lameter
` (18 subsequent siblings)
19 siblings, 0 replies; 64+ messages in thread
From: Christoph Lameter @ 2008-05-10 2:21 UTC (permalink / raw)
To: Pekka Enberg
Cc: akpm, Christoph Lameter, Christoph Lameter, linux-kernel,
linux-fsdevel, Mel Gorman, andi, Rik van Riel, mpm, Dave Chinner
[-- Attachment #1: 0001-SLUB-Add-defrag_ratio-field-and-sysfs-support.patch --]
[-- Type: text/plain, Size: 2560 bytes --]
The defrag_ratio is used to set the threshold at which defragmentation
should be attempted on a slab page.
The allocation ratio is the percentage of the available object slots that
are allocated.
Add a defrag_ratio field and set it to 30% by default. A limit of 30% specifies
that slab defragmentation is only attempted when fewer than 3 out of 10
available object slots are in use.
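As a sketch of what that check amounts to (a hypothetical helper, not part of
this patch, mirroring the comparison done by the defrag core later in the
series):

/* Sketch: true if a slab page is sparse enough to attempt reclaim. */
static inline int slab_below_defrag_ratio(struct kmem_cache *s,
						struct page *page)
{
	return page->inuse * 100 < s->defrag_ratio * page->objects;
}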
Reviewed-by: Rik van Riel <riel@redhat.com>
Signed-off-by: Christoph Lameter <clameter@sgi.com>
Signed-off-by: Pekka Enberg <penberg@cs.helsinki.fi>
Signed-off-by: Christoph Lameter <cl@linux-foundation.org>
---
include/linux/slub_def.h | 7 +++++++
mm/slub.c | 23 +++++++++++++++++++++++
2 files changed, 30 insertions(+)
Index: linux-2.6/include/linux/slub_def.h
===================================================================
--- linux-2.6.orig/include/linux/slub_def.h 2008-07-31 12:20:16.000000000 -0500
+++ linux-2.6/include/linux/slub_def.h 2008-07-31 12:20:17.000000000 -0500
@@ -88,6 +88,13 @@
void (*ctor)(void *);
int inuse; /* Offset to metadata */
int align; /* Alignment */
+ int defrag_ratio; /*
+ * Ratio used to check the percentage of
+ * objects allocated in a slab page.
+ * If less than this ratio is allocated
+ * then reclaim attempts are made.
+ */
+
const char *name; /* Name (only for display!) */
struct list_head list; /* List of slab caches */
#ifdef CONFIG_SLUB_DEBUG
Index: linux-2.6/mm/slub.c
===================================================================
--- linux-2.6.orig/mm/slub.c 2008-07-31 12:20:16.000000000 -0500
+++ linux-2.6/mm/slub.c 2008-07-31 12:20:17.000000000 -0500
@@ -2299,6 +2299,7 @@
goto error;
s->refcount = 1;
+ s->defrag_ratio = 30;
#ifdef CONFIG_NUMA
s->remote_node_defrag_ratio = 100;
#endif
@@ -4031,6 +4032,27 @@
}
SLAB_ATTR_RO(free_calls);
+static ssize_t defrag_ratio_show(struct kmem_cache *s, char *buf)
+{
+ return sprintf(buf, "%d\n", s->defrag_ratio);
+}
+
+static ssize_t defrag_ratio_store(struct kmem_cache *s,
+ const char *buf, size_t length)
+{
+ unsigned long ratio;
+ int err;
+
+ err = strict_strtoul(buf, 10, &ratio);
+ if (err)
+ return err;
+
+ if (ratio < 100)
+ s->defrag_ratio = ratio;
+ return length;
+}
+SLAB_ATTR(defrag_ratio);
+
#ifdef CONFIG_NUMA
static ssize_t remote_node_defrag_ratio_show(struct kmem_cache *s, char *buf)
{
@@ -4138,6 +4160,7 @@
&shrink_attr.attr,
&alloc_calls_attr.attr,
&free_calls_attr.attr,
+ &defrag_ratio_attr.attr,
#ifdef CONFIG_ZONE_DMA
&cache_dma_attr.attr,
#endif
--
* [patch 02/19] slub: Replace ctor field with ops field in /sys/slab/*
2008-05-10 2:21 [patch 00/19] Slab Fragmentation Reduction V13 Christoph Lameter
2008-05-10 2:21 ` [patch 01/19] slub: Add defrag_ratio field and sysfs support Christoph Lameter
@ 2008-05-10 2:21 ` Christoph Lameter
2008-05-10 2:21 ` [patch 03/19] slub: Add get() and kick() methods Christoph Lameter
` (17 subsequent siblings)
19 siblings, 0 replies; 64+ messages in thread
From: Christoph Lameter @ 2008-05-10 2:21 UTC (permalink / raw)
To: Pekka Enberg
Cc: akpm, Christoph Lameter, Christoph Lameter, linux-kernel,
linux-fsdevel, Mel Gorman, andi, Rik van Riel, mpm, Dave Chinner
[-- Attachment #1: 0002-SLUB-Replace-ctor-field-with-ops-field-in-sys-slab.patch --]
[-- Type: text/plain, Size: 1463 bytes --]
Create an ops file, /sys/slab/*/ops, containing all the operations defined
on a slab cache. This will be used to display the additional operations that
will be defined soon.
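For a cache that has a constructor, reading the new file would then produce
output along these lines (cache and symbol names are only illustrative):

# cat /sys/slab/my_cache/ops
ctor : my_cache_ctor+0x0/0x20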
Reviewed-by: Rik van Riel <riel@redhat.com>
Signed-off-by: Christoph Lameter <clameter@sgi.com>
Signed-off-by: Pekka Enberg <penberg@cs.helsinki.fi>
Signed-off-by: Christoph Lameter <cl@linux-foundation.org>
---
mm/slub.c | 16 +++++++++-------
1 file changed, 9 insertions(+), 7 deletions(-)
Index: linux-2.6/mm/slub.c
===================================================================
--- linux-2.6.orig/mm/slub.c 2008-07-31 12:19:28.000000000 -0500
+++ linux-2.6/mm/slub.c 2008-07-31 12:19:51.000000000 -0500
@@ -3803,16 +3803,18 @@
}
SLAB_ATTR(order);
-static ssize_t ctor_show(struct kmem_cache *s, char *buf)
+static ssize_t ops_show(struct kmem_cache *s, char *buf)
{
- if (s->ctor) {
- int n = sprint_symbol(buf, (unsigned long)s->ctor);
+ int x = 0;
- return n + sprintf(buf + n, "\n");
+ if (s->ctor) {
+ x += sprintf(buf + x, "ctor : ");
+ x += sprint_symbol(buf + x, (unsigned long)s->ctor);
+ x += sprintf(buf + x, "\n");
}
- return 0;
+ return x;
}
-SLAB_ATTR_RO(ctor);
+SLAB_ATTR_RO(ops);
static ssize_t aliases_show(struct kmem_cache *s, char *buf)
{
@@ -4145,7 +4147,7 @@
&slabs_attr.attr,
&partial_attr.attr,
&cpu_slabs_attr.attr,
- &ctor_attr.attr,
+ &ops_attr.attr,
&aliases_attr.attr,
&align_attr.attr,
&sanity_checks_attr.attr,
--
^ permalink raw reply [flat|nested] 64+ messages in thread
* [patch 03/19] slub: Add get() and kick() methods
2008-05-10 2:21 [patch 00/19] Slab Fragmentation Reduction V13 Christoph Lameter
2008-05-10 2:21 ` [patch 01/19] slub: Add defrag_ratio field and sysfs support Christoph Lameter
2008-05-10 2:21 ` [patch 02/19] slub: Replace ctor field with ops field in /sys/slab/* Christoph Lameter
@ 2008-05-10 2:21 ` Christoph Lameter
2008-05-10 2:21 ` [patch 04/19] slub: Sort slab cache list and establish maximum objects for defrag slabs Christoph Lameter
` (16 subsequent siblings)
19 siblings, 0 replies; 64+ messages in thread
From: Christoph Lameter @ 2008-05-10 2:21 UTC (permalink / raw)
To: Pekka Enberg
Cc: akpm, Christoph Lameter, Christoph Lameter, linux-kernel,
linux-fsdevel, Mel Gorman, andi, Rik van Riel, mpm, Dave Chinner
[-- Attachment #1: 0003-SLUB-Add-get-and-kick-methods.patch --]
[-- Type: text/plain, Size: 5458 bytes --]
Add the two methods needed for defragmentation and add the display of the
methods via the sysfs ops file.
Add documentation explaining the use of these methods and add the prototypes
to slab.h. Add functions to set up the defrag methods for a slab cache.
Add empty functions for SLAB/SLOB. The API is generic, so it
could theoretically be implemented for either allocator.
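As a usage sketch (all names are hypothetical and merely illustrate the API
added below; a ctor is required because defragmentable caches must keep their
objects in a defined state):

#include <linux/init.h>
#include <linux/slab.h>

struct my_obj {
	atomic_t refcount;
	/* ... payload ... */
};

static void my_obj_ctor(void *obj)
{
	struct my_obj *o = obj;

	atomic_set(&o->refcount, 0);
}

/* get()/kick() implementations as described in this series */
static void *my_get(struct kmem_cache *s, int nr, void **v);
static void my_kick(struct kmem_cache *s, int nr, void **v, void *private);

static struct kmem_cache *my_cachep;

static int __init my_cache_init(void)
{
	my_cachep = kmem_cache_create("my_cache", sizeof(struct my_obj),
				0, SLAB_RECLAIM_ACCOUNT, my_obj_ctor);
	if (!my_cachep)
		return -ENOMEM;
	/* Register the defrag callbacks. Requires the ctor above. */
	kmem_cache_setup_defrag(my_cachep, my_get, my_kick);
	return 0;
}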
Reviewed-by: Rik van Riel <riel@redhat.com>
Signed-off-by: Christoph Lameter <clameter@sgi.com>
Signed-off-by: Pekka Enberg <penberg@cs.helsinki.fi>
Signed-off-by: Christoph Lameter <cl@linux-foundation.org>
---
include/linux/slab.h | 50 +++++++++++++++++++++++++++++++++++++++++++++++
include/linux/slub_def.h | 3 ++
mm/slub.c | 29 ++++++++++++++++++++++++++-
3 files changed, 81 insertions(+), 1 deletion(-)
Index: linux-2.6/include/linux/slub_def.h
===================================================================
--- linux-2.6.orig/include/linux/slub_def.h 2008-07-31 12:19:28.000000000 -0500
+++ linux-2.6/include/linux/slub_def.h 2008-07-31 12:19:39.000000000 -0500
@@ -86,6 +86,9 @@
gfp_t allocflags; /* gfp flags to use on each alloc */
int refcount; /* Refcount for slab cache destroy */
void (*ctor)(void *);
+ kmem_defrag_get_func *get;
+ kmem_defrag_kick_func *kick;
+
int inuse; /* Offset to metadata */
int align; /* Alignment */
int defrag_ratio; /*
Index: linux-2.6/mm/slub.c
===================================================================
--- linux-2.6.orig/mm/slub.c 2008-07-31 12:19:28.000000000 -0500
+++ linux-2.6/mm/slub.c 2008-07-31 12:19:48.000000000 -0500
@@ -2736,6 +2736,19 @@
}
EXPORT_SYMBOL(kfree);
+void kmem_cache_setup_defrag(struct kmem_cache *s,
+ kmem_defrag_get_func get, kmem_defrag_kick_func kick)
+{
+ /*
+ * Defragmentable slabs must have a ctor otherwise objects may be
+ * in an undetermined state after they are allocated.
+ */
+ BUG_ON(!s->ctor);
+ s->get = get;
+ s->kick = kick;
+}
+EXPORT_SYMBOL(kmem_cache_setup_defrag);
+
/*
* kmem_cache_shrink removes empty slabs from the partial lists and sorts
* the remaining slabs by the number of items in use. The slabs with the
@@ -3029,7 +3042,7 @@
if (slub_nomerge || (s->flags & SLUB_NEVER_MERGE))
return 1;
- if (s->ctor)
+ if (s->ctor || s->kick || s->get)
return 1;
/*
@@ -3812,6 +3825,20 @@
x += sprint_symbol(buf + x, (unsigned long)s->ctor);
x += sprintf(buf + x, "\n");
}
+
+ if (s->get) {
+ x += sprintf(buf + x, "get : ");
+ x += sprint_symbol(buf + x,
+ (unsigned long)s->get);
+ x += sprintf(buf + x, "\n");
+ }
+
+ if (s->kick) {
+ x += sprintf(buf + x, "kick : ");
+ x += sprint_symbol(buf + x,
+ (unsigned long)s->kick);
+ x += sprintf(buf + x, "\n");
+ }
return x;
}
SLAB_ATTR_RO(ops);
Index: linux-2.6/include/linux/slab.h
===================================================================
--- linux-2.6.orig/include/linux/slab.h 2008-07-31 12:19:25.000000000 -0500
+++ linux-2.6/include/linux/slab.h 2008-07-31 12:19:45.000000000 -0500
@@ -102,6 +102,56 @@
size_t ksize(const void *);
/*
+ * Function prototypes passed to kmem_cache_defrag() to enable defragmentation
+ * and targeted reclaim in slab caches.
+ */
+
+/*
+ * kmem_cache_defrag_get_func() is called with locks held so that the slab
+ * objects cannot be freed. We are in an atomic context and no slab
+ * operations may be performed. The purpose of kmem_cache_defrag_get_func()
+ * is to obtain a stable refcount on the objects, so that they cannot be
+ * removed until kmem_cache_kick_func() has handled them.
+ *
+ * Parameters passed are the number of objects to process and an array of
+ * pointers to objects for which we need references.
+ *
+ * Returns a pointer that is passed to the kick function. If any objects
+ * cannot be moved then the pointer may indicate a failure and
+ * then kick can simply remove the references that were already obtained.
+ *
+ * The object pointer array passed is also passed to kmem_cache_defrag_kick().
+ * The function may remove objects from the array by setting pointers to
+ * NULL. This is useful if we can determine that an object is already about
+ * to be removed. In that case it is often impossible to obtain the necessary
+ * refcount.
+ */
+typedef void *kmem_defrag_get_func(struct kmem_cache *, int, void **);
+
+/*
+ * kmem_cache_defrag_kick_func is called with no locks held and interrupts
+ * enabled. Sleeping is possible. Any operation may be performed in kick().
+ * kmem_cache_defrag should free all the objects in the pointer array.
+ *
+ * Parameters passed are the number of objects in the array, the array of
+ * pointers to the objects and the pointer returned by kmem_cache_defrag_get().
+ *
+ * Success is checked by examining the number of remaining objects in the slab.
+ */
+typedef void kmem_defrag_kick_func(struct kmem_cache *, int, void **, void *);
+
+/*
+ * kmem_cache_setup_defrag() is used to setup callbacks for a slab cache.
+ */
+#ifdef CONFIG_SLUB
+void kmem_cache_setup_defrag(struct kmem_cache *, kmem_defrag_get_func,
+ kmem_defrag_kick_func);
+#else
+static inline void kmem_cache_setup_defrag(struct kmem_cache *s,
+ kmem_defrag_get_func get, kmem_defrag_kick_func kick) {}
+#endif
+
+/*
* Allocator specific definitions. These are mainly used to establish optimized
* ways to convert kmalloc() calls to kmem_cache_alloc() invocations by
* selecting the appropriate general cache at compile time.
--
^ permalink raw reply [flat|nested] 64+ messages in thread
* [patch 04/19] slub: Sort slab cache list and establish maximum objects for defrag slabs
2008-05-10 2:21 [patch 00/19] Slab Fragmentation Reduction V13 Christoph Lameter
` (2 preceding siblings ...)
2008-05-10 2:21 ` [patch 03/19] slub: Add get() and kick() methods Christoph Lameter
@ 2008-05-10 2:21 ` Christoph Lameter
2008-05-10 2:21 ` [patch 05/19] slub: Slab defrag core Christoph Lameter
` (15 subsequent siblings)
19 siblings, 0 replies; 64+ messages in thread
From: Christoph Lameter @ 2008-05-10 2:21 UTC (permalink / raw)
To: Pekka Enberg
Cc: akpm, Christoph Lameter, Christoph Lameter, linux-kernel,
linux-fsdevel, Mel Gorman, andi, Rik van Riel, mpm, Dave Chinner
[-- Attachment #1: 0004-SLUB-Sort-slab-cache-list-and-establish-maximum-obj.patch --]
[-- Type: text/plain, Size: 2624 bytes --]
When defragmenting slabs it is advantageous to have all
defragmentable slab caches together at the beginning of the list so that there
is no need to scan the complete list. Put defragmentable caches first when
adding a slab cache and others last.
Determine the maximum number of objects in defragmentable slabs. This allows
us to size the arrays that will later hold references to these objects.
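As a concrete illustration of the sizing done by alloc_scratch() below
(assuming a 64 bit machine): for a cache whose largest slab holds 64 objects,
the scratch space is 64 * sizeof(void *) = 512 bytes for the pointer vector
plus BITS_TO_LONGS(64) * sizeof(unsigned long) = 8 bytes for the object
bitmap, 520 bytes in total.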
Reviewed-by: Rik van Riel <riel@redhat.com>
Signed-off-by: Christoph Lameter <clameter@sgi.com>
Signed-off-by: Pekka Enberg <penberg@cs.helsinki.fi>
Signed-off-by: Christoph Lameter <cl@linux-foundation.org>
---
mm/slub.c | 26 ++++++++++++++++++++++++--
1 file changed, 24 insertions(+), 2 deletions(-)
Index: linux-2.6/mm/slub.c
===================================================================
--- linux-2.6.orig/mm/slub.c 2008-07-31 12:19:28.000000000 -0500
+++ linux-2.6/mm/slub.c 2008-07-31 12:19:45.000000000 -0500
@@ -173,6 +173,9 @@
static DECLARE_RWSEM(slub_lock);
static LIST_HEAD(slab_caches);
+/* Maximum objects in defragmentable slabs */
+static unsigned int max_defrag_slab_objects;
+
/*
* Tracking user of a slab.
*/
@@ -2506,7 +2509,7 @@
flags, NULL))
goto panic;
- list_add(&s->list, &slab_caches);
+ list_add_tail(&s->list, &slab_caches);
up_write(&slub_lock);
if (sysfs_slab_add(s))
goto panic;
@@ -2736,9 +2739,23 @@
}
EXPORT_SYMBOL(kfree);
+/*
+ * Allocate a slab scratch space that is sufficient to keep at least
+ * max_defrag_slab_objects pointers to individual objects and also a bitmap
+ * for max_defrag_slab_objects.
+ */
+static inline void *alloc_scratch(void)
+{
+ return kmalloc(max_defrag_slab_objects * sizeof(void *) +
+ BITS_TO_LONGS(max_defrag_slab_objects) * sizeof(unsigned long),
+ GFP_KERNEL);
+}
+
void kmem_cache_setup_defrag(struct kmem_cache *s,
kmem_defrag_get_func get, kmem_defrag_kick_func kick)
{
+ int max_objects = oo_objects(s->max);
+
/*
* Defragmentable slabs must have a ctor otherwise objects may be
* in an undetermined state after they are allocated.
@@ -2746,6 +2763,11 @@
BUG_ON(!s->ctor);
s->get = get;
s->kick = kick;
+ down_write(&slub_lock);
+ list_move(&s->list, &slab_caches);
+ if (max_objects > max_defrag_slab_objects)
+ max_defrag_slab_objects = max_objects;
+ up_write(&slub_lock);
}
EXPORT_SYMBOL(kmem_cache_setup_defrag);
@@ -3131,7 +3153,7 @@
if (s) {
if (kmem_cache_open(s, GFP_KERNEL, name,
size, align, flags, ctor)) {
- list_add(&s->list, &slab_caches);
+ list_add_tail(&s->list, &slab_caches);
up_write(&slub_lock);
if (sysfs_slab_add(s))
goto err;
--
^ permalink raw reply [flat|nested] 64+ messages in thread
* [patch 05/19] slub: Slab defrag core
2008-05-10 2:21 [patch 00/19] Slab Fragmentation Reduction V13 Christoph Lameter
` (3 preceding siblings ...)
2008-05-10 2:21 ` [patch 04/19] slub: Sort slab cache list and establish maximum objects for defrag slabs Christoph Lameter
@ 2008-05-10 2:21 ` Christoph Lameter
2008-05-10 2:21 ` [patch 06/19] slub: Add KICKABLE to avoid repeated kick() attempts Christoph Lameter
` (14 subsequent siblings)
19 siblings, 0 replies; 64+ messages in thread
From: Christoph Lameter @ 2008-05-10 2:21 UTC (permalink / raw)
To: Pekka Enberg
Cc: akpm, Christoph Lameter, Christoph Lameter, linux-kernel,
linux-fsdevel, Mel Gorman, andi, Rik van Riel, mpm, Dave Chinner
[-- Attachment #1: 0005-SLUB-Slab-defrag-core.patch --]
[-- Type: text/plain, Size: 12916 bytes --]
Slab defragmentation may occur:
1. Unconditionally when kmem_cache_shrink() is called on a slab cache by the
kernel.
2. Through the use of the slabinfo command.
3. Per node defrag conditionally when kmem_cache_defrag(<node>) is called
(can be called from reclaim code with a later patch).
Defragmentation is only performed if the fragmentation of the slab
is lower than the specified percentage. Fragmentation ratios are measured
by calculating the percentage of objects in use compared to the total
number of objects that the slab page can accommodate.
The scanning of slab caches is optimized because the
defragmentable slabs come first on the list. Thus we can terminate scans
on the first slab encountered that does not support defragmentation.
kmem_cache_defrag() takes a node parameter. This can either be -1 if
defragmentation should be performed on all nodes, or a node number.
A couple of functions must be set up via a call to kmem_cache_setup_defrag()
in order for a slab cache to support defragmentation. These are
kmem_defrag_get_func (void *get(struct kmem_cache *s, int nr, void **objects))
Must obtain a reference to the listed objects. SLUB guarantees that
the objects are still allocated. However, other threads may be blocked
in slab_free() attempting to free objects in the slab. These may succeed
as soon as get() returns to the slab allocator. The function must
be able to detect such situations and void the attempts to free such
objects (by for example voiding the corresponding entry in the objects
array).
No slab operations may be performed in get(). Interrupts
are disabled. What can be done is very limited. The slab lock
for the page that contains the object is taken. Any attempt to perform
a slab operation may lead to a deadlock.
kmem_defrag_get_func returns a private pointer that is passed to
kmem_defrag_kick_func(). Should we be unable to obtain all references
then that pointer may indicate to the kick() function that it should
not attempt any object removal or move but simply remove the
reference counts.
kmem_defrag_kick_func (void kick(struct kmem_cache *, int nr, void **objects,
void *get_result))
After SLUB has established references to the objects in a
slab it will then drop all locks and use kick() to move objects out
of the slab. The existence of the object is guaranteed by virtue of
the earlier obtained references via kmem_defrag_get_func(). The
callback may perform any slab operation since no locks are held at
the time of call.
The callback should remove the object from the slab in some way. This
may be accomplished by reclaiming the object and then running
kmem_cache_free() or reallocating it and then running
kmem_cache_free(). Reallocation is advantageous because the partial
list was just sorted to put the slabs with the most objects first.
Reallocation is then likely to fill up a slab in addition to freeing
up one slab. A filled up slab can also be removed from the partial
list. So there could be a double effect.
kmem_defrag_kick_func() does not return a result. SLUB will check
the number of remaining objects in the slab. If all objects were
removed then the operation was successful.
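To illustrate the reallocation strategy, a kick() callback could, under the
constraints above, look roughly like the following sketch (the my_obj type
and the my_obj_relocate()/my_obj_put() helpers are made up):

static void my_kick(struct kmem_cache *s, int nr, void **v, void *private)
{
	int i;

	for (i = 0; i < nr; i++) {
		struct my_obj *old = v[i];
		struct my_obj *new;

		if (!old)
			continue;	/* entry voided by get() */

		new = kmem_cache_alloc(s, GFP_KERNEL);
		if (new)
			my_obj_relocate(old, new);	/* move state and users */

		my_obj_put(old);	/* drop ref from get(); may free old */
	}
}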
Reviewed-by: Rik van Riel <riel@redhat.com>
Signed-off-by: Christoph Lameter <clameter@sgi.com>
Signed-off-by: Pekka Enberg <penberg@cs.helsinki.fi>
Signed-off-by: Christoph Lameter <cl@linux-foundation.org>
---
include/linux/slab.h | 3
mm/slub.c | 265 ++++++++++++++++++++++++++++++++++++++++-----------
2 files changed, 215 insertions(+), 53 deletions(-)
Index: linux-2.6/mm/slub.c
===================================================================
--- linux-2.6.orig/mm/slub.c 2008-07-31 12:19:28.000000000 -0500
+++ linux-2.6/mm/slub.c 2008-07-31 12:19:42.000000000 -0500
@@ -127,10 +127,10 @@
/*
* Maximum number of desirable partial slabs.
- * The existence of more partial slabs makes kmem_cache_shrink
- * sort the partial list by the number of objects in the.
+ * More slabs cause kmem_cache_shrink to sort the slabs by objects
+ * and triggers slab defragmentation.
*/
-#define MAX_PARTIAL 10
+#define MAX_PARTIAL 20
#define DEBUG_DEFAULT_FLAGS (SLAB_DEBUG_FREE | SLAB_RED_ZONE | \
SLAB_POISON | SLAB_STORE_USER)
@@ -2772,76 +2772,235 @@
EXPORT_SYMBOL(kmem_cache_setup_defrag);
/*
- * kmem_cache_shrink removes empty slabs from the partial lists and sorts
- * the remaining slabs by the number of items in use. The slabs with the
- * most items in use come first. New allocations will then fill those up
- * and thus they can be removed from the partial lists.
+ * Vacate all objects in the given slab.
*
- * The slabs with the least items are placed last. This results in them
- * being allocated from last increasing the chance that the last objects
- * are freed in them.
+ * The scratch area passed to the list function is sufficient to hold
+ * struct list_head times objects per slab. We use it to hold void ** times
+ * objects per slab plus a bitmap for each object.
*/
-int kmem_cache_shrink(struct kmem_cache *s)
+static int kmem_cache_vacate(struct page *page, void *scratch)
{
- int node;
- int i;
- struct kmem_cache_node *n;
- struct page *page;
- struct page *t;
- int objects = oo_objects(s->max);
- struct list_head *slabs_by_inuse =
- kmalloc(sizeof(struct list_head) * objects, GFP_KERNEL);
+ void **vector = scratch;
+ void *p;
+ void *addr = page_address(page);
+ struct kmem_cache *s;
+ unsigned long *map;
+ int leftover;
+ int count;
+ void *private;
unsigned long flags;
+ unsigned long objects;
- if (!slabs_by_inuse)
- return -ENOMEM;
+ local_irq_save(flags);
+ slab_lock(page);
- flush_all(s);
- for_each_node_state(node, N_NORMAL_MEMORY) {
- n = get_node(s, node);
+ BUG_ON(!PageSlab(page)); /* Must be a slab page */
+ BUG_ON(!SlabFrozen(page)); /* Slab must have been frozen earlier */
+
+ s = page->slab;
+ objects = page->objects;
+ map = scratch + objects * sizeof(void **);
+ if (!page->inuse || !s->kick)
+ goto out;
+
+ /* Determine used objects */
+ bitmap_fill(map, objects);
+ for_each_free_object(p, s, page->freelist)
+ __clear_bit(slab_index(p, s, addr), map);
+
+ /* Build vector of pointers to objects */
+ count = 0;
+ memset(vector, 0, objects * sizeof(void **));
+ for_each_object(p, s, addr, objects)
+ if (test_bit(slab_index(p, s, addr), map))
+ vector[count++] = p;
+
+ private = s->get(s, count, vector);
+
+ /*
+ * Got references. Now we can drop the slab lock. The slab
+ * is frozen so it cannot vanish from under us nor will
+ * allocations be performed on the slab. However, unlocking the
+ * slab will allow concurrent slab_frees to proceed.
+ */
+ slab_unlock(page);
+ local_irq_restore(flags);
+
+ /*
+ * Perform the KICK callbacks to remove the objects.
+ */
+ s->kick(s, count, vector, private);
+
+ local_irq_save(flags);
+ slab_lock(page);
+out:
+ /*
+ * Check the result and unfreeze the slab
+ */
+ leftover = page->inuse;
+ unfreeze_slab(s, page, leftover > 0);
+ local_irq_restore(flags);
+ return leftover;
+}
+
+/*
+ * Remove objects from a list of slab pages that have been gathered.
+ * Must be called with slabs that have been isolated before.
+ *
+ * kmem_cache_reclaim() is never called from an atomic context. It
+ * allocates memory for temporary storage. We are holding the
+ * slub_lock semaphore which prevents another call into
+ * the defrag logic.
+ */
+int kmem_cache_reclaim(struct list_head *zaplist)
+{
+ int freed = 0;
+ void **scratch;
+ struct page *page;
+ struct page *page2;
+
+ if (list_empty(zaplist))
+ return 0;
+
+ scratch = alloc_scratch();
+ if (!scratch)
+ return 0;
+
+ list_for_each_entry_safe(page, page2, zaplist, lru) {
+ list_del(&page->lru);
+ if (kmem_cache_vacate(page, scratch) == 0)
+ freed++;
+ }
+ kfree(scratch);
+ return freed;
+}
+
+/*
+ * Shrink the slab cache on a particular node of the cache
+ * by releasing slabs with zero objects and trying to reclaim
+ * slabs with less than the configured percentage of objects allocated.
+ */
+static unsigned long __kmem_cache_shrink(struct kmem_cache *s, int node,
+ unsigned long limit)
+{
+ unsigned long flags;
+ struct page *page, *page2;
+ LIST_HEAD(zaplist);
+ int freed = 0;
+ struct kmem_cache_node *n = get_node(s, node);
- if (!n->nr_partial)
+ if (n->nr_partial <= limit)
+ return 0;
+
+ spin_lock_irqsave(&n->list_lock, flags);
+ list_for_each_entry_safe(page, page2, &n->partial, lru) {
+ if (!slab_trylock(page))
+ /* Busy slab. Get out of the way */
continue;
- for (i = 0; i < objects; i++)
- INIT_LIST_HEAD(slabs_by_inuse + i);
+ if (page->inuse) {
+ if (page->inuse * 100 >=
+ s->defrag_ratio * page->objects) {
+ slab_unlock(page);
+ /* Slab contains enough objects */
+ continue;
+ }
- spin_lock_irqsave(&n->list_lock, flags);
+ list_move(&page->lru, &zaplist);
+ if (s->kick) {
+ n->nr_partial--;
+ SetSlabFrozen(page);
+ }
+ slab_unlock(page);
+ } else {
+ /* Empty slab page */
+ list_del(&page->lru);
+ n->nr_partial--;
+ slab_unlock(page);
+ discard_slab(s, page);
+ freed++;
+ }
+ }
+ if (!s->kick)
/*
- * Build lists indexed by the items in use in each slab.
+ * No defrag methods. By simply putting the zaplist at the
+ * end of the partial list we can let them simmer longer
+ * and thus increase the chance of all objects being
+ * reclaimed.
*
- * Note that concurrent frees may occur while we hold the
- * list_lock. page->inuse here is the upper limit.
+ * We have effectively sorted the partial list and put
+ * the slabs with more objects first. As soon as they
+ * are allocated they are going to be removed from the
+ * partial list.
*/
- list_for_each_entry_safe(page, t, &n->partial, lru) {
- if (!page->inuse && slab_trylock(page)) {
- /*
- * Must hold slab lock here because slab_free
- * may have freed the last object and be
- * waiting to release the slab.
- */
- list_del(&page->lru);
- n->nr_partial--;
- slab_unlock(page);
- discard_slab(s, page);
- } else {
- list_move(&page->lru,
- slabs_by_inuse + page->inuse);
- }
- }
+ list_splice(&zaplist, n->partial.prev);
+
+
+ spin_unlock_irqrestore(&n->list_lock, flags);
+
+ if (s->kick)
+ freed += kmem_cache_reclaim(&zaplist);
+
+ return freed;
+}
+
+/*
+ * Defrag slabs conditional on the amount of fragmentation in a page.
+ */
+int kmem_cache_defrag(int node)
+{
+ struct kmem_cache *s;
+ unsigned long slabs = 0;
+
+ /*
+ * kmem_cache_defrag may be called from the reclaim path which may be
+ * called for any page allocator alloc. So there is the danger that we
+ * get called in a situation where slub already acquired the slub_lock
+ * for other purposes.
+ */
+ if (!down_read_trylock(&slub_lock))
+ return 0;
+
+ list_for_each_entry(s, &slab_caches, list) {
+ unsigned long reclaimed = 0;
/*
- * Rebuild the partial list with the slabs filled up most
- * first and the least used slabs at the end.
+ * Defragmentable caches come first. If the slab cache is not
+ * defragmentable then we can stop traversing the list.
*/
- for (i = objects - 1; i >= 0; i--)
- list_splice(slabs_by_inuse + i, n->partial.prev);
+ if (!s->kick)
+ break;
- spin_unlock_irqrestore(&n->list_lock, flags);
+ if (node == -1) {
+ int nid;
+
+ for_each_node_state(nid, N_NORMAL_MEMORY)
+ reclaimed += __kmem_cache_shrink(s, nid,
+ MAX_PARTIAL);
+ } else
+ reclaimed = __kmem_cache_shrink(s, node, MAX_PARTIAL);
+
+ slabs += reclaimed;
}
+ up_read(&slub_lock);
+ return slabs;
+}
+EXPORT_SYMBOL(kmem_cache_defrag);
+
+/*
+ * kmem_cache_shrink removes empty slabs from the partial lists.
+ * If the slab cache supports defragmentation then objects are
+ * reclaimed.
+ */
+int kmem_cache_shrink(struct kmem_cache *s)
+{
+ int node;
+
+ flush_all(s);
+ for_each_node_state(node, N_NORMAL_MEMORY)
+ __kmem_cache_shrink(s, node, 0);
- kfree(slabs_by_inuse);
return 0;
}
EXPORT_SYMBOL(kmem_cache_shrink);
Index: linux-2.6/include/linux/slab.h
===================================================================
--- linux-2.6.orig/include/linux/slab.h 2008-07-31 12:19:28.000000000 -0500
+++ linux-2.6/include/linux/slab.h 2008-07-31 12:19:28.000000000 -0500
@@ -142,13 +142,16 @@
/*
* kmem_cache_setup_defrag() is used to setup callbacks for a slab cache.
+ * kmem_cache_defrag() performs the actual defragmentation.
*/
#ifdef CONFIG_SLUB
void kmem_cache_setup_defrag(struct kmem_cache *, kmem_defrag_get_func,
kmem_defrag_kick_func);
+int kmem_cache_defrag(int node);
#else
static inline void kmem_cache_setup_defrag(struct kmem_cache *s,
kmem_defrag_get_func get, kmem_defrag_kick_func kick) {}
+static inline int kmem_cache_defrag(int node) { return 0; }
#endif
/*
--
^ permalink raw reply [flat|nested] 64+ messages in thread
* [patch 06/19] slub: Add KICKABLE to avoid repeated kick() attempts
2008-05-10 2:21 [patch 00/19] Slab Fragmentation Reduction V13 Christoph Lameter
` (4 preceding siblings ...)
2008-05-10 2:21 ` [patch 05/19] slub: Slab defrag core Christoph Lameter
@ 2008-05-10 2:21 ` Christoph Lameter
2008-05-10 2:21 ` [patch 07/19] slub: Extend slabinfo to support -D and -F options Christoph Lameter
` (13 subsequent siblings)
19 siblings, 0 replies; 64+ messages in thread
From: Christoph Lameter @ 2008-05-10 2:21 UTC (permalink / raw)
To: Pekka Enberg
Cc: akpm, Christoph Lameter, Christoph Lameter, linux-kernel,
linux-fsdevel, Mel Gorman, andi, Rik van Riel, mpm, Dave Chinner
[-- Attachment #1: 0006-SLUB-Add-KICKABLE-to-avoid-repeated-kick-attempts.patch --]
[-- Type: text/plain, Size: 3530 bytes --]
Add a flag KICKABLE to be set on slabs with a defragmentation method.
Clear the flag if a kick action is not successful in reducing the
number of objects in a slab. This will avoid future attempts to
kick objects out.
The KICKABLE flag is set again when all objects of the slab have been
allocated (this occurs when a slab is removed from the partial lists).
Reviewed-by: Rik van Riel <riel@redhat.com>
Signed-off-by: Christoph Lameter <clameter@sgi.com>
Signed-off-by: Pekka Enberg <penberg@cs.helsinki.fi>
Signed-off-by: Christoph Lameter <cl@linux-foundation.org>
---
mm/slub.c | 35 ++++++++++++++++++++++++++++++++---
1 file changed, 32 insertions(+), 3 deletions(-)
Index: linux-2.6/mm/slub.c
===================================================================
--- linux-2.6.orig/mm/slub.c 2008-07-31 12:19:28.000000000 -0500
+++ linux-2.6/mm/slub.c 2008-07-31 12:19:39.000000000 -0500
@@ -1130,6 +1130,9 @@
SLAB_STORE_USER | SLAB_TRACE))
__SetPageSlubDebug(page);
+ if (s->kick)
+ __SetPageSlubKickable(page);
+
start = page_address(page);
if (unlikely(s->flags & SLAB_POISON))
@@ -1170,6 +1173,7 @@
NR_SLAB_RECLAIMABLE : NR_SLAB_UNRECLAIMABLE,
-pages);
+ __ClearPageSlubKickable(page);
__ClearPageSlab(page);
reset_page_mapcount(page);
__free_pages(page, order);
@@ -1380,6 +1384,8 @@
if (SLABDEBUG && PageSlubDebug(page) &&
(s->flags & SLAB_STORE_USER))
add_full(n, page);
+ if (s->kick)
+ __SetPageSlubKickable(page);
}
slab_unlock(page);
} else {
@@ -2795,12 +2801,12 @@
slab_lock(page);
BUG_ON(!PageSlab(page)); /* Must be a slab page */
- BUG_ON(!SlabFrozen(page)); /* Slab must have been frozen earlier */
+ BUG_ON(!PageSlubFrozen(page)); /* Slab must have been frozen earlier */
s = page->slab;
objects = page->objects;
map = scratch + objects * sizeof(void **);
- if (!page->inuse || !s->kick)
+ if (!page->inuse || !s->kick || !PageSlubKickable(page))
goto out;
/* Determine used objects */
@@ -2838,6 +2844,9 @@
* Check the result and unfreeze the slab
*/
leftover = page->inuse;
+ if (leftover)
+ /* Unsuccessful reclaim. Avoid future reclaim attempts. */
+ __ClearPageSlubKickable(page);
unfreeze_slab(s, page, leftover > 0);
local_irq_restore(flags);
return leftover;
@@ -2899,17 +2908,21 @@
continue;
if (page->inuse) {
- if (page->inuse * 100 >=
+ if (!PageSlubKickable(page) || page->inuse * 100 >=
s->defrag_ratio * page->objects) {
slab_unlock(page);
- /* Slab contains enough objects */
+ /*
+ * Slab contains enough objects
+ * or we already tried reclaim before and
+ * it failed. Skip this one.
+ */
continue;
}
list_move(&page->lru, &zaplist);
if (s->kick) {
n->nr_partial--;
- SetSlabFrozen(page);
+ __SetPageSlubFrozen(page);
}
slab_unlock(page);
} else {
Index: linux-2.6/include/linux/page-flags.h
===================================================================
--- linux-2.6.orig/include/linux/page-flags.h 2008-07-31 12:19:25.000000000 -0500
+++ linux-2.6/include/linux/page-flags.h 2008-07-31 12:19:28.000000000 -0500
@@ -112,6 +112,7 @@
/* SLUB */
PG_slub_frozen = PG_active,
PG_slub_debug = PG_error,
+ PG_slub_kickable = PG_dirty,
};
#ifndef __GENERATING_BOUNDS_H
@@ -182,6 +183,7 @@
__PAGEFLAG(SlubFrozen, slub_frozen)
__PAGEFLAG(SlubDebug, slub_debug)
+__PAGEFLAG(SlubKickable, slub_kickable)
/*
* Only test-and-set exist for PG_writeback. The unconditional operators are
--
^ permalink raw reply [flat|nested] 64+ messages in thread
* [patch 07/19] slub: Extend slabinfo to support -D and -F options
2008-05-10 2:21 [patch 00/19] Slab Fragmentation Reduction V13 Christoph Lameter
` (5 preceding siblings ...)
2008-05-10 2:21 ` [patch 06/19] slub: Add KICKABLE to avoid repeated kick() attempts Christoph Lameter
@ 2008-05-10 2:21 ` Christoph Lameter
2008-05-10 2:21 ` [patch 08/19] slub/slabinfo: add defrag statistics Christoph Lameter
` (12 subsequent siblings)
19 siblings, 0 replies; 64+ messages in thread
From: Christoph Lameter @ 2008-05-10 2:21 UTC (permalink / raw)
To: Pekka Enberg
Cc: akpm, Christoph Lameter, Christoph Lameter, linux-kernel,
linux-fsdevel, Mel Gorman, andi, Rik van Riel, mpm, Dave Chinner
[-- Attachment #1: 0007-SLUB-Extend-slabinfo-to-support-D-and-F-options.patch --]
[-- Type: text/plain, Size: 5707 bytes --]
-F lists caches that support defragmentation.
-C lists caches that use a ctor.
Change field names for defrag_ratio and remote_node_defrag_ratio.
Add determination of the allocation ratio for a slab. The allocation ratio
is the percentage of available object slots that are in use.
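For example, restricting the listing to defragmentable caches could then look
roughly like this (all numbers are purely illustrative):

# slabinfo -F
Name                   Objects Objsize    Space Slabs/Part/Cpu  O/S O %Ra %Ef Flg
dentry                   20825     208     4.6M      138/12/4    19 0   2  89 FCa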
Reviewed-by: Rik van Riel <riel@redhat.com>
Signed-off-by: Christoph Lameter <clameter@sgi.com>
Signed-off-by: Pekka Enberg <penberg@cs.helsinki.fi>
Signed-off-by: Christoph Lameter <cl@linux-foundation.org>
---
Documentation/vm/slabinfo.c | 48 +++++++++++++++++++++++++++++++++++++++-----
1 file changed, 43 insertions(+), 5 deletions(-)
Index: linux-next/Documentation/vm/slabinfo.c
===================================================================
--- linux-next.orig/Documentation/vm/slabinfo.c 2008-07-09 09:06:12.000000000 -0500
+++ linux-next/Documentation/vm/slabinfo.c 2008-07-09 09:33:37.000000000 -0500
@@ -31,6 +31,8 @@
int hwcache_align, object_size, objs_per_slab;
int sanity_checks, slab_size, store_user, trace;
int order, poison, reclaim_account, red_zone;
+ int defrag, ctor;
+ int defrag_ratio, remote_node_defrag_ratio;
unsigned long partial, objects, slabs, objects_partial, objects_total;
unsigned long alloc_fastpath, alloc_slowpath;
unsigned long free_fastpath, free_slowpath;
@@ -64,6 +66,8 @@
int skip_zero = 1;
int show_numa = 0;
int show_track = 0;
+int show_defrag = 0;
+int show_ctor = 0;
int show_first_alias = 0;
int validate = 0;
int shrink = 0;
@@ -100,13 +104,15 @@
void usage(void)
{
printf("slabinfo 5/7/2007. (c) 2007 sgi.\n\n"
- "slabinfo [-ahnpvtsz] [-d debugopts] [slab-regexp]\n"
+ "slabinfo [-aCdDefFhnpvtsz] [-d debugopts] [slab-regexp]\n"
"-a|--aliases Show aliases\n"
"-A|--activity Most active slabs first\n"
"-d<options>|--debug=<options> Set/Clear Debug options\n"
+ "-C|--ctor Show slabs with ctors\n"
"-D|--display-active Switch line format to activity\n"
"-e|--empty Show empty slabs\n"
"-f|--first-alias Show first alias\n"
+ "-F|--defrag Show defragmentable caches\n"
"-h|--help Show usage information\n"
"-i|--inverted Inverted list\n"
"-l|--slabs Show slabs\n"
@@ -296,7 +302,7 @@
printf("Name Objects Alloc Free %%Fast Fallb O\n");
else
printf("Name Objects Objsize Space "
- "Slabs/Part/Cpu O/S O %%Fr %%Ef Flg\n");
+ "Slabs/Part/Cpu O/S O %%Ra %%Ef Flg\n");
}
/*
@@ -345,7 +351,7 @@
return;
if (!line) {
- printf("\n%-21s:", mode ? "NUMA nodes" : "Slab");
+ printf("\n%-21s: Rto ", mode ? "NUMA nodes" : "Slab");
for(node = 0; node <= highest_node; node++)
printf(" %4d", node);
printf("\n----------------------");
@@ -354,6 +360,7 @@
printf("\n");
}
printf("%-21s ", mode ? "All slabs" : s->name);
+ printf("%3d ", s->remote_node_defrag_ratio);
for(node = 0; node <= highest_node; node++) {
char b[20];
@@ -492,6 +499,8 @@
printf("** Slabs are destroyed via RCU\n");
if (s->reclaim_account)
printf("** Reclaim accounting active\n");
+ if (s->defrag)
+ printf("** Defragmentation at %d%%\n", s->defrag_ratio);
printf("\nSizes (bytes) Slabs Debug Memory\n");
printf("------------------------------------------------------------------------\n");
@@ -539,6 +548,12 @@
if (show_empty && s->slabs)
return;
+ if (show_defrag && !s->defrag)
+ return;
+
+ if (show_ctor && !s->ctor)
+ return;
+
store_size(size_str, slab_size(s));
snprintf(dist_str, 40, "%lu/%lu/%d", s->slabs - s->cpu_slabs,
s->partial, s->cpu_slabs);
@@ -550,6 +565,10 @@
*p++ = '*';
if (s->cache_dma)
*p++ = 'd';
+ if (s->defrag)
+ *p++ = 'F';
+ if (s->ctor)
+ *p++ = 'C';
if (s->hwcache_align)
*p++ = 'A';
if (s->poison)
@@ -584,7 +603,8 @@
printf("%-21s %8ld %7d %8s %14s %4d %1d %3ld %3ld %s\n",
s->name, s->objects, s->object_size, size_str, dist_str,
s->objs_per_slab, s->order,
- s->slabs ? (s->partial * 100) / s->slabs : 100,
+ s->slabs ? (s->partial * 100) /
+ (s->slabs * s->objs_per_slab) : 100,
s->slabs ? (s->objects * s->object_size * 100) /
(s->slabs * (page_size << s->order)) : 100,
flags);
@@ -1190,7 +1210,17 @@
slab->deactivate_to_tail = get_obj("deactivate_to_tail");
slab->deactivate_remote_frees = get_obj("deactivate_remote_frees");
slab->order_fallback = get_obj("order_fallback");
+ slab->defrag_ratio = get_obj("defrag_ratio");
+ slab->remote_node_defrag_ratio =
+ get_obj("remote_node_defrag_ratio");
chdir("..");
+ if (read_slab_obj(slab, "ops")) {
+ if (strstr(buffer, "ctor :"))
+ slab->ctor = 1;
+ if (strstr(buffer, "kick :"))
+ slab->defrag = 1;
+ }
+
if (slab->name[0] == ':')
alias_targets++;
slab++;
@@ -1241,10 +1271,12 @@
struct option opts[] = {
{ "aliases", 0, NULL, 'a' },
{ "activity", 0, NULL, 'A' },
+ { "ctor", 0, NULL, 'C' },
{ "debug", 2, NULL, 'd' },
{ "display-activity", 0, NULL, 'D' },
{ "empty", 0, NULL, 'e' },
{ "first-alias", 0, NULL, 'f' },
+ { "defrag", 0, NULL, 'F' },
{ "help", 0, NULL, 'h' },
{ "inverted", 0, NULL, 'i'},
{ "numa", 0, NULL, 'n' },
@@ -1267,7 +1299,7 @@
page_size = getpagesize();
- while ((c = getopt_long(argc, argv, "aAd::Defhil1noprstvzTS",
+ while ((c = getopt_long(argc, argv, "aACd::DefFhil1noprstvzTS",
opts, NULL)) != -1)
switch (c) {
case '1':
@@ -1323,6 +1355,12 @@
case 'z':
skip_zero = 0;
break;
+ case 'C':
+ show_ctor = 1;
+ break;
+ case 'F':
+ show_defrag = 1;
+ break;
case 'T':
show_totals = 1;
break;
--
^ permalink raw reply [flat|nested] 64+ messages in thread
* [patch 08/19] slub/slabinfo: add defrag statistics
2008-05-10 2:21 [patch 00/19] Slab Fragmentation Reduction V13 Christoph Lameter
` (6 preceding siblings ...)
2008-05-10 2:21 ` [patch 07/19] slub: Extend slabinfo to support -D and -F options Christoph Lameter
@ 2008-05-10 2:21 ` Christoph Lameter
2008-05-10 2:21 ` [patch 09/19] slub: Trigger defragmentation from memory reclaim Christoph Lameter
` (11 subsequent siblings)
19 siblings, 0 replies; 64+ messages in thread
From: Christoph Lameter @ 2008-05-10 2:21 UTC (permalink / raw)
To: Pekka Enberg
Cc: akpm, Christoph Lameter, Christoph Lameter, linux-kernel,
linux-fsdevel, Mel Gorman, andi, Rik van Riel, mpm, Dave Chinner
[-- Attachment #1: 0008-slub-add-defrag-statistics.patch --]
[-- Type: text/plain, Size: 8843 bytes --]
Add statistics counters for slab defragmentation.
Signed-off-by: Christoph Lameter <clameter@sgi.com>
Signed-off-by: Pekka Enberg <penberg@cs.helsinki.fi>
Signed-off-by: Christoph Lameter <cl@linux-foundation.org>
---
Documentation/vm/slabinfo.c | 45 ++++++++++++++++++++++++++++++++++++--------
include/linux/slub_def.h | 6 +++++
mm/slub.c | 29 ++++++++++++++++++++++++++--
3 files changed, 70 insertions(+), 10 deletions(-)
Index: linux-2.6/Documentation/vm/slabinfo.c
===================================================================
--- linux-2.6.orig/Documentation/vm/slabinfo.c 2008-07-31 12:18:58.000000000 -0500
+++ linux-2.6/Documentation/vm/slabinfo.c 2008-07-31 12:18:58.000000000 -0500
@@ -41,6 +41,9 @@
unsigned long cpuslab_flush, deactivate_full, deactivate_empty;
unsigned long deactivate_to_head, deactivate_to_tail;
unsigned long deactivate_remote_frees, order_fallback;
+ unsigned long shrink_calls, shrink_attempt_defrag, shrink_empty_slab;
+ unsigned long shrink_slab_skipped, shrink_slab_reclaimed;
+ unsigned long shrink_object_reclaim_failed;
int numa[MAX_NODES];
int numa_partial[MAX_NODES];
} slabinfo[MAX_SLABS];
@@ -79,6 +82,7 @@
int set_debug = 0;
int show_ops = 0;
int show_activity = 0;
+int show_defragcount = 0;
/* Debug options */
int sanity = 0;
@@ -113,6 +117,7 @@
"-e|--empty Show empty slabs\n"
"-f|--first-alias Show first alias\n"
"-F|--defrag Show defragmentable caches\n"
+ "-G:--display-defrag Display defrag counters\n"
"-h|--help Show usage information\n"
"-i|--inverted Inverted list\n"
"-l|--slabs Show slabs\n"
@@ -300,6 +305,8 @@
{
if (show_activity)
printf("Name Objects Alloc Free %%Fast Fallb O\n");
+ else if (show_defragcount)
+ printf("Name Objects DefragRQ Slabs Success Empty Skipped Failed\n");
else
printf("Name Objects Objsize Space "
"Slabs/Part/Cpu O/S O %%Ra %%Ef Flg\n");
@@ -466,22 +473,28 @@
printf("Total %8lu %8lu\n\n", total_alloc, total_free);
- if (s->cpuslab_flush)
- printf("Flushes %8lu\n", s->cpuslab_flush);
-
- if (s->alloc_refill)
- printf("Refill %8lu\n", s->alloc_refill);
+ if (s->cpuslab_flush || s->alloc_refill)
+ printf("CPU Slab : Flushes=%lu Refills=%lu\n",
+ s->cpuslab_flush, s->alloc_refill);
total = s->deactivate_full + s->deactivate_empty +
s->deactivate_to_head + s->deactivate_to_tail;
if (total)
- printf("Deactivate Full=%lu(%lu%%) Empty=%lu(%lu%%) "
+ printf("Deactivate: Full=%lu(%lu%%) Empty=%lu(%lu%%) "
"ToHead=%lu(%lu%%) ToTail=%lu(%lu%%)\n",
s->deactivate_full, (s->deactivate_full * 100) / total,
s->deactivate_empty, (s->deactivate_empty * 100) / total,
s->deactivate_to_head, (s->deactivate_to_head * 100) / total,
s->deactivate_to_tail, (s->deactivate_to_tail * 100) / total);
+
+ if (s->shrink_calls)
+ printf("Shrink : Calls=%lu Attempts=%lu Empty=%lu Successful=%lu\n",
+ s->shrink_calls, s->shrink_attempt_defrag,
+ s->shrink_empty_slab, s->shrink_slab_reclaimed);
+ if (s->shrink_slab_skipped || s->shrink_object_reclaim_failed)
+ printf("Defrag : Slabs skipped=%lu Object reclaim failed=%lu\n",
+ s->shrink_slab_skipped, s->shrink_object_reclaim_failed);
}
void report(struct slabinfo *s)
@@ -598,7 +611,12 @@
total_alloc ? (s->alloc_fastpath * 100 / total_alloc) : 0,
total_free ? (s->free_fastpath * 100 / total_free) : 0,
s->order_fallback, s->order);
- }
+ } else
+ if (show_defragcount)
+ printf("%-21s %8ld %7d %7d %7d %7d %7d %7d\n",
+ s->name, s->objects, s->shrink_calls, s->shrink_attempt_defrag,
+ s->shrink_slab_reclaimed, s->shrink_empty_slab,
+ s->shrink_slab_skipped, s->shrink_object_reclaim_failed);
else
printf("%-21s %8ld %7d %8s %14s %4d %1d %3ld %3ld %s\n",
s->name, s->objects, s->object_size, size_str, dist_str,
@@ -1210,6 +1228,13 @@
slab->deactivate_to_tail = get_obj("deactivate_to_tail");
slab->deactivate_remote_frees = get_obj("deactivate_remote_frees");
slab->order_fallback = get_obj("order_fallback");
+ slab->shrink_calls = get_obj("shrink_calls");
+ slab->shrink_attempt_defrag = get_obj("shrink_attempt_defrag");
+ slab->shrink_empty_slab = get_obj("shrink_empty_slab");
+ slab->shrink_slab_skipped = get_obj("shrink_slab_skipped");
+ slab->shrink_slab_reclaimed = get_obj("shrink_slab_reclaimed");
+ slab->shrink_object_reclaim_failed =
+ get_obj("shrink_object_reclaim_failed");
slab->defrag_ratio = get_obj("defrag_ratio");
slab->remote_node_defrag_ratio =
get_obj("remote_node_defrag_ratio");
@@ -1274,6 +1299,7 @@
{ "ctor", 0, NULL, 'C' },
{ "debug", 2, NULL, 'd' },
{ "display-activity", 0, NULL, 'D' },
+ { "display-defrag", 0, NULL, 'G' },
{ "empty", 0, NULL, 'e' },
{ "first-alias", 0, NULL, 'f' },
{ "defrag", 0, NULL, 'F' },
@@ -1299,7 +1325,7 @@
page_size = getpagesize();
- while ((c = getopt_long(argc, argv, "aACd::DefFhil1noprstvzTS",
+ while ((c = getopt_long(argc, argv, "aACd::DefFGhil1noprstvzTS",
opts, NULL)) != -1)
switch (c) {
case '1':
@@ -1325,6 +1351,9 @@
case 'f':
show_first_alias = 1;
break;
+ case 'G':
+ show_defragcount = 1;
+ break;
case 'h':
usage();
return 0;
Index: linux-2.6/include/linux/slub_def.h
===================================================================
--- linux-2.6.orig/include/linux/slub_def.h 2008-07-31 12:18:58.000000000 -0500
+++ linux-2.6/include/linux/slub_def.h 2008-07-31 12:18:58.000000000 -0500
@@ -30,6 +30,12 @@
DEACTIVATE_TO_TAIL, /* Cpu slab was moved to the tail of partials */
DEACTIVATE_REMOTE_FREES,/* Slab contained remotely freed objects */
ORDER_FALLBACK, /* Number of times fallback was necessary */
+ SHRINK_CALLS, /* Number of invocations of kmem_cache_shrink */
+ SHRINK_ATTEMPT_DEFRAG, /* Slabs that were attempted to be reclaimed */
+ SHRINK_EMPTY_SLAB, /* Shrink encountered and freed empty slab */
+ SHRINK_SLAB_SKIPPED, /* Slab reclaim skipped a slab (busy etc) */
+ SHRINK_SLAB_RECLAIMED, /* Successfully reclaimed slabs */
+ SHRINK_OBJECT_RECLAIM_FAILED, /* Callbacks signaled busy objects */
NR_SLUB_STAT_ITEMS };
struct kmem_cache_cpu {
Index: linux-2.6/mm/slub.c
===================================================================
--- linux-2.6.orig/mm/slub.c 2008-07-31 12:18:58.000000000 -0500
+++ linux-2.6/mm/slub.c 2008-07-31 12:18:58.000000000 -0500
@@ -2796,6 +2796,7 @@
void *private;
unsigned long flags;
unsigned long objects;
+ struct kmem_cache_cpu *c;
local_irq_save(flags);
slab_lock(page);
@@ -2844,9 +2845,13 @@
* Check the result and unfreeze the slab
*/
leftover = page->inuse;
- if (leftover)
+ c = get_cpu_slab(s, smp_processor_id());
+ if (leftover) {
/* Unsuccessful reclaim. Avoid future reclaim attempts. */
+ stat(c, SHRINK_OBJECT_RECLAIM_FAILED);
__ClearPageSlubKickable(page);
+ } else
+ stat(c, SHRINK_SLAB_RECLAIMED);
unfreeze_slab(s, page, leftover > 0);
local_irq_restore(flags);
return leftover;
@@ -2897,11 +2902,14 @@
LIST_HEAD(zaplist);
int freed = 0;
struct kmem_cache_node *n = get_node(s, node);
+ struct kmem_cache_cpu *c;
if (n->nr_partial <= limit)
return 0;
spin_lock_irqsave(&n->list_lock, flags);
+ c = get_cpu_slab(s, smp_processor_id());
+ stat(c, SHRINK_CALLS);
list_for_each_entry_safe(page, page2, &n->partial, lru) {
if (!slab_trylock(page))
/* Busy slab. Get out of the way */
@@ -2921,12 +2929,14 @@
list_move(&page->lru, &zaplist);
if (s->kick) {
+ stat(c, SHRINK_ATTEMPT_DEFRAG);
n->nr_partial--;
__SetPageSlubFrozen(page);
}
slab_unlock(page);
} else {
/* Empty slab page */
+ stat(c, SHRINK_EMPTY_SLAB);
list_del(&page->lru);
n->nr_partial--;
slab_unlock(page);
@@ -4355,6 +4365,12 @@
STAT_ATTR(DEACTIVATE_TO_TAIL, deactivate_to_tail);
STAT_ATTR(DEACTIVATE_REMOTE_FREES, deactivate_remote_frees);
STAT_ATTR(ORDER_FALLBACK, order_fallback);
+STAT_ATTR(SHRINK_CALLS, shrink_calls);
+STAT_ATTR(SHRINK_ATTEMPT_DEFRAG, shrink_attempt_defrag);
+STAT_ATTR(SHRINK_EMPTY_SLAB, shrink_empty_slab);
+STAT_ATTR(SHRINK_SLAB_SKIPPED, shrink_slab_skipped);
+STAT_ATTR(SHRINK_SLAB_RECLAIMED, shrink_slab_reclaimed);
+STAT_ATTR(SHRINK_OBJECT_RECLAIM_FAILED, shrink_object_reclaim_failed);
#endif
static struct attribute *slab_attrs[] = {
@@ -4409,6 +4425,12 @@
&deactivate_to_tail_attr.attr,
&deactivate_remote_frees_attr.attr,
&order_fallback_attr.attr,
+ &shrink_calls_attr.attr,
+ &shrink_attempt_defrag_attr.attr,
+ &shrink_empty_slab_attr.attr,
+ &shrink_slab_skipped_attr.attr,
+ &shrink_slab_reclaimed_attr.attr,
+ &shrink_object_reclaim_failed_attr.attr,
#endif
NULL
};
--
^ permalink raw reply [flat|nested] 64+ messages in thread
* [patch 09/19] slub: Trigger defragmentation from memory reclaim
2008-05-10 2:21 [patch 00/19] Slab Fragmentation Reduction V13 Christoph Lameter
` (7 preceding siblings ...)
2008-05-10 2:21 ` [patch 08/19] slub/slabinfo: add defrag statistics Christoph Lameter
@ 2008-05-10 2:21 ` Christoph Lameter
2008-05-10 2:21 ` [patch 10/19] buffer heads: Support slab defrag Christoph Lameter
` (10 subsequent siblings)
19 siblings, 0 replies; 64+ messages in thread
From: Christoph Lameter @ 2008-05-10 2:21 UTC (permalink / raw)
To: Pekka Enberg
Cc: akpm, Christoph Lameter, Christoph Lameter, linux-kernel,
linux-fsdevel, Mel Gorman, andi, Rik van Riel, mpm, Dave Chinner
[-- Attachment #1: 0009-SLUB-Trigger-defragmentation-from-memory-reclaim.patch --]
[-- Type: text/plain, Size: 10466 bytes --]
This patch triggers slab defragmentation from memory reclaim. The logical
point for this is after slab shrinking has been performed in vmscan.c. At that
point the fragmentation of a slab has increased because objects were freed via
the LRU lists maintained for various slab caches.
So we call kmem_cache_defrag() from there.
shrink_slab() is called in some contexts to do global shrinking
of slabs and in others to do shrinking for a particular zone. Pass the zone to
shrink_slab() so that it can call kmem_cache_defrag() and restrict
the defragmentation to the node that is under memory pressure.
The callback frequency into slab reclaim can be controlled by a new field
/proc/sys/vm/slab_defrag_limit.
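At run time the threshold can then be inspected and adjusted, e.g. (the value
shown is the default set by this patch, the new value is just an example):

# cat /proc/sys/vm/slab_defrag_limit
1000
# echo 5000 > /proc/sys/vm/slab_defrag_limit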
Reviewed-by: Rik van Riel <riel@redhat.com>
Signed-off-by: Christoph Lameter <clameter@sgi.com>
Signed-off-by: Pekka Enberg <penberg@cs.helsinki.fi>
Signed-off-by: Christoph Lameter <cl@linux-foundation.org>
---
Documentation/sysctl/vm.txt | 12 ++++++++
fs/drop_caches.c | 2 -
include/linux/mm.h | 3 --
include/linux/mmzone.h | 1
include/linux/swap.h | 3 ++
kernel/sysctl.c | 20 +++++++++++++
mm/vmscan.c | 65 +++++++++++++++++++++++++++++++++++++++-----
mm/vmstat.c | 2 +
8 files changed, 98 insertions(+), 10 deletions(-)
Index: linux-2.6/fs/drop_caches.c
===================================================================
--- linux-2.6.orig/fs/drop_caches.c 2008-07-31 12:18:56.000000000 -0500
+++ linux-2.6/fs/drop_caches.c 2008-07-31 12:18:58.000000000 -0500
@@ -58,7 +58,7 @@
int nr_objects;
do {
- nr_objects = shrink_slab(1000, GFP_KERNEL, 1000);
+ nr_objects = shrink_slab(1000, GFP_KERNEL, 1000, NULL);
} while (nr_objects > 10);
}
Index: linux-2.6/include/linux/mm.h
===================================================================
--- linux-2.6.orig/include/linux/mm.h 2008-07-31 12:18:56.000000000 -0500
+++ linux-2.6/include/linux/mm.h 2008-07-31 12:18:58.000000000 -0500
@@ -1283,8 +1283,7 @@
int drop_caches_sysctl_handler(struct ctl_table *, int, struct file *,
void __user *, size_t *, loff_t *);
unsigned long shrink_slab(unsigned long scanned, gfp_t gfp_mask,
- unsigned long lru_pages);
-
+ unsigned long lru_pages, struct zone *z);
#ifndef CONFIG_MMU
#define randomize_va_space 0
#else
Index: linux-2.6/mm/vmscan.c
===================================================================
--- linux-2.6.orig/mm/vmscan.c 2008-07-31 12:18:56.000000000 -0500
+++ linux-2.6/mm/vmscan.c 2008-07-31 12:18:58.000000000 -0500
@@ -150,6 +150,14 @@
EXPORT_SYMBOL(unregister_shrinker);
#define SHRINK_BATCH 128
+
+/*
+ * Trigger a call into slab defrag if the sum of the returns from
+ * shrinkers cross this value.
+ */
+int slab_defrag_limit = 1000;
+int slab_defrag_counter;
+
/*
* Call the shrink functions to age shrinkable caches
*
@@ -167,10 +175,18 @@
* are eligible for the caller's allocation attempt. It is used for balancing
* slab reclaim versus page reclaim.
*
+ * zone is the zone for which we are shrinking the slabs. If the intent
+ * is to do a global shrink then zone may be NULL. Specification of a
+ * zone is currently only used to limit slab defragmentation to a NUMA node.
+ * The performance of shrink_slab would be better (in particular under NUMA)
+ * if it could be targeted as a whole to the zone that is under memory
+ * pressure but the VFS infrastructure does not allow that at the present
+ * time.
+ *
* Returns the number of slab objects which we shrunk.
*/
unsigned long shrink_slab(unsigned long scanned, gfp_t gfp_mask,
- unsigned long lru_pages)
+ unsigned long lru_pages, struct zone *zone)
{
struct shrinker *shrinker;
unsigned long ret = 0;
@@ -227,6 +243,39 @@
shrinker->nr += total_scan;
}
up_read(&shrinker_rwsem);
+
+
+ /* Avoid dirtying cachelines */
+ if (!ret)
+ return 0;
+
+ /*
+ * "ret" doesnt really contain the freed object count. The shrinkers
+ * fake it. Gotta go with what we are getting though.
+ *
+ * Handling of the defrag_counter is also racy. If we get the
+ * wrong counts then we may unnecessarily do a defrag pass or defer
+ * one. "ret" is already faked. So this is just increasing
+ * the already existing fuzziness to get some notion as to when
+ * to initiate slab defrag which will hopefully be okay.
+ */
+ if (zone) {
+ /* balance_pgdat running on a zone so we only scan one node */
+ zone->slab_defrag_counter += ret;
+ if (zone->slab_defrag_counter > slab_defrag_limit &&
+ (gfp_mask & __GFP_FS)) {
+ zone->slab_defrag_counter = 0;
+ kmem_cache_defrag(zone_to_nid(zone));
+ }
+ } else {
+ /* Direct (and thus global) reclaim. Scan all nodes */
+ slab_defrag_counter += ret;
+ if (slab_defrag_counter > slab_defrag_limit &&
+ (gfp_mask & __GFP_FS)) {
+ slab_defrag_counter = 0;
+ kmem_cache_defrag(-1);
+ }
+ }
return ret;
}
@@ -1379,7 +1428,7 @@
* over limit cgroups
*/
if (scan_global_lru(sc)) {
- shrink_slab(sc->nr_scanned, sc->gfp_mask, lru_pages);
+ shrink_slab(sc->nr_scanned, sc->gfp_mask, lru_pages, NULL);
if (reclaim_state) {
nr_reclaimed += reclaim_state->reclaimed_slab;
reclaim_state->reclaimed_slab = 0;
@@ -1606,7 +1655,7 @@
nr_reclaimed += shrink_zone(priority, zone, &sc);
reclaim_state->reclaimed_slab = 0;
nr_slab = shrink_slab(sc.nr_scanned, GFP_KERNEL,
- lru_pages);
+ lru_pages, zone);
nr_reclaimed += reclaim_state->reclaimed_slab;
total_scanned += sc.nr_scanned;
if (zone_is_all_unreclaimable(zone))
@@ -1845,7 +1894,7 @@
/* If slab caches are huge, it's better to hit them first */
while (nr_slab >= lru_pages) {
reclaim_state.reclaimed_slab = 0;
- shrink_slab(nr_pages, sc.gfp_mask, lru_pages);
+ shrink_slab(nr_pages, sc.gfp_mask, lru_pages, NULL);
if (!reclaim_state.reclaimed_slab)
break;
@@ -1883,7 +1932,7 @@
reclaim_state.reclaimed_slab = 0;
shrink_slab(sc.nr_scanned, sc.gfp_mask,
- count_lru_pages());
+ count_lru_pages(), NULL);
ret += reclaim_state.reclaimed_slab;
if (ret >= nr_pages)
goto out;
@@ -1900,7 +1949,7 @@
if (!ret) {
do {
reclaim_state.reclaimed_slab = 0;
- shrink_slab(nr_pages, sc.gfp_mask, count_lru_pages());
+ shrink_slab(nr_pages, sc.gfp_mask, count_lru_pages(), NULL);
ret += reclaim_state.reclaimed_slab;
} while (ret < nr_pages && reclaim_state.reclaimed_slab > 0);
}
@@ -2062,7 +2111,8 @@
* Note that shrink_slab will free memory on all zones and may
* take a long time.
*/
- while (shrink_slab(sc.nr_scanned, gfp_mask, order) &&
+ while (shrink_slab(sc.nr_scanned, gfp_mask, order,
+ zone) &&
zone_page_state(zone, NR_SLAB_RECLAIMABLE) >
slab_reclaimable - nr_pages)
;
Index: linux-2.6/include/linux/mmzone.h
===================================================================
--- linux-2.6.orig/include/linux/mmzone.h 2008-07-31 12:18:56.000000000 -0500
+++ linux-2.6/include/linux/mmzone.h 2008-07-31 12:18:58.000000000 -0500
@@ -256,6 +256,7 @@
unsigned long nr_scan_active;
unsigned long nr_scan_inactive;
unsigned long pages_scanned; /* since last reclaim */
+ unsigned long slab_defrag_counter; /* since last defrag */
unsigned long flags; /* zone flags, see below */
/* Zone statistics */
Index: linux-2.6/include/linux/swap.h
===================================================================
--- linux-2.6.orig/include/linux/swap.h 2008-07-31 12:18:56.000000000 -0500
+++ linux-2.6/include/linux/swap.h 2008-07-31 12:18:58.000000000 -0500
@@ -188,6 +188,9 @@
extern int __isolate_lru_page(struct page *page, int mode);
extern unsigned long shrink_all_memory(unsigned long nr_pages);
extern int vm_swappiness;
+extern int slab_defrag_limit;
+extern int slab_defrag_counter;
+
extern int remove_mapping(struct address_space *mapping, struct page *page);
extern long vm_total_pages;
Index: linux-2.6/kernel/sysctl.c
===================================================================
--- linux-2.6.orig/kernel/sysctl.c 2008-07-31 12:18:56.000000000 -0500
+++ linux-2.6/kernel/sysctl.c 2008-07-31 12:18:58.000000000 -0500
@@ -1071,6 +1071,26 @@
.strategy = &sysctl_intvec,
.extra1 = &zero,
},
+ {
+ .ctl_name = CTL_UNNUMBERED,
+ .procname = "slab_defrag_limit",
+ .data = &slab_defrag_limit,
+ .maxlen = sizeof(slab_defrag_limit),
+ .mode = 0644,
+ .proc_handler = &proc_dointvec,
+ .strategy = &sysctl_intvec,
+ .extra1 = &one_hundred,
+ },
+ {
+ .ctl_name = CTL_UNNUMBERED,
+ .procname = "slab_defrag_count",
+ .data = &slab_defrag_counter,
+ .maxlen = sizeof(slab_defrag_counter),
+ .mode = 0444,
+ .proc_handler = &proc_dointvec,
+ .strategy = &sysctl_intvec,
+ .extra1 = &zero,
+ },
#ifdef HAVE_ARCH_PICK_MMAP_LAYOUT
{
.ctl_name = VM_LEGACY_VA_LAYOUT,
Index: linux-2.6/Documentation/sysctl/vm.txt
===================================================================
--- linux-2.6.orig/Documentation/sysctl/vm.txt 2008-07-31 12:18:56.000000000 -0500
+++ linux-2.6/Documentation/sysctl/vm.txt 2008-07-31 12:18:58.000000000 -0500
@@ -38,6 +38,7 @@
- numa_zonelist_order
- nr_hugepages
- nr_overcommit_hugepages
+- slab_defrag_limit
==============================================================
@@ -347,3 +348,14 @@
nr_hugepages + nr_overcommit_hugepages.
See Documentation/vm/hugetlbpage.txt
+
+==============================================================
+
+slab_defrag_limit
+
+Determines the frequency of calls from reclaim into slab defragmentation.
+Slab defrag reclaims objects from sparsely populated slab pages.
+The default is 1000. Increase if slab defragmentation occurs
+too frequently. Decrease if more slab defragmentation passes
+are needed. The slabinfo tool can report on the frequency of the callbacks.
+
Index: linux-2.6/mm/vmstat.c
===================================================================
--- linux-2.6.orig/mm/vmstat.c 2008-07-31 12:18:56.000000000 -0500
+++ linux-2.6/mm/vmstat.c 2008-07-31 12:18:58.000000000 -0500
@@ -714,10 +714,12 @@
#endif
}
seq_printf(m,
+ "\n slab_defrag_count: %lu"
"\n all_unreclaimable: %u"
"\n prev_priority: %i"
"\n start_pfn: %lu",
- zone_is_all_unreclaimable(zone),
+ zone->slab_defrag_counter,
+ zone_is_all_unreclaimable(zone),
zone->prev_priority,
zone->zone_start_pfn);
seq_putc(m, '\n');
--
^ permalink raw reply [flat|nested] 64+ messages in thread
* [patch 10/19] buffer heads: Support slab defrag
2008-05-10 2:21 [patch 00/19] Slab Fragmentation Reduction V13 Christoph Lameter
` (8 preceding siblings ...)
2008-05-10 2:21 ` [patch 09/19] slub: Trigger defragmentation from memory reclaim Christoph Lameter
@ 2008-05-10 2:21 ` Christoph Lameter
2008-05-10 2:21 ` [patch 11/19] inodes: Support generic defragmentation Christoph Lameter
` (9 subsequent siblings)
19 siblings, 0 replies; 64+ messages in thread
From: Christoph Lameter @ 2008-05-10 2:21 UTC (permalink / raw)
To: Pekka Enberg
Cc: akpm, Christoph Lameter, Christoph Lameter, linux-kernel,
linux-fsdevel, Mel Gorman, andi, Rik van Riel, mpm, Dave Chinner
[-- Attachment #1: 0024-Buffer-heads-Support-slab-defrag.patch --]
[-- Type: text/plain, Size: 3219 bytes --]
Defragmentation support for buffer heads. We convert the references to
buffers to struct page references and try to remove the buffers from
those pages. If the pages are dirty then trigger writeout so that the
buffer heads can be removed later.
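For orientation, the functions added below follow the get()/kick() callback
contract used throughout this series: get() pins the objects handed over in
v[] (clearing entries it cannot handle) while the slab allocator holds off
frees, and kick() runs later, without slab locks held, to try to get rid of
the pinned objects. A minimal sketch of that contract for a hypothetical
cache follows; only the callback signatures and kmem_cache_setup_defrag()
are taken from this series, struct foo and its refcounting are made up:

/* Sketch only: struct foo and foo_cachep are hypothetical. */
struct foo {
	atomic_t refcount;
	/* ... payload ... */
};

static struct kmem_cache *foo_cachep;

static void foo_release(struct foo *f)
{
	if (atomic_dec_and_test(&f->refcount))
		kmem_cache_free(foo_cachep, f);
}

/* get(): pin each object in v[]; NULL out entries that are going away. */
static void *foo_get(struct kmem_cache *s, int nr, void **v)
{
	int i;

	for (i = 0; i < nr; i++) {
		struct foo *f = v[i];

		if (!atomic_inc_not_zero(&f->refcount))
			v[i] = NULL;
	}
	return NULL;	/* no private state to pass on to kick() */
}

/* kick(): slab locks are dropped; drop the references taken in get(). */
static void foo_kick(struct kmem_cache *s, int nr, void **v, void *private)
{
	int i;

	for (i = 0; i < nr; i++)
		if (v[i])
			foo_release(v[i]);
}

/* Registration, analogous to the bh_cachep hookup at the end of this patch:
 *	kmem_cache_setup_defrag(foo_cachep, foo_get, foo_kick);
 */

A real kick() method, such as kick_buffers() below, additionally has to push
the objects toward being freeable (writeback, invalidation) before dropping
the references, since merely releasing a reference does not reclaim anything.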
Reviewed-by: Rik van Riel <riel@redhat.com>
Signed-off-by: Christoph Lameter <clameter@sgi.com>
Signed-off-by: Christoph Lameter <cl@linux-foundation.org>
---
fs/buffer.c | 99 ++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
1 file changed, 99 insertions(+)
Index: linux-2.6/fs/buffer.c
===================================================================
--- linux-2.6.orig/fs/buffer.c 2008-07-31 12:18:56.000000000 -0500
+++ linux-2.6/fs/buffer.c 2008-07-31 12:18:59.000000000 -0500
@@ -3316,6 +3316,104 @@
}
EXPORT_SYMBOL(bh_submit_read);
+/*
+ * Writeback a page to clean the dirty state
+ */
+static void trigger_write(struct page *page)
+{
+ struct address_space *mapping = page_mapping(page);
+ int rc;
+ struct writeback_control wbc = {
+ .sync_mode = WB_SYNC_NONE,
+ .nr_to_write = 1,
+ .range_start = 0,
+ .range_end = LLONG_MAX,
+ .nonblocking = 1,
+ .for_reclaim = 0
+ };
+
+ if (!mapping->a_ops->writepage)
+ /* No write method for the address space */
+ return;
+
+ if (!clear_page_dirty_for_io(page))
+ /* Someone else already triggered a write */
+ return;
+
+ rc = mapping->a_ops->writepage(page, &wbc);
+ if (rc < 0)
+ /* I/O Error writing */
+ return;
+
+ if (rc == AOP_WRITEPAGE_ACTIVATE)
+ unlock_page(page);
+}
+
+/*
+ * Get references on buffers.
+ *
+ * We obtain references on the page that uses the buffer. v[i] will point to
+ * the corresponding page after get_buffers() is through.
+ *
+ * We are safe from the underlying page being removed simply by doing
+ * a get_page_unless_zero. The buffer head removal may race at will.
+ * try_to_free_buffers will later take appropriate locks to remove the
+ * buffers if they are still there.
+ */
+static void *get_buffers(struct kmem_cache *s, int nr, void **v)
+{
+ struct page *page;
+ struct buffer_head *bh;
+ int i, j;
+ int n = 0;
+
+ for (i = 0; i < nr; i++) {
+ bh = v[i];
+ v[i] = NULL;
+
+ page = bh->b_page;
+
+ if (page && PagePrivate(page)) {
+ for (j = 0; j < n; j++)
+ if (page == v[j])
+ continue;
+ }
+
+ if (get_page_unless_zero(page))
+ v[n++] = page;
+ }
+ return NULL;
+}
+
+/*
+ * Despite its name: kick_buffers operates on a list of pointers to
+ * page structs that was set up by get_buffers().
+ */
+static void kick_buffers(struct kmem_cache *s, int nr, void **v,
+ void *private)
+{
+ struct page *page;
+ int i;
+
+ for (i = 0; i < nr; i++) {
+ page = v[i];
+
+ if (!page || PageWriteback(page))
+ continue;
+
+ if (!TestSetPageLocked(page)) {
+ if (PageDirty(page))
+ trigger_write(page);
+ else {
+ if (PagePrivate(page))
+ try_to_free_buffers(page);
+ unlock_page(page);
+ }
+ }
+ put_page(page);
+ }
+}
+
static void
init_buffer_head(void *data)
{
@@ -3334,6 +3432,7 @@
(SLAB_RECLAIM_ACCOUNT|SLAB_PANIC|
SLAB_MEM_SPREAD),
init_buffer_head);
+ kmem_cache_setup_defrag(bh_cachep, get_buffers, kick_buffers);
/*
* Limit the bh occupancy to 10% of ZONE_NORMAL
--
^ permalink raw reply [flat|nested] 64+ messages in thread
* [patch 11/19] inodes: Support generic defragmentation
2008-05-10 2:21 [patch 00/19] Slab Fragmentation Reduction V13 Christoph Lameter
` (9 preceding siblings ...)
2008-05-10 2:21 ` [patch 10/19] buffer heads: Support slab defrag Christoph Lameter
@ 2008-05-10 2:21 ` Christoph Lameter
2008-05-10 2:21 ` [patch 12/19] Filesystem: Ext2 filesystem defrag Christoph Lameter
` (8 subsequent siblings)
19 siblings, 0 replies; 64+ messages in thread
From: Christoph Lameter @ 2008-05-10 2:21 UTC (permalink / raw)
To: Pekka Enberg
Cc: akpm, Alexander Viro, Christoph Hellwig, Christoph Lameter,
Christoph Lameter, linux-kernel, linux-fsdevel, Mel Gorman, andi,
Rik van Riel, mpm, Dave Chinner
[-- Attachment #1: 0025-inodes-Support-generic-defragmentation.patch --]
[-- Type: text/plain, Size: 5124 bytes --]
This implements the ability to remove inodes in a particular slab
from inode caches. In order to remove an inode we may have to write out
the pages of the inode and the inode itself, and remove the dentries referring
to the inode.
Provide generic functionality that can be used by filesystems that have
their own inode caches to also tie into the defragmentation functions
that are made available here.
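To make the intended use concrete, here is a sketch of how a filesystem with
its own inode cache would tie in, mirroring the ext2/ext3/ext4 hookups later
in this series. Everything named foofs_* is made up for illustration;
fs_get_inodes(), kick_inodes() and kmem_cache_setup_defrag() are the
interfaces provided by this series:

/* Hypothetical filesystem inode with the VFS inode embedded in it. */
struct foofs_inode_info {
	/* ... fs private fields ... */
	struct inode vfs_inode;
};

static struct kmem_cache *foofs_inode_cachep;

static void foofs_init_once(void *obj)
{
	struct foofs_inode_info *ei = obj;

	inode_init_once(&ei->vfs_inode);
}

static void *foofs_get_inodes(struct kmem_cache *s, int nr, void **v)
{
	/* v[] points at foofs_inode_info objects; shift to the vfs_inode */
	return fs_get_inodes(s, nr, v,
			offsetof(struct foofs_inode_info, vfs_inode));
}

static int __init foofs_init_inodecache(void)
{
	foofs_inode_cachep = kmem_cache_create("foofs_inode_cache",
				sizeof(struct foofs_inode_info), 0,
				SLAB_RECLAIM_ACCOUNT | SLAB_MEM_SPREAD,
				foofs_init_once);
	if (!foofs_inode_cachep)
		return -ENOMEM;

	kmem_cache_setup_defrag(foofs_inode_cachep,
				foofs_get_inodes, kick_inodes);
	return 0;
}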
Cc: Alexander Viro <viro@ftp.linux.org.uk>
Cc: Christoph Hellwig <hch@infradead.org>
Reviewed-by: Rik van Riel <riel@redhat.com>
Signed-off-by: Christoph Lameter <clameter@sgi.com>
Signed-off-by: Christoph Lameter <cl@linux-foundation.org>
---
fs/inode.c | 123 +++++++++++++++++++++++++++++++++++++++++++++++++++++
include/linux/fs.h | 6 ++
2 files changed, 129 insertions(+)
Index: linux-2.6/fs/inode.c
===================================================================
--- linux-2.6.orig/fs/inode.c 2008-07-31 12:18:12.000000000 -0500
+++ linux-2.6/fs/inode.c 2008-07-31 12:18:15.000000000 -0500
@@ -1363,6 +1363,128 @@
__setup("ihash_entries=", set_ihash_entries);
/*
+ * Obtain a refcount on a list of struct inodes pointed to by v. If the
+ * inode is in the process of being freed then zap the v[] entry so that
+ * we skip the freeing attempts later.
+ *
+ * This is a generic function for the ->get slab defrag callback.
+ */
+void *get_inodes(struct kmem_cache *s, int nr, void **v)
+{
+ int i;
+
+ spin_lock(&inode_lock);
+ for (i = 0; i < nr; i++) {
+ struct inode *inode = v[i];
+
+ if (inode->i_state & (I_FREEING|I_CLEAR|I_WILL_FREE))
+ v[i] = NULL;
+ else
+ __iget(inode);
+ }
+ spin_unlock(&inode_lock);
+ return NULL;
+}
+EXPORT_SYMBOL(get_inodes);
+
+/*
+ * Function for filesystems that embed struct inode into their own
+ * fs inode. The offset is the offset of the struct inode in the fs inode.
+ *
+ * The function adds to the pointers in v[] in order to make them point to
+ * struct inode. Then get_inodes() is used to get the refcount.
+ * The converted v[] pointers can then also be passed to the kick() callback
+ * without further processing.
+ */
+void *fs_get_inodes(struct kmem_cache *s, int nr, void **v,
+ unsigned long offset)
+{
+ int i;
+
+ for (i = 0; i < nr; i++)
+ v[i] += offset;
+
+ return get_inodes(s, nr, v);
+}
+EXPORT_SYMBOL(fs_get_inodes);
+
+/*
+ * Generic callback function for slab defrag ->kick methods. Takes the
+ * array with inodes where we obtained refcounts using fs_get_inodes()
+ * or get_inodes() and tries to free them.
+ */
+void kick_inodes(struct kmem_cache *s, int nr, void **v, void *private)
+{
+ struct inode *inode;
+ int i;
+ int abort = 0;
+ LIST_HEAD(freeable);
+ int active;
+
+ for (i = 0; i < nr; i++) {
+ inode = v[i];
+ if (!inode)
+ continue;
+
+ if (inode_has_buffers(inode) || inode->i_data.nrpages) {
+ if (remove_inode_buffers(inode))
+ /*
+ * Should we really be doing this? Or
+ * limit the writeback here to only a few pages?
+ *
+ * Possibly an expensive operation but we
+ * cannot reclaim the inode if the pages
+ * are still present.
+ */
+ invalidate_mapping_pages(&inode->i_data,
+ 0, -1);
+ }
+
+ /* Invalidate children and dentry */
+ if (S_ISDIR(inode->i_mode)) {
+ struct dentry *d = d_find_alias(inode);
+
+ if (d) {
+ d_invalidate(d);
+ dput(d);
+ }
+ }
+
+ if (inode->i_state & I_DIRTY)
+ write_inode_now(inode, 1);
+
+ d_prune_aliases(inode);
+ }
+
+ mutex_lock(&iprune_mutex);
+ for (i = 0; i < nr; i++) {
+ inode = v[i];
+
+ if (!inode)
+ /* inode is already being freed */
+ continue;
+
+ active = inode->i_sb->s_flags & MS_ACTIVE;
+ iput(inode);
+ if (abort || !active)
+ continue;
+
+ spin_lock(&inode_lock);
+ abort = !can_unuse(inode);
+
+ if (!abort) {
+ list_move(&inode->i_list, &freeable);
+ inode->i_state |= I_FREEING;
+ inodes_stat.nr_unused--;
+ }
+ spin_unlock(&inode_lock);
+ }
+ dispose_list(&freeable);
+ mutex_unlock(&iprune_mutex);
+}
+EXPORT_SYMBOL(kick_inodes);
+
+/*
* Initialize the waitqueues and inode hash table.
*/
void __init inode_init_early(void)
@@ -1401,6 +1523,7 @@
SLAB_MEM_SPREAD),
init_once);
register_shrinker(&icache_shrinker);
+ kmem_cache_setup_defrag(inode_cachep, get_inodes, kick_inodes);
/* Hash may have been set up in inode_init_early */
if (!hashdist)
Index: linux-2.6/include/linux/fs.h
===================================================================
--- linux-2.6.orig/include/linux/fs.h 2008-07-31 12:18:12.000000000 -0500
+++ linux-2.6/include/linux/fs.h 2008-07-31 12:18:15.000000000 -0500
@@ -1844,6 +1844,12 @@
__insert_inode_hash(inode, inode->i_ino);
}
+/* Helper functions for inode defragmentation support in filesystems */
+extern void kick_inodes(struct kmem_cache *, int, void **, void *);
+extern void *get_inodes(struct kmem_cache *, int nr, void **);
+extern void *fs_get_inodes(struct kmem_cache *, int nr, void **,
+ unsigned long offset);
+
extern struct file * get_empty_filp(void);
extern void file_move(struct file *f, struct list_head *list);
extern void file_kill(struct file *f);
--
^ permalink raw reply [flat|nested] 64+ messages in thread
* [patch 12/19] Filesystem: Ext2 filesystem defrag
2008-05-10 2:21 [patch 00/19] Slab Fragmentation Reduction V13 Christoph Lameter
` (10 preceding siblings ...)
2008-05-10 2:21 ` [patch 11/19] inodes: Support generic defragmentation Christoph Lameter
@ 2008-05-10 2:21 ` Christoph Lameter
2008-05-10 2:21 ` [patch 13/19] Filesystem: Ext3 " Christoph Lameter
` (7 subsequent siblings)
19 siblings, 0 replies; 64+ messages in thread
From: Christoph Lameter @ 2008-05-10 2:21 UTC (permalink / raw)
To: Pekka Enberg
Cc: akpm, Christoph Lameter, Christoph Lameter, linux-kernel,
linux-fsdevel, Mel Gorman, andi, Rik van Riel, mpm, Dave Chinner
[-- Attachment #1: ext2-defrag --]
[-- Type: text/plain, Size: 1035 bytes --]
Support defragmentation for ext2 filesystem inodes
Reviewed-by: Rik van Riel <riel@redhat.com>
Signed-off-by: Christoph Lameter <clameter@sgi.com>
Signed-off-by: Christoph Lameter <cl@linux-foundation.org>
---
fs/ext2/super.c | 9 +++++++++
1 file changed, 9 insertions(+)
Index: linux-2.6/fs/ext2/super.c
===================================================================
--- linux-2.6.orig/fs/ext2/super.c 2008-07-31 12:18:12.000000000 -0500
+++ linux-2.6/fs/ext2/super.c 2008-07-31 12:18:15.000000000 -0500
@@ -171,6 +171,12 @@
inode_init_once(&ei->vfs_inode);
}
+static void *ext2_get_inodes(struct kmem_cache *s, int nr, void **v)
+{
+ return fs_get_inodes(s, nr, v,
+ offsetof(struct ext2_inode_info, vfs_inode));
+}
+
static int init_inodecache(void)
{
ext2_inode_cachep = kmem_cache_create("ext2_inode_cache",
@@ -180,6 +186,9 @@
init_once);
if (ext2_inode_cachep == NULL)
return -ENOMEM;
+
+ kmem_cache_setup_defrag(ext2_inode_cachep,
+ ext2_get_inodes, kick_inodes);
return 0;
}
--
^ permalink raw reply [flat|nested] 64+ messages in thread
* [patch 13/19] Filesystem: Ext3 filesystem defrag
2008-05-10 2:21 [patch 00/19] Slab Fragmentation Reduction V13 Christoph Lameter
` (11 preceding siblings ...)
2008-05-10 2:21 ` [patch 12/19] Filesystem: Ext2 filesystem defrag Christoph Lameter
@ 2008-05-10 2:21 ` Christoph Lameter
2008-05-10 2:21 ` [patch 14/19] Filesystem: Ext4 " Christoph Lameter
` (6 subsequent siblings)
19 siblings, 0 replies; 64+ messages in thread
From: Christoph Lameter @ 2008-05-10 2:21 UTC (permalink / raw)
To: Pekka Enberg
Cc: akpm, Christoph Lameter, Christoph Lameter, linux-kernel,
linux-fsdevel, Mel Gorman, andi, Rik van Riel, mpm, Dave Chinner
[-- Attachment #1: ext3-defrag --]
[-- Type: text/plain, Size: 1032 bytes --]
Support defragmentation for ext3 filesystem inodes
Reviewed-by: Rik van Riel <riel@redhat.com>
Signed-off-by: Christoph Lameter <clameter@sgi.com>
Signed-off-by: Christoph Lameter <cl@linux-foundation.org>
---
fs/ext3/super.c | 8 ++++++++
1 file changed, 8 insertions(+)
Index: linux-2.6/fs/ext3/super.c
===================================================================
--- linux-2.6.orig/fs/ext3/super.c 2008-07-31 12:18:12.000000000 -0500
+++ linux-2.6/fs/ext3/super.c 2008-07-31 12:18:15.000000000 -0500
@@ -484,6 +484,12 @@
inode_init_once(&ei->vfs_inode);
}
+static void *ext3_get_inodes(struct kmem_cache *s, int nr, void **v)
+{
+ return fs_get_inodes(s, nr, v,
+ offsetof(struct ext3_inode_info, vfs_inode));
+}
+
static int init_inodecache(void)
{
ext3_inode_cachep = kmem_cache_create("ext3_inode_cache",
@@ -493,6 +499,8 @@
init_once);
if (ext3_inode_cachep == NULL)
return -ENOMEM;
+ kmem_cache_setup_defrag(ext3_inode_cachep,
+ ext3_get_inodes, kick_inodes);
return 0;
}
--
^ permalink raw reply [flat|nested] 64+ messages in thread
* [patch 14/19] Filesystem: Ext4 filesystem defrag
2008-05-10 2:21 [patch 00/19] Slab Fragmentation Reduction V13 Christoph Lameter
` (12 preceding siblings ...)
2008-05-10 2:21 ` [patch 13/19] Filesystem: Ext3 " Christoph Lameter
@ 2008-05-10 2:21 ` Christoph Lameter
2008-08-03 1:54 ` Theodore Tso
2008-05-10 2:21 ` [patch 15/19] Filesystem: XFS slab defragmentation Christoph Lameter
` (5 subsequent siblings)
19 siblings, 1 reply; 64+ messages in thread
From: Christoph Lameter @ 2008-05-10 2:21 UTC (permalink / raw)
To: Pekka Enberg
Cc: akpm, Christoph Lameter, Christoph Lameter, linux-kernel,
linux-fsdevel, Mel Gorman, andi, Rik van Riel, mpm, Dave Chinner
[-- Attachment #1: ext4-defrag --]
[-- Type: text/plain, Size: 1032 bytes --]
Support defragmentation for extX filesystem inodes
Reviewed-by: Rik van Riel <riel@redhat.com>
Signed-off-by: Christoph Lameter <clameter@sgi.com>
Signed-off-by: Christoph Lameter <cl@linux-foundation.org>
---
fs/ext4/super.c | 8 ++++++++
1 file changed, 8 insertions(+)
Index: linux-2.6/fs/ext4/super.c
===================================================================
--- linux-2.6.orig/fs/ext4/super.c 2008-07-31 12:18:12.000000000 -0500
+++ linux-2.6/fs/ext4/super.c 2008-07-31 12:18:15.000000000 -0500
@@ -607,6 +607,12 @@
inode_init_once(&ei->vfs_inode);
}
+static void *ext4_get_inodes(struct kmem_cache *s, int nr, void **v)
+{
+ return fs_get_inodes(s, nr, v,
+ offsetof(struct ext4_inode_info, vfs_inode));
+}
+
static int init_inodecache(void)
{
ext4_inode_cachep = kmem_cache_create("ext4_inode_cache",
@@ -616,6 +622,8 @@
init_once);
if (ext4_inode_cachep == NULL)
return -ENOMEM;
+ kmem_cache_setup_defrag(ext4_inode_cachep,
+ ext4_get_inodes, kick_inodes);
return 0;
}
--
^ permalink raw reply [flat|nested] 64+ messages in thread
* [patch 15/19] Filesystem: XFS slab defragmentation
2008-05-10 2:21 [patch 00/19] Slab Fragmentation Reduction V13 Christoph Lameter
` (13 preceding siblings ...)
2008-05-10 2:21 ` [patch 14/19] Filesystem: Ext4 " Christoph Lameter
@ 2008-05-10 2:21 ` Christoph Lameter
2008-08-03 1:42 ` Dave Chinner
2008-05-10 2:21 ` [patch 16/19] Filesystem: /proc filesystem support for slab defrag Christoph Lameter
` (4 subsequent siblings)
19 siblings, 1 reply; 64+ messages in thread
From: Christoph Lameter @ 2008-05-10 2:21 UTC (permalink / raw)
To: Pekka Enberg
Cc: akpm, Christoph Lameter, Christoph Lameter, linux-kernel,
linux-fsdevel, Mel Gorman, andi, Rik van Riel, mpm, Dave Chinner
[-- Attachment #1: 0027-FS-XFS-slab-defragmentation.patch --]
[-- Type: text/plain, Size: 877 bytes --]
Support inode defragmentation for xfs
Reviewed-by: Rik van Riel <riel@redhat.com>
Signed-off-by: Christoph Lameter <clameter@sgi.com>
Signed-off-by: Christoph Lameter <cl@linux-foundation.org>
---
fs/xfs/linux-2.6/xfs_super.c | 1 +
1 files changed, 1 insertions(+), 0 deletions(-)
Index: linux-2.6/fs/xfs/linux-2.6/xfs_super.c
===================================================================
--- linux-2.6.orig/fs/xfs/linux-2.6/xfs_super.c 2008-07-31 12:18:12.000000000 -0500
+++ linux-2.6/fs/xfs/linux-2.6/xfs_super.c 2008-07-31 12:18:15.000000000 -0500
@@ -861,6 +861,7 @@
xfs_ioend_zone = kmem_zone_init(sizeof(xfs_ioend_t), "xfs_ioend");
if (!xfs_ioend_zone)
goto out_destroy_vnode_zone;
+ kmem_cache_setup_defrag(xfs_vnode_zone, get_inodes, kick_inodes);
xfs_ioend_pool = mempool_create_slab_pool(4 * MAX_BUF_PER_PAGE,
xfs_ioend_zone);
--
^ permalink raw reply [flat|nested] 64+ messages in thread
* [patch 16/19] Filesystem: /proc filesystem support for slab defrag
2008-05-10 2:21 [patch 00/19] Slab Fragmentation Reduction V13 Christoph Lameter
` (14 preceding siblings ...)
2008-05-10 2:21 ` [patch 15/19] Filesystem: XFS slab defragmentation Christoph Lameter
@ 2008-05-10 2:21 ` Christoph Lameter
2008-05-10 2:21 ` [patch 17/19] Filesystem: Slab defrag: Reiserfs support Christoph Lameter
` (3 subsequent siblings)
19 siblings, 0 replies; 64+ messages in thread
From: Christoph Lameter @ 2008-05-10 2:21 UTC (permalink / raw)
To: Pekka Enberg
Cc: akpm, Alexey Dobriyan, Christoph Lameter, Christoph Lameter,
linux-kernel, linux-fsdevel, Mel Gorman, andi, Rik van Riel, mpm,
Dave Chinner
[-- Attachment #1: 0028-FS-Proc-filesystem-support-for-slab-defrag.patch --]
[-- Type: text/plain, Size: 1096 bytes --]
Support procfs inode defragmentation
Cc: Alexey Dobriyan <adobriyan@sw.ru>
Reviewed-by: Rik van Riel <riel@redhat.com>
Signed-off-by: Christoph Lameter <clameter@sgi.com>
Signed-off-by: Christoph Lameter <cl@linux-foundation.org>
---
fs/proc/inode.c | 8 ++++++++
1 files changed, 8 insertions(+), 0 deletions(-)
Index: linux-2.6/fs/proc/inode.c
===================================================================
--- linux-2.6.orig/fs/proc/inode.c 2008-07-31 12:18:12.000000000 -0500
+++ linux-2.6/fs/proc/inode.c 2008-07-31 12:18:15.000000000 -0500
@@ -106,6 +106,12 @@
inode_init_once(&ei->vfs_inode);
}
+static void *proc_get_inodes(struct kmem_cache *s, int nr, void **v)
+{
+ return fs_get_inodes(s, nr, v,
+ offsetof(struct proc_inode, vfs_inode));
+};
+
int __init proc_init_inodecache(void)
{
proc_inode_cachep = kmem_cache_create("proc_inode_cache",
@@ -113,6 +119,8 @@
0, (SLAB_RECLAIM_ACCOUNT|
SLAB_MEM_SPREAD|SLAB_PANIC),
init_once);
+ kmem_cache_setup_defrag(proc_inode_cachep,
+ proc_get_inodes, kick_inodes);
return 0;
}
--
^ permalink raw reply [flat|nested] 64+ messages in thread
* [patch 17/19] Filesystem: Slab defrag: Reiserfs support
2008-05-10 2:21 [patch 00/19] Slab Fragmentation Reduction V13 Christoph Lameter
` (15 preceding siblings ...)
2008-05-10 2:21 ` [patch 16/19] Filesystem: /proc filesystem support for slab defrag Christoph Lameter
@ 2008-05-10 2:21 ` Christoph Lameter
2008-05-10 2:21 ` [patch 18/19] dentries: Add constructor Christoph Lameter
` (2 subsequent siblings)
19 siblings, 0 replies; 64+ messages in thread
From: Christoph Lameter @ 2008-05-10 2:21 UTC (permalink / raw)
To: Pekka Enberg
Cc: akpm, Christoph Lameter, Christoph Lameter, linux-kernel,
linux-fsdevel, Mel Gorman, andi, Rik van Riel, mpm, Dave Chinner
[-- Attachment #1: 0029-FS-Slab-defrag-Reiserfs-support.patch --]
[-- Type: text/plain, Size: 1073 bytes --]
Slab defragmentation: Support reiserfs inode defragmentation.
Reviewed-by: Rik van Riel <riel@redhat.com>
Signed-off-by: Christoph Lameter <clameter@sgi.com>
Signed-off-by: Christoph Lameter <cl@linux-foundation.org>
---
fs/reiserfs/super.c | 8 ++++++++
1 files changed, 8 insertions(+), 0 deletions(-)
Index: linux-2.6/fs/reiserfs/super.c
===================================================================
--- linux-2.6.orig/fs/reiserfs/super.c 2008-07-31 12:18:12.000000000 -0500
+++ linux-2.6/fs/reiserfs/super.c 2008-07-31 12:18:15.000000000 -0500
@@ -533,6 +533,12 @@
#endif
}
+static void *reiserfs_get_inodes(struct kmem_cache *s, int nr, void **v)
+{
+ return fs_get_inodes(s, nr, v,
+ offsetof(struct reiserfs_inode_info, vfs_inode));
+}
+
static int init_inodecache(void)
{
reiserfs_inode_cachep = kmem_cache_create("reiser_inode_cache",
@@ -543,6 +549,8 @@
init_once);
if (reiserfs_inode_cachep == NULL)
return -ENOMEM;
+ kmem_cache_setup_defrag(reiserfs_inode_cachep,
+ reiserfs_get_inodes, kick_inodes);
return 0;
}
--
^ permalink raw reply [flat|nested] 64+ messages in thread
* [patch 18/19] dentries: Add constructor
2008-05-10 2:21 [patch 00/19] Slab Fragmentation Reduction V13 Christoph Lameter
` (16 preceding siblings ...)
2008-05-10 2:21 ` [patch 17/19] Filesystem: Slab defrag: Reiserfs support Christoph Lameter
@ 2008-05-10 2:21 ` Christoph Lameter
2008-05-10 2:21 ` [patch 19/19] dentries: dentry defragmentation Christoph Lameter
2008-08-03 1:58 ` No, really, stop trying to delete slab until you've finished making slub perform as well Matthew Wilcox
19 siblings, 0 replies; 64+ messages in thread
From: Christoph Lameter @ 2008-05-10 2:21 UTC (permalink / raw)
To: Pekka Enberg
Cc: akpm, Alexander Viro, Christoph Hellwig, Christoph Lameter,
Christoph Lameter, linux-kernel, linux-fsdevel, Mel Gorman, andi,
Rik van Riel, mpm, Dave Chinner
[-- Attachment #1: 0031-dentries-Add-constructor.patch --]
[-- Type: text/plain, Size: 2156 bytes --]
In order to support defragmentation on the dentry cache we need to have
a determined object state at all times. Without a constructor the object
would have a random state after allocation.
So provide a constructor.
Cc: Alexander Viro <viro@ftp.linux.org.uk>
Cc: Christoph Hellwig <hch@infradead.org>
Reviewed-by: Rik van Riel <riel@redhat.com>
Signed-off-by: Christoph Lameter <clameter@sgi.com>
Signed-off-by: Christoph Lameter <cl@linux-foundation.org>
---
fs/dcache.c | 26 ++++++++++++++------------
1 file changed, 14 insertions(+), 12 deletions(-)
Index: linux-2.6/fs/dcache.c
===================================================================
--- linux-2.6.orig/fs/dcache.c 2008-07-31 12:18:11.000000000 -0500
+++ linux-2.6/fs/dcache.c 2008-07-31 12:18:27.000000000 -0500
@@ -899,6 +899,16 @@
.seeks = DEFAULT_SEEKS,
};
+static void dcache_ctor(void *p)
+{
+ struct dentry *dentry = p;
+
+ spin_lock_init(&dentry->d_lock);
+ dentry->d_inode = NULL;
+ INIT_LIST_HEAD(&dentry->d_lru);
+ INIT_LIST_HEAD(&dentry->d_alias);
+}
+
/**
* d_alloc - allocate a dcache entry
* @parent: parent of entry to allocate
@@ -936,8 +946,6 @@
atomic_set(&dentry->d_count, 1);
dentry->d_flags = DCACHE_UNHASHED;
- spin_lock_init(&dentry->d_lock);
- dentry->d_inode = NULL;
dentry->d_parent = NULL;
dentry->d_sb = NULL;
dentry->d_op = NULL;
@@ -947,9 +955,7 @@
dentry->d_cookie = NULL;
#endif
INIT_HLIST_NODE(&dentry->d_hash);
- INIT_LIST_HEAD(&dentry->d_lru);
INIT_LIST_HEAD(&dentry->d_subdirs);
- INIT_LIST_HEAD(&dentry->d_alias);
if (parent) {
dentry->d_parent = dget(parent);
@@ -2174,14 +2180,10 @@
{
int loop;
- /*
- * A constructor could be added for stable state like the lists,
- * but it is probably not worth it because of the cache nature
- * of the dcache.
- */
- dentry_cache = KMEM_CACHE(dentry,
- SLAB_RECLAIM_ACCOUNT|SLAB_PANIC|SLAB_MEM_SPREAD);
-
+ dentry_cache = kmem_cache_create("dentry_cache", sizeof(struct dentry),
+ 0, SLAB_RECLAIM_ACCOUNT|SLAB_PANIC|SLAB_MEM_SPREAD,
+ dcache_ctor);
+
register_shrinker(&dcache_shrinker);
/* Hash may have been set up in dcache_init_early */
--
^ permalink raw reply [flat|nested] 64+ messages in thread
* [patch 19/19] dentries: dentry defragmentation
2008-05-10 2:21 [patch 00/19] Slab Fragmentation Reduction V13 Christoph Lameter
` (17 preceding siblings ...)
2008-05-10 2:21 ` [patch 18/19] dentries: Add constructor Christoph Lameter
@ 2008-05-10 2:21 ` Christoph Lameter
2008-08-03 1:58 ` No, really, stop trying to delete slab until you've finished making slub perform as well Matthew Wilcox
19 siblings, 0 replies; 64+ messages in thread
From: Christoph Lameter @ 2008-05-10 2:21 UTC (permalink / raw)
To: Pekka Enberg
Cc: akpm, Alexander Viro, Christoph Hellwig, Christoph Lameter,
Christoph Lameter, linux-kernel, linux-fsdevel, Mel Gorman, andi,
Rik van Riel, mpm, Dave Chinner
[-- Attachment #1: 0032-dentries-dentry-defragmentation.patch --]
[-- Type: text/plain, Size: 4092 bytes --]
The dentry pruning for unused entries works in a straightforward way. It
could be made more aggressive if one would actually move dentries instead
of just reclaiming them.
Cc: Alexander Viro <viro@ftp.linux.org.uk>
Cc: Christoph Hellwig <hch@infradead.org>
Reviewed-by: Rik van Riel <riel@redhat.com>
Signed-off-by: Christoph Lameter <clameter@sgi.com>
Signed-off-by: Christoph Lameter <cl@linux-foundation.org>
---
fs/dcache.c | 101 +++++++++++++++++++++++++++++++++++++++++++++++++++++++++++-
1 file changed, 100 insertions(+), 1 deletion(-)
Index: linux-2.6/fs/dcache.c
===================================================================
--- linux-2.6.orig/fs/dcache.c 2008-07-31 12:18:15.000000000 -0500
+++ linux-2.6/fs/dcache.c 2008-07-31 12:18:15.000000000 -0500
@@ -32,6 +32,7 @@
#include <linux/seqlock.h>
#include <linux/swap.h>
#include <linux/bootmem.h>
+#include <linux/backing-dev.h>
#include "internal.h"
@@ -172,7 +173,10 @@
list_del(&dentry->d_u.d_child);
dentry_stat.nr_dentry--; /* For d_free, below */
- /*drops the locks, at that point nobody can reach this dentry */
+ /*
+ * drops the locks, at that point nobody (aside from defrag)
+ * can reach this dentry
+ */
dentry_iput(dentry);
parent = dentry->d_parent;
d_free(dentry);
@@ -2176,6 +2180,100 @@
INIT_HLIST_HEAD(&dentry_hashtable[loop]);
}
+/*
+ * The slab allocator is holding off frees. We can safely examine
+ * the object without the danger of it vanishing from under us.
+ */
+static void *get_dentries(struct kmem_cache *s, int nr, void **v)
+{
+ struct dentry *dentry;
+ int i;
+
+ spin_lock(&dcache_lock);
+ for (i = 0; i < nr; i++) {
+ dentry = v[i];
+
+ /*
+ * Three sorts of dentries cannot be reclaimed:
+ *
+ * 1. dentries that are in the process of being allocated
+ * or being freed. In that case the dentry is neither
+ * on the LRU nor hashed.
+ *
+ * 2. Fake hashed entries as used for anonymous dentries
+ * and pipe I/O. The fake hashed entries have d_flags
+ * set to indicate a hashed entry. However, the
+ * d_hash field indicates that the entry is not hashed.
+ *
+ * 3. dentries that have a backing store that is not
+ * writable. This is true for tmpfs and other in
+ * memory filesystems. Removing dentries from them
+ * would lose dentries for good.
+ */
+ if ((d_unhashed(dentry) && list_empty(&dentry->d_lru)) ||
+ (!d_unhashed(dentry) && hlist_unhashed(&dentry->d_hash)) ||
+ (dentry->d_inode &&
+ !mapping_cap_writeback_dirty(dentry->d_inode->i_mapping)))
+ /* Ignore this dentry */
+ v[i] = NULL;
+ else
+ /* dget_locked will remove the dentry from the LRU */
+ dget_locked(dentry);
+ }
+ spin_unlock(&dcache_lock);
+ return NULL;
+}
+
+/*
+ * Slab has dropped all the locks. Get rid of the refcount obtained
+ * earlier and also free the object.
+ */
+static void kick_dentries(struct kmem_cache *s,
+ int nr, void **v, void *private)
+{
+ struct dentry *dentry;
+ int i;
+
+ /*
+ * First invalidate the dentries without holding the dcache lock
+ */
+ for (i = 0; i < nr; i++) {
+ dentry = v[i];
+
+ if (dentry)
+ d_invalidate(dentry);
+ }
+
+ /*
+ * If we are the last one holding a reference then the dentries can
+ * be freed. We need the dcache_lock.
+ */
+ spin_lock(&dcache_lock);
+ for (i = 0; i < nr; i++) {
+ dentry = v[i];
+ if (!dentry)
+ continue;
+
+ spin_lock(&dentry->d_lock);
+ if (atomic_read(&dentry->d_count) > 1) {
+ spin_unlock(&dentry->d_lock);
+ spin_unlock(&dcache_lock);
+ dput(dentry);
+ spin_lock(&dcache_lock);
+ continue;
+ }
+
+ prune_one_dentry(dentry);
+ }
+ spin_unlock(&dcache_lock);
+
+ /*
+ * dentries are freed using RCU so we need to wait until RCU
+ * operations are complete.
+ */
+ synchronize_rcu();
+}
+
static void __init dcache_init(void)
{
int loop;
@@ -2185,6 +2283,7 @@
dcache_ctor);
register_shrinker(&dcache_shrinker);
+ kmem_cache_setup_defrag(dentry_cache, get_dentries, kick_dentries);
/* Hash may have been set up in dcache_init_early */
if (!hashdist)
--
^ permalink raw reply [flat|nested] 64+ messages in thread
* Re: [patch 15/19] Filesystem: XFS slab defragmentation
2008-05-10 2:21 ` [patch 15/19] Filesystem: XFS slab defragmentation Christoph Lameter
@ 2008-08-03 1:42 ` Dave Chinner
2008-08-04 13:36 ` Christoph Lameter
0 siblings, 1 reply; 64+ messages in thread
From: Dave Chinner @ 2008-08-03 1:42 UTC (permalink / raw)
To: Christoph Lameter
Cc: Pekka Enberg, akpm, Christoph Lameter, linux-kernel,
linux-fsdevel, Mel Gorman, andi, Rik van Riel, mpm
On Fri, May 09, 2008 at 07:21:16PM -0700, Christoph Lameter wrote:
> Support inode defragmentation for xfs
>
> Reviewed-by: Rik van Riel <riel@redhat.com>
> Signed-off-by: Christoph Lameter <clameter@sgi.com>
> Signed-off-by: Christoph Lameter <cl@linux-foundation.org>
>
> ---
> fs/xfs/linux-2.6/xfs_super.c | 1 +
> 1 files changed, 1 insertions(+), 0 deletions(-)
>
> Index: linux-2.6/fs/xfs/linux-2.6/xfs_super.c
> ===================================================================
> --- linux-2.6.orig/fs/xfs/linux-2.6/xfs_super.c 2008-07-31 12:18:12.000000000 -0500
> +++ linux-2.6/fs/xfs/linux-2.6/xfs_super.c 2008-07-31 12:18:15.000000000 -0500
> @@ -861,6 +861,7 @@
> xfs_ioend_zone = kmem_zone_init(sizeof(xfs_ioend_t), "xfs_ioend");
> if (!xfs_ioend_zone)
> goto out_destroy_vnode_zone;
> + kmem_cache_setup_defrag(xfs_vnode_zone, get_inodes, kick_inodes);
>
> xfs_ioend_pool = mempool_create_slab_pool(4 * MAX_BUF_PER_PAGE,
> xfs_ioend_zone);
I think that hunk is mis-applied. You're configuring the
xfs_vnode_zone defrag after allocating the xfs_ioend_zone. This
should be a few lines higher up, right?
Cheers,
Dave.
--
Dave Chinner
david@fromorbit.com
^ permalink raw reply [flat|nested] 64+ messages in thread
* Re: [patch 14/19] Filesystem: Ext4 filesystem defrag
2008-05-10 2:21 ` [patch 14/19] Filesystem: Ext4 " Christoph Lameter
@ 2008-08-03 1:54 ` Theodore Tso
2008-08-13 7:26 ` Pekka Enberg
0 siblings, 1 reply; 64+ messages in thread
From: Theodore Tso @ 2008-08-03 1:54 UTC (permalink / raw)
To: Christoph Lameter
Cc: Pekka Enberg, akpm, Christoph Lameter, linux-kernel,
linux-fsdevel, Mel Gorman, andi, Rik van Riel, mpm, Dave Chinner
On Fri, May 09, 2008 at 07:21:15PM -0700, Christoph Lameter wrote:
> Support defragmentation for extX filesystem inodes
You forgot to change "extX" to "ext4". :-)
> Reviewed-by: Rik van Riel <riel@redhat.com>
> Signed-off-by: Christoph Lameter <clameter@sgi.com>
> Signed-off-by: Christoph Lameter <cl@linux-foundation.org>
Acked-by: "Theodore Ts'o" <tytso@mit.edu>
- Ted
^ permalink raw reply [flat|nested] 64+ messages in thread
* No, really, stop trying to delete slab until you've finished making slub perform as well
2008-05-10 2:21 [patch 00/19] Slab Fragmentation Reduction V13 Christoph Lameter
` (18 preceding siblings ...)
2008-05-10 2:21 ` [patch 19/19] dentries: dentry defragmentation Christoph Lameter
@ 2008-08-03 1:58 ` Matthew Wilcox
2008-08-03 21:25 ` Pekka Enberg
2008-08-04 13:43 ` Christoph Lameter
19 siblings, 2 replies; 64+ messages in thread
From: Matthew Wilcox @ 2008-08-03 1:58 UTC (permalink / raw)
To: Christoph Lameter
Cc: Pekka Enberg, akpm, linux-kernel, linux-fsdevel, Mel Gorman, andi,
Rik van Riel
On Fri, May 09, 2008 at 07:21:01PM -0700, Christoph Lameter wrote:
> - Add a patch that obsoletes SLAB and explains why SLOB does not support
> defrag (Either of those could be theoretically equipped to support
> slab defrag in some way but it seems that Andrew/Linus want to reduce
> the number of slab allocators).
Do we have to once again explain that slab still outperforms slub on at
least one important benchmark? I hope Nick Piggin finds time to finish
tuning slqb; it already outperforms slub.
--
Intel are signing my paycheques ... these opinions are still mine
"Bill, look, we understand that you're interested in selling us this
operating system, but compare it to ours. We can't possibly take such
a retrograde step."
^ permalink raw reply [flat|nested] 64+ messages in thread
* Re: No, really, stop trying to delete slab until you've finished making slub perform as well
2008-08-03 1:58 ` No, really, stop trying to delete slab until you've finished making slub perform as well Matthew Wilcox
@ 2008-08-03 21:25 ` Pekka Enberg
2008-08-04 2:37 ` Rene Herman
2008-08-04 13:43 ` Christoph Lameter
1 sibling, 1 reply; 64+ messages in thread
From: Pekka Enberg @ 2008-08-03 21:25 UTC (permalink / raw)
To: Matthew Wilcox
Cc: Christoph Lameter, akpm, linux-kernel, linux-fsdevel, Mel Gorman,
andi, Rik van Riel
Hi Matthew,
Matthew Wilcox wrote:
> Do we have to once again explain that slab still outperforms slub on at
> least one important benchmark? I hope Nick Piggin finds time to finish
> tuning slqb; it already outperforms slub.
No, you don't have to. I haven't merged that patch nor do I intend to do
so until the regressions are fixed.
And yes, I'm still waiting to hear from you how we're now doing with
higher order page allocations...
Pekka
^ permalink raw reply [flat|nested] 64+ messages in thread
* Re: No, really, stop trying to delete slab until you've finished making slub perform as well
2008-08-03 21:25 ` Pekka Enberg
@ 2008-08-04 2:37 ` Rene Herman
2008-08-04 21:22 ` Pekka Enberg
0 siblings, 1 reply; 64+ messages in thread
From: Rene Herman @ 2008-08-04 2:37 UTC (permalink / raw)
To: Pekka Enberg
Cc: Matthew Wilcox, Christoph Lameter, akpm, linux-kernel,
linux-fsdevel, Mel Gorman, andi, Rik van Riel
On 03-08-08 23:25, Pekka Enberg wrote:
> Matthew Wilcox wrote:
>> Do we have to once again explain that slab still outperforms slub on at
>> least one important benchmark? I hope Nick Piggin finds time to finish
>> tuning slqb; it already outperforms slub.
>
> No, you don't have to. I haven't merged that patch nor do I intend to do
> so until the regressions are fixed.
>
> And yes, I'm still waiting to hear from you how we're now doing with
> higher order page allocations...
General interested question -- I recently "accidentally" read some of
slub and I believe that it doesn't feature the cache colouring support
that slab did? Is that true, and if so, wasn't it needed/useful?
Rene.
^ permalink raw reply [flat|nested] 64+ messages in thread
* Re: [patch 15/19] Filesystem: XFS slab defragmentation
2008-08-03 1:42 ` Dave Chinner
@ 2008-08-04 13:36 ` Christoph Lameter
0 siblings, 0 replies; 64+ messages in thread
From: Christoph Lameter @ 2008-08-04 13:36 UTC (permalink / raw)
To: Christoph Lameter, Pekka Enberg, akpm, Christoph Lameter,
linux-kernel, linu
Dave Chinner wrote:
> I think that hunk is mis-applied. You're configuring the
> xfs_vnode_zone defrag after allocating the xfs_ioend_zone. This
> should be afew lines higher up, right?
That would be nicer but it's not a bug to have the setup where it is right now.
Fix:
Subject: defrag/xfs: Move defrag setup directly after xfs_vnode_zone kmem
cache creation
Move the setup of the defrag directly after the creation of the xfs_vnode_zone
Signed-off-by: Christoph Lameter <cl@linux-foundation.org>
Index: linux-2.6/fs/xfs/linux-2.6/xfs_super.c
===================================================================
--- linux-2.6.orig/fs/xfs/linux-2.6/xfs_super.c 2008-08-04 08:27:09.000000000 -0500
+++ linux-2.6/fs/xfs/linux-2.6/xfs_super.c 2008-08-04 08:27:25.000000000 -0500
@@ -2021,11 +2021,11 @@
if (!xfs_vnode_zone)
goto out;
+ kmem_cache_setup_defrag(xfs_vnode_zone, get_inodes, kick_inodes);
+
xfs_ioend_zone = kmem_zone_init(sizeof(xfs_ioend_t), "xfs_ioend");
if (!xfs_ioend_zone)
goto out_destroy_vnode_zone;
- kmem_cache_setup_defrag(xfs_vnode_zone, get_inodes, kick_inodes);
-
xfs_ioend_pool = mempool_create_slab_pool(4 * MAX_BUF_PER_PAGE,
xfs_ioend_zone);
if (!xfs_ioend_pool)
^ permalink raw reply [flat|nested] 64+ messages in thread
* Re: No, really, stop trying to delete slab until you've finished making slub perform as well
2008-08-03 1:58 ` No, really, stop trying to delete slab until you've finished making slub perform as well Matthew Wilcox
2008-08-03 21:25 ` Pekka Enberg
@ 2008-08-04 13:43 ` Christoph Lameter
2008-08-04 14:48 ` Jamie Lokier
` (2 more replies)
1 sibling, 3 replies; 64+ messages in thread
From: Christoph Lameter @ 2008-08-04 13:43 UTC (permalink / raw)
To: Matthew Wilcox
Cc: Pekka Enberg, akpm, linux-kernel, linux-fsdevel, Mel Gorman, andi,
Rik van Riel
Matthew Wilcox wrote:
> On Fri, May 09, 2008 at 07:21:01PM -0700, Christoph Lameter wrote:
>> - Add a patch that obsoletes SLAB and explains why SLOB does not support
>> defrag (Either of those could be theoretically equipped to support
>> slab defrag in some way but it seems that Andrew/Linus want to reduce
>> the number of slab allocators).
>
> Do we have to once again explain that slab still outperforms slub on at
> least one important benchmark? I hope Nick Piggin finds time to finish
> tuning slqb; it already outperforms slub.
>
Uhh. I forgot to delete that statement. I did not include the patch in the series.
We have a fundamental design issue there. Queuing on free can result in
better performance as in SLAB. However, it limits concurrency (per node lock
taking) and causes latency spikes due to queue processing (f.e. one test load
had 118.65 vs. 34 usecs just by switching to SLUB).
Could you address the performance issues in different ways? F.e. try to free
when the object is hot or free from multiple processors? SLAB has to take the
list_lock rather frequently under high concurrent loads (depends on queue
size). That will not occur with SLUB. So you actually can free (and allocate)
concurrently with high performance.
^ permalink raw reply [flat|nested] 64+ messages in thread
* Re: No, really, stop trying to delete slab until you've finished making slub perform as well
2008-08-04 13:43 ` Christoph Lameter
@ 2008-08-04 14:48 ` Jamie Lokier
2008-08-04 15:21 ` Jamie Lokier
2008-08-04 15:11 ` Rik van Riel
2008-08-04 16:47 ` KOSAKI Motohiro
2 siblings, 1 reply; 64+ messages in thread
From: Jamie Lokier @ 2008-08-04 14:48 UTC (permalink / raw)
To: Christoph Lameter
Cc: Matthew Wilcox, Pekka Enberg, akpm, linux-kernel, linux-fsdevel,
Mel Gorman, andi, Rik van Riel
Christoph Lameter wrote:
> Matthew Wilcox wrote:
> > On Fri, May 09, 2008 at 07:21:01PM -0700, Christoph Lameter wrote:
> >> - Add a patch that obsoletes SLAB and explains why SLOB does not support
> >> defrag (Either of those could be theoretically equipped to support
> >> slab defrag in some way but it seems that Andrew/Linus want to reduce
> >> the number of slab allocators).
> >
> > Do we have to once again explain that slab still outperforms slub on at
> > least one important benchmark? I hope Nick Piggin finds time to finish
> > tuning slqb; it already outperforms slub.
> >
>
> Uhh. I forgot to delete that statement. I did not include the patch
> in the series.
>
> We have a fundamental issue design issue there. Queuing on free can result in
> better performance as in SLAB. However, it limits concurrency (per node lock
> taking) and causes latency spikes due to queue processing (f.e. one test load
> had 118.65 vs. 34 usecs just by switching to SLUB).
Vaguely on this topic, has anyone studied the effects of SLAB/SLUB
etc. on MMUless systems?
-- Jamie
^ permalink raw reply [flat|nested] 64+ messages in thread
* Re: No, really, stop trying to delete slab until you've finished making slub perform as well
2008-08-04 13:43 ` Christoph Lameter
2008-08-04 14:48 ` Jamie Lokier
@ 2008-08-04 15:11 ` Rik van Riel
2008-08-04 16:02 ` Christoph Lameter
2008-08-04 16:47 ` KOSAKI Motohiro
2 siblings, 1 reply; 64+ messages in thread
From: Rik van Riel @ 2008-08-04 15:11 UTC (permalink / raw)
To: Christoph Lameter
Cc: Matthew Wilcox, Pekka Enberg, akpm, linux-kernel, linux-fsdevel,
Mel Gorman, andi
On Mon, 04 Aug 2008 08:43:21 -0500
Christoph Lameter <cl@linux-foundation.org> wrote:
> Matthew Wilcox wrote:
> > On Fri, May 09, 2008 at 07:21:01PM -0700, Christoph Lameter wrote:
> >> - Add a patch that obsoletes SLAB and explains why SLOB does not support
> >> defrag (Either of those could be theoretically equipped to support
> >> slab defrag in some way but it seems that Andrew/Linus want to reduce
> >> the number of slab allocators).
> >
> > Do we have to once again explain that slab still outperforms slub on at
> > least one important benchmark? I hope Nick Piggin finds time to finish
> > tuning slqb; it already outperforms slub.
> >
>
> Uhh. I forgot to delete that statement. I did not include the patch in the series.
>
> We have a fundamental design issue there. Queuing on free can result in
> better performance as in SLAB. However, it limits concurrency (per node lock
> taking) and causes latency spikes due to queue processing (f.e. one test load
> had 118.65 vs. 34 usecs just by switching to SLUB).
>
> Could you address the performance issues in different ways? F.e. try to free
> when the object is hot or free from multiple processors? SLAB has to take the
> list_lock rather frequently under high concurrent loads (depends on queue
> size). That will not occur with SLUB. So you actually can free (and allocate)
> concurrently with high performance.
I guess you could bypass the queueing on free for objects that
come from a "local" SLUB page, only queueing objects that go
onto remote pages.
That way workloads that already perform well with SLUB should
keep the current performance, while workloads that currently
perform badly with SLUB should get an improvement.
--
All Rights Reversed
^ permalink raw reply [flat|nested] 64+ messages in thread
* Re: No, really, stop trying to delete slab until you've finished making slub perform as well
2008-08-04 14:48 ` Jamie Lokier
@ 2008-08-04 15:21 ` Jamie Lokier
2008-08-04 16:35 ` Christoph Lameter
0 siblings, 1 reply; 64+ messages in thread
From: Jamie Lokier @ 2008-08-04 15:21 UTC (permalink / raw)
To: Christoph Lameter
Cc: Matthew Wilcox, Pekka Enberg, akpm, linux-kernel, linux-fsdevel,
Mel Gorman, andi, Rik van Riel
Jamie Lokier wrote:
> Vaguely on this topic, has anyone studied the effects of SLAB/SLUB
> etc. on MMUless systems?
The reason is that MMU-less systems are extremely sensitive to
fragmentation. Every program started on those systems must allocate a
large contiguous block for the code and data, and every malloc >1 page
is the same. If memory is too fragmented, starting new programs fails.
The high-order page-allocator defragmentation lately should help with
that.
The different behaviours of SLAB/SLUB might result in different levels
of fragmentation, so I wonder if anyone has compared them on MMU-less
systems or fragmentation-sensitive workloads on general systems.
Thanks,
-- Jamie
^ permalink raw reply [flat|nested] 64+ messages in thread
* Re: No, really, stop trying to delete slab until you've finished making slub perform as well
2008-08-04 15:11 ` Rik van Riel
@ 2008-08-04 16:02 ` Christoph Lameter
0 siblings, 0 replies; 64+ messages in thread
From: Christoph Lameter @ 2008-08-04 16:02 UTC (permalink / raw)
To: Rik van Riel
Cc: Matthew Wilcox, Pekka Enberg, akpm, linux-kernel, linux-fsdevel,
Mel Gorman, andi
Rik van Riel wrote:
> I guess you could bypass the queueing on free for objects that
> come from a "local" SLUB page, only queueing objects that go
> onto remote pages.
Tried that already. The logic to decide if an object is local is creating
significant overhead. Plus you need queues for the remote nodes. Back to alien
queues?
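For reference, the locality test in question boils down to something like the
following sketch (not actual slub code); note that it already implies a page
struct lookup on every free:

/* Sketch only: is the object backed by a page on the local NUMA node? */
static bool object_is_local(void *object)
{
	struct page *page = virt_to_head_page(object);

	return page_to_nid(page) == numa_node_id();
}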
^ permalink raw reply [flat|nested] 64+ messages in thread
* Re: No, really, stop trying to delete slab until you've finished making slub perform as well
2008-08-04 15:21 ` Jamie Lokier
@ 2008-08-04 16:35 ` Christoph Lameter
0 siblings, 0 replies; 64+ messages in thread
From: Christoph Lameter @ 2008-08-04 16:35 UTC (permalink / raw)
To: Jamie Lokier
Cc: Matthew Wilcox, Pekka Enberg, akpm, linux-kernel, linux-fsdevel,
Mel Gorman, andi, Rik van Riel
Jamie Lokier wrote:
> The different behaviours of SLAB/SLUB might result in different levels
> of fragmentation, so I wonder if anyone has compared them on MMU-less
> systems or fragmentation-sensitive workloads on general systems.
Never heard of such a comparison.
MMU-less systems typically have a minimal number of processors. For that
configuration the page orders are roughly equivalent to SLAB's. Larger orders
only come into play with large numbers of processors.
^ permalink raw reply [flat|nested] 64+ messages in thread
* Re: No, really, stop trying to delete slab until you've finished making slub perform as well
2008-08-04 13:43 ` Christoph Lameter
2008-08-04 14:48 ` Jamie Lokier
2008-08-04 15:11 ` Rik van Riel
@ 2008-08-04 16:47 ` KOSAKI Motohiro
2008-08-04 17:13 ` Christoph Lameter
2008-08-04 17:19 ` Christoph Lameter
2 siblings, 2 replies; 64+ messages in thread
From: KOSAKI Motohiro @ 2008-08-04 16:47 UTC (permalink / raw)
To: Christoph Lameter
Cc: Matthew Wilcox, Pekka Enberg, akpm, linux-kernel, linux-fsdevel,
Mel Gorman, andi, Rik van Riel, kosaki.motohiro
Hi
> Could you address the performance issues in different ways? F.e. try to free
> when the object is hot or free from multiple processors? SLAB has to take the
> list_lock rather frequently under high concurrent loads (depends on queue
> size). That will not occur with SLUB. So you actually can free (and allocate)
> concurrently with high performance.
Just some information (off topic?):
When hackbench is running, SLUB consumes much more memory than SLAB.
Then SLAB often outperforms SLUB under memory starvation.
I don't know why the memory consumption differs.
Anyone know it?
^ permalink raw reply [flat|nested] 64+ messages in thread
* Re: No, really, stop trying to delete slab until you've finished making slub perform as well
2008-08-04 16:47 ` KOSAKI Motohiro
@ 2008-08-04 17:13 ` Christoph Lameter
2008-08-04 17:20 ` Pekka Enberg
2008-08-05 12:06 ` KOSAKI Motohiro
2008-08-04 17:19 ` Christoph Lameter
1 sibling, 2 replies; 64+ messages in thread
From: Christoph Lameter @ 2008-08-04 17:13 UTC (permalink / raw)
To: KOSAKI Motohiro
Cc: Matthew Wilcox, Pekka Enberg, akpm, linux-kernel, linux-fsdevel,
Mel Gorman, andi, Rik van Riel, kosaki.motohiro
KOSAKI Motohiro wrote:
> When hackbench is running, SLUB consumes much more memory than SLAB.
> Then SLAB often outperforms SLUB under memory starvation.
>
> I don't know why the memory consumption differs.
> Anyone know it?
Can you quantify the difference?
SLAB buffers objects in its queues. SLUB does rely more on the page allocator.
So SLAB may have its own reserves to fall back on.
^ permalink raw reply [flat|nested] 64+ messages in thread
* Re: No, really, stop trying to delete slab until you've finished making slub perform as well
2008-08-04 16:47 ` KOSAKI Motohiro
2008-08-04 17:13 ` Christoph Lameter
@ 2008-08-04 17:19 ` Christoph Lameter
1 sibling, 0 replies; 64+ messages in thread
From: Christoph Lameter @ 2008-08-04 17:19 UTC (permalink / raw)
To: KOSAKI Motohiro
Cc: Matthew Wilcox, Pekka Enberg, akpm, linux-kernel, linux-fsdevel,
Mel Gorman, andi, Rik van Riel, kosaki.motohiro
KOSAKI Motohiro wrote:
>
> When hackbench is running, SLUB consumes much more memory than SLAB.
> Then SLAB often outperforms SLUB under memory starvation.
Re memory use: If SLUB finds that there is lock contention on a slab page then
it will allocate a new one and dedicate it to a cpu in order to avoid future
contentions.
^ permalink raw reply [flat|nested] 64+ messages in thread
* Re: No, really, stop trying to delete slab until you've finished making slub perform as well
2008-08-04 17:13 ` Christoph Lameter
@ 2008-08-04 17:20 ` Pekka Enberg
2008-08-05 12:06 ` KOSAKI Motohiro
1 sibling, 0 replies; 64+ messages in thread
From: Pekka Enberg @ 2008-08-04 17:20 UTC (permalink / raw)
To: Christoph Lameter
Cc: KOSAKI Motohiro, Matthew Wilcox, akpm, linux-kernel,
linux-fsdevel, Mel Gorman, andi, Rik van Riel, kosaki.motohiro
On Mon, Aug 4, 2008 at 8:13 PM, Christoph Lameter
<cl@linux-foundation.org> wrote:
> KOSAKI Motohiro wrote:
>
>> When hackbench running, SLUB consume memory very largely than SLAB.
>> then, SLAB often outperform SLUB in memory stavation state.
>>
>> I don't know why memory comsumption different.
>> Anyone know it?
>
> Can you quantify the difference?
>
> SLAB buffers objects in its queues. SLUB does rely more on the page allocator.
> So SLAB may have its own reserves to fall back on.
Also, what kind of machine are we talking about here? If there are a
lot of CPUs, SLUB will allocate higher order pages more aggressively
than SLAB by default.
^ permalink raw reply [flat|nested] 64+ messages in thread
* Re: No, really, stop trying to delete slab until you've finished making slub perform as well
2008-08-04 2:37 ` Rene Herman
@ 2008-08-04 21:22 ` Pekka Enberg
2008-08-04 21:41 ` Christoph Lameter
0 siblings, 1 reply; 64+ messages in thread
From: Pekka Enberg @ 2008-08-04 21:22 UTC (permalink / raw)
To: Rene Herman
Cc: Matthew Wilcox, Christoph Lameter, akpm, linux-kernel,
linux-fsdevel, Mel Gorman, andi, Rik van Riel
Rene Herman wrote:
> On 03-08-08 23:25, Pekka Enberg wrote:
>
>> Matthew Wilcox wrote:
>
>>> Do we have to once again explain that slab still outperforms slub on at
>>> least one important benchmark? I hope Nick Piggin finds time to finish
>>> tuning slqb; it already outperforms slub.
>>
>> No, you don't have to. I haven't merged that patch nor do I intend to
>> do so until the regressions are fixed.
>>
>> And yes, I'm still waiting to hear from you how we're now doing with
>> higher order page allocations...
>
> General interested question -- I recently "accidentally" read some of
> slub and I believe that it doesn't feature the cache colouring support
> that slab did? Is that true, and if so, wasn't it needed/useful?
I don't know why Christoph decided not to implement it. Christoph?
^ permalink raw reply [flat|nested] 64+ messages in thread
* Re: No, really, stop trying to delete slab until you've finished making slub perform as well
2008-08-04 21:22 ` Pekka Enberg
@ 2008-08-04 21:41 ` Christoph Lameter
2008-08-04 23:09 ` Rene Herman
0 siblings, 1 reply; 64+ messages in thread
From: Christoph Lameter @ 2008-08-04 21:41 UTC (permalink / raw)
To: Pekka Enberg
Cc: Rene Herman, Matthew Wilcox, akpm, linux-kernel, linux-fsdevel,
Mel Gorman, andi, Rik van Riel
Pekka Enberg wrote:
>> General interested question -- I recently "accidentally" read some of
>> slub and I believe that it doesn't feature the cache colouring support
>> that slab did? Is that true, and if so, wasn't it needed/useful?
>
> I don't know why Christoph decided not to implement it. Christoph?
IMHO cache coloring issues seem to be mostly taken care of by newer more
associative cpu caching designs.
Note that the SLAB design origin is Solaris (See the paper by Jeff Bonwick in
1994 that is quoted in mm/slab.c). Logic for cache coloring is mostly avoided
today due to the complexity it would introduce. See also
http://en.wikipedia.org/wiki/CPU_cache.
What one could add to support cache coloring in SLUB is a prearrangement of
the object allocation order by constructing the initial freelist for
a page in a certain way. See mm/slub.c::new_slab().
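A rough sketch of that idea (assuming a hypothetical per-cache colour counter; SLUB has no such field and new_slab() does not do this today, and set_freepointer() is the usual SLUB freelist helper):

/*
 * Sketch only: build the initial freelist of a new slab starting at a
 * rotating "colour" index, so consecutive slabs hand out their first
 * objects at different offsets. 's->colour_next' is hypothetical.
 */
static void sketch_init_coloured_freelist(struct kmem_cache *s,
					  struct page *page, void *addr,
					  unsigned int objects)
{
	unsigned int start = s->colour_next % objects;	/* hypothetical field */
	unsigned int i;
	void *last = NULL;

	s->colour_next++;

	for (i = 0; i < objects; i++) {
		void *p = addr + ((start + i) % objects) * s->size;

		if (last)
			set_freepointer(s, last, p);	/* chain previous -> p */
		else
			page->freelist = p;		/* head of the freelist */
		last = p;
	}
	set_freepointer(s, last, NULL);			/* terminate the list */
}

Rotating the start object per slab spreads the first allocations of successive slabs over different cache lines, a cheap approximation of SLAB's colour_off logic that stays out of the fast paths.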
^ permalink raw reply [flat|nested] 64+ messages in thread
* Re: No, really, stop trying to delete slab until you've finished making slub perform as well
2008-08-04 21:41 ` Christoph Lameter
@ 2008-08-04 23:09 ` Rene Herman
0 siblings, 0 replies; 64+ messages in thread
From: Rene Herman @ 2008-08-04 23:09 UTC (permalink / raw)
To: Christoph Lameter
Cc: Pekka Enberg, Matthew Wilcox, akpm, linux-kernel, linux-fsdevel,
Mel Gorman, andi, Rik van Riel
On 04-08-08 23:41, Christoph Lameter wrote:
>>> General interested question -- I recently "accidentally" read some of
>>> slub and I believe that it doesn't feature the cache colouring support
>>> that slab did? Is that true, and if so, wasn't it needed/useful?
>> I don't know why Christoph decided not to implement it. Christoph?
>
> IMHO cache coloring issues seem to be mostly taken care of by newer more
> associative cpu caching designs.
I see. Just gathered a bit of data on this (from sandpile.org):
32-byte lines:
P54 : L1 I 8K, 2-Way
D 8K, 2-Way
L2 External
P55 : L1 I 16K, 4-Way
D 16K, 4-Way
L2 External
P2 : L1 I 16K 4-Way
D 16K 4-Way
L2 128K to 2MB 4-Way
P3 : L1 I 16K 4-Way
D 16K 4-Way
L2 128K to 2MB 4-Way or
256K to 2MB 8-Way
64-byte lines:
P4 : L1 I 12K uOP Trace (8-Way, 6 uOP line)
D 8K 4-Way or
16K 8-Way
L2 128K 2-Way or
128K, 256K 4-Way or
512K, 1M, 2M 8-Way
L3 512K 4-Way or
1M to 8M 8-Way or
2M to 16M 16-Way
Core: L1 I 32K 8-Way
D 32K 8-Way
L2 512K 2-Way or
1M 4-Way or
2M 8-Way or
3M 12-Way or
4M 16-Way
K7 : L1 I 64K 2-Way
D 64K 2-Way
L2 512, 1M, 2M 2-Way or
4M, 8M 1-Way or
64K, 256K, 512K 16-Way
K8 : L1 I 64K 2-Way
D 64K 2-Way
L2 128K to 1M 16-Way
The L1 on K7 and K8 especially still seems a bit of a worry here.
> Note that the SLAB design origin is Solaris (See the paper by Jeff Bonwick in
> 1994 that is quoted in mm/slab.c). Logic for cache coloring is mostly avoided
> today due to the complexity it would introduce. See also
> http://en.wikipedia.org/wiki/CPU_cache.
>
> What one could add to support cache coloring in SLUB is a prearrangement of
> the object allocation order by constructing the initial freelist for
> a page in a certain way. See mm/slub.c::new_slab().
<remains silent>
To me, colouring always seemed like a fairly promising thing but I won't
pretend to have any sort of data.
Rene.
^ permalink raw reply [flat|nested] 64+ messages in thread
* Re: No, really, stop trying to delete slab until you've finished making slub perform as well
2008-08-04 17:13 ` Christoph Lameter
2008-08-04 17:20 ` Pekka Enberg
@ 2008-08-05 12:06 ` KOSAKI Motohiro
2008-08-05 14:59 ` Christoph Lameter
1 sibling, 1 reply; 64+ messages in thread
From: KOSAKI Motohiro @ 2008-08-05 12:06 UTC (permalink / raw)
To: Christoph Lameter
Cc: kosaki.motohiro, KOSAKI Motohiro, Matthew Wilcox, Pekka Enberg,
akpm, linux-kernel, linux-fsdevel, Mel Gorman, andi, Rik van Riel
> KOSAKI Motohiro wrote:
>
> > When hackbench is running, SLUB consumes much more memory than SLAB.
> > SLAB then often outperforms SLUB under memory starvation.
> >
> > I don't know why the memory consumption differs.
> > Does anyone know?
>
> Can you quantify the difference?
machine spec:
CPU: IA64 x 8
MEM: 8G (4G x2node)
test method
1. echo 3 >/proc/sys/vm/drop_caches
2. % ./hackbench 90 process 1000 <- for fill pagetable cache
3. % ./hackbench 90 process 1000
vmstat result
<SLAB (without CONFIG_DEBUG_SLAB)>
procs -----------memory---------- ---swap-- -----io---- --system-- -----cpu------
r b swpd free buff cache si so bi bo in cs us sy id wa st
2 0 0 3223168 6016 38336 0 0 0 0 3181 4314 0 15 85 0 0
2039 2 0 2022144 6016 38336 0 0 0 0 2364 13622 0 49 51 0 0
634 0 0 2629824 6080 38336 0 0 0 64 83582 2538927 5 95 0 0 0
596 0 0 2842624 6080 38336 0 0 0 0 6864 675841 6 94 0 0 0
590 0 0 2993472 6080 38336 0 0 0 0 9514 456085 6 94 0 0 0
503 0 0 3138560 6080 38336 0 0 0 0 8042 276024 4 96 0 0 0
about 3G remain.
<SLUB>
procs -----------memory---------- ---swap-- -----io---- --system-- -----cpu------
r b swpd free buff cache si so bi bo in cs us sy id wa st
1066 0 0 323008 3584 18240 0 0 0 0 12037 47353 1 99 0 0 0
1101 0 0 324672 3584 18240 0 0 0 0 6029 25100 1 99 0 0 0
913 0 0 330240 3584 18240 0 0 0 0 9694 54951 2 98 0 0 0
about 300M remain.
So, about 2.5G - 3G difference in 8G mem.
^ permalink raw reply [flat|nested] 64+ messages in thread
* Re: No, really, stop trying to delete slab until you've finished making slub perform as well
2008-08-05 12:06 ` KOSAKI Motohiro
@ 2008-08-05 14:59 ` Christoph Lameter
2008-08-06 12:36 ` KOSAKI Motohiro
2008-08-13 10:46 ` KOSAKI Motohiro
0 siblings, 2 replies; 64+ messages in thread
From: Christoph Lameter @ 2008-08-05 14:59 UTC (permalink / raw)
To: KOSAKI Motohiro
Cc: KOSAKI Motohiro, Matthew Wilcox, Pekka Enberg, akpm, linux-kernel,
linux-fsdevel, Mel Gorman, andi, Rik van Riel
KOSAKI Motohiro wrote:
>> Can you quantify the difference?
>
> machine spec:
> CPU: IA64 x 8
> MEM: 8G (4G x2node)
16k or 64k page size?
> procs -----------memory---------- ---swap-- -----io---- --system-- -----cpu------
> r b swpd free buff cache si so bi bo in cs us sy id wa st
> 2 0 0 3223168 6016 38336 0 0 0 0 3181 4314 0 15 85 0 0
> 2039 2 0 2022144 6016 38336 0 0 0 0 2364 13622 0 49 51 0 0
> 634 0 0 2629824 6080 38336 0 0 0 64 83582 2538927 5 95 0 0 0
> 596 0 0 2842624 6080 38336 0 0 0 0 6864 675841 6 94 0 0 0
> 590 0 0 2993472 6080 38336 0 0 0 0 9514 456085 6 94 0 0 0
> 503 0 0 3138560 6080 38336 0 0 0 0 8042 276024 4 96 0 0 0
>
> about 3G remain.
>
> <SLUB>
> procs -----------memory---------- ---swap-- -----io---- --system-- -----cpu------
> r b swpd free buff cache si so bi bo in cs us sy id wa st
> 1066 0 0 323008 3584 18240 0 0 0 0 12037 47353 1 99 0 0 0
> 1101 0 0 324672 3584 18240 0 0 0 0 6029 25100 1 99 0 0 0
> 913 0 0 330240 3584 18240 0 0 0 0 9694 54951 2 98 0 0 0
>
> about 300M remain.
>
>
> So, about 2.5G - 3G difference in 8G mem.
Well not sure if that tells us much. Please show us the output of
/proc/meminfo after each run. The slab counters indicate how much memory is
used by the slabs.
It would also be interesting to see the output of the slabinfo command after
the slub run?
^ permalink raw reply [flat|nested] 64+ messages in thread
* Re: No, really, stop trying to delete slab until you've finished making slub perform as well
2008-08-05 14:59 ` Christoph Lameter
@ 2008-08-06 12:36 ` KOSAKI Motohiro
2008-08-06 14:24 ` Christoph Lameter
2008-08-13 10:46 ` KOSAKI Motohiro
1 sibling, 1 reply; 64+ messages in thread
From: KOSAKI Motohiro @ 2008-08-06 12:36 UTC (permalink / raw)
To: Christoph Lameter
Cc: Matthew Wilcox, Pekka Enberg, akpm, linux-kernel, linux-fsdevel,
Mel Gorman, andi, Rik van Riel
>>> Can you quantify the difference?
>>
>> machine spec:
>> CPU: IA64 x 8
>> MEM: 8G (4G x2node)
>
> 16k or 64k page size?
64k.
>> So, about 2.5G - 3G difference in 8G mem.
>
> Well not sure if that tells us much. Please show us the output of
> /proc/meminfo after each run. The slab counters indicate how much memory is
> used by the slabs.
>
> It would also be interesting to see the output of the slabinfo command after
> the slub run?
ok.
but I can't do that this week,
so I'll do it next week.
Honestly, I don't know how to use the slabinfo command :-)
^ permalink raw reply [flat|nested] 64+ messages in thread
* Re: No, really, stop trying to delete slab until you've finished making slub perform as well
2008-08-06 12:36 ` KOSAKI Motohiro
@ 2008-08-06 14:24 ` Christoph Lameter
0 siblings, 0 replies; 64+ messages in thread
From: Christoph Lameter @ 2008-08-06 14:24 UTC (permalink / raw)
To: KOSAKI Motohiro
Cc: Matthew Wilcox, Pekka Enberg, akpm, linux-kernel, linux-fsdevel,
Mel Gorman, andi, Rik van Riel
KOSAKI Motohiro wrote:
>>>> Can you quantify the difference?
>>> machine spec:
>>> CPU: IA64 x 8
>>> MEM: 8G (4G x2node)
>> 16k or 64k page size?
>
> 64k.
>
>
>>> So, about 2.5G - 3G difference in 8G mem.
>> Well not sure if that tells us much. Please show us the output of
>> /proc/meminfo after each run. The slab counters indicate how much memory is
>> used by the slabs.
>>
>> It would also be interesting to see the output of the slabinfo command after
>> the slub run?
>
> ok.
> but I can't do that this week,
> so I'll do it next week.
>
> Honestly, I don't know how to use the slabinfo command :-)
It's in linux/Documentation/vm/slabinfo.c
Do
gcc -o slabinfo Documentation/vm/slabinfo.c
./slabinfo
(./slabinfo -h if you are curious and want to use more advanced options)
^ permalink raw reply [flat|nested] 64+ messages in thread
* [patch 11/19] inodes: Support generic defragmentation
2008-08-11 15:06 [patch 00/19] Slab Fragmentation Reduction V14 Christoph Lameter
@ 2008-08-11 15:06 ` Christoph Lameter
0 siblings, 0 replies; 64+ messages in thread
From: Christoph Lameter @ 2008-08-11 15:06 UTC (permalink / raw)
To: Pekka Enberg
Cc: akpm, Alexander Viro, Christoph Hellwig, Christoph Lameter,
Christoph Lameter, linux-kernel, linux-fsdevel, Mel Gorman, andi,
Rik van Riel, mpm, Dave Chinner
[-- Attachment #1: 0025-inodes-Support-generic-defragmentation.patch --]
[-- Type: text/plain, Size: 5241 bytes --]
This implements the ability to remove inodes in a particular slab
from inode caches. In order to remove an inode we may have to write out
the pages of the inode, write out the inode itself, and remove the dentries
referring to the inode.
Provide generic functionality that can be used by filesystems that have
their own inode caches to also tie into the defragmentation functions
that are made available here.
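As an illustration of that interface, a filesystem that embeds struct inode in its own inode would hook in roughly as below (a sketch; 'myfs_inode_info', 'myfs_inode_cachep' and 'myfs_enable_defrag' are placeholder names, and kmem_cache_setup_defrag() comes from the earlier patches in this series):

/* Sketch: wiring a filesystem's own inode cache into slab defrag. */
static void *myfs_get_inodes(struct kmem_cache *s, int nr, void **v)
{
	/* Shift the pointers from the fs inode to the embedded struct
	 * inode, then take references via get_inodes(). */
	return fs_get_inodes(s, nr, v,
			     offsetof(struct myfs_inode_info, vfs_inode));
}

static void myfs_enable_defrag(void)
{
	/* Call this after kmem_cache_create() of myfs_inode_cachep. */
	kmem_cache_setup_defrag(myfs_inode_cachep,
				myfs_get_inodes, kick_inodes);
}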
Cc: Alexander Viro <viro@ftp.linux.org.uk>
Cc: Christoph Hellwig <hch@infradead.org>
Reviewed-by: Rik van Riel <riel@redhat.com>
Signed-off-by: Christoph Lameter <clameter@sgi.com>
Signed-off-by: Christoph Lameter <cl@linux-foundation.org>
---
fs/inode.c | 123 +++++++++++++++++++++++++++++++++++++++++++++++++++++
include/linux/fs.h | 6 ++
2 files changed, 129 insertions(+)
Index: linux-next/fs/inode.c
===================================================================
--- linux-next.orig/fs/inode.c 2008-08-11 07:42:10.738607937 -0700
+++ linux-next/fs/inode.c 2008-08-11 07:47:04.342348902 -0700
@@ -1363,6 +1363,128 @@ static int __init set_ihash_entries(char
__setup("ihash_entries=", set_ihash_entries);
/*
+ * Obtain a refcount on a list of struct inodes pointed to by v. If the
+ * inode is in the process of being freed then zap the v[] entry so that
+ * we skip the freeing attempts later.
+ *
+ * This is a generic function for the ->get slab defrag callback.
+ */
+void *get_inodes(struct kmem_cache *s, int nr, void **v)
+{
+ int i;
+
+ spin_lock(&inode_lock);
+ for (i = 0; i < nr; i++) {
+ struct inode *inode = v[i];
+
+ if (inode->i_state & (I_FREEING|I_CLEAR|I_WILL_FREE))
+ v[i] = NULL;
+ else
+ __iget(inode);
+ }
+ spin_unlock(&inode_lock);
+ return NULL;
+}
+EXPORT_SYMBOL(get_inodes);
+
+/*
+ * Function for filesystems that embed struct inode into their own
+ * fs inode. The offset is the offset of the struct inode in the fs inode.
+ *
+ * The function adds to the pointers in v[] in order to make them point to
+ * struct inode. Then get_inodes() is used to get the refcount.
+ * The converted v[] pointers can then also be passed to the kick() callback
+ * without further processing.
+ */
+void *fs_get_inodes(struct kmem_cache *s, int nr, void **v,
+ unsigned long offset)
+{
+ int i;
+
+ for (i = 0; i < nr; i++)
+ v[i] += offset;
+
+ return get_inodes(s, nr, v);
+}
+EXPORT_SYMBOL(fs_get_inodes);
+
+/*
+ * Generic callback function for slab defrag ->kick methods. Takes the
+ * array with inodes where we obtained refcounts using fs_get_inodes()
+ * or get_inodes() and tries to free them.
+ */
+void kick_inodes(struct kmem_cache *s, int nr, void **v, void *private)
+{
+ struct inode *inode;
+ int i;
+ int abort = 0;
+ LIST_HEAD(freeable);
+ int active;
+
+ for (i = 0; i < nr; i++) {
+ inode = v[i];
+ if (!inode)
+ continue;
+
+ if (inode_has_buffers(inode) || inode->i_data.nrpages) {
+ if (remove_inode_buffers(inode))
+ /*
+ * Should we really be doing this? Or
+ * limit the writeback here to only a few pages?
+ *
+ * Possibly an expensive operation but we
+ * cannot reclaim the inode if the pages
+ * are still present.
+ */
+ invalidate_mapping_pages(&inode->i_data,
+ 0, -1);
+ }
+
+ /* Invalidate children and dentry */
+ if (S_ISDIR(inode->i_mode)) {
+ struct dentry *d = d_find_alias(inode);
+
+ if (d) {
+ d_invalidate(d);
+ dput(d);
+ }
+ }
+
+ if (inode->i_state & I_DIRTY)
+ write_inode_now(inode, 1);
+
+ d_prune_aliases(inode);
+ }
+
+ mutex_lock(&iprune_mutex);
+ for (i = 0; i < nr; i++) {
+ inode = v[i];
+
+ if (!inode)
+ /* inode is already being freed */
+ continue;
+
+ active = inode->i_sb->s_flags & MS_ACTIVE;
+ iput(inode);
+ if (abort || !active)
+ continue;
+
+ spin_lock(&inode_lock);
+ abort = !can_unuse(inode);
+
+ if (!abort) {
+ list_move(&inode->i_list, &freeable);
+ inode->i_state |= I_FREEING;
+ inodes_stat.nr_unused--;
+ }
+ spin_unlock(&inode_lock);
+ }
+ dispose_list(&freeable);
+ mutex_unlock(&iprune_mutex);
+}
+EXPORT_SYMBOL(kick_inodes);
+
+/*
* Initialize the waitqueues and inode hash table.
*/
void __init inode_init_early(void)
@@ -1401,6 +1523,7 @@ void __init inode_init(void)
SLAB_MEM_SPREAD),
init_once);
register_shrinker(&icache_shrinker);
+ kmem_cache_setup_defrag(inode_cachep, get_inodes, kick_inodes);
/* Hash may have been set up in inode_init_early */
if (!hashdist)
Index: linux-next/include/linux/fs.h
===================================================================
--- linux-next.orig/include/linux/fs.h 2008-08-11 07:42:30.598607988 -0700
+++ linux-next/include/linux/fs.h 2008-08-11 07:47:05.012377598 -0700
@@ -1846,6 +1846,12 @@ static inline void insert_inode_hash(str
__insert_inode_hash(inode, inode->i_ino);
}
+/* Helper functions for inode defragmentation support in filesystems */
+extern void kick_inodes(struct kmem_cache *, int, void **, void *);
+extern void *get_inodes(struct kmem_cache *, int nr, void **);
+extern void *fs_get_inodes(struct kmem_cache *, int nr, void **,
+ unsigned long offset);
+
extern struct file * get_empty_filp(void);
extern void file_move(struct file *f, struct list_head *list);
extern void file_kill(struct file *f);
--
^ permalink raw reply [flat|nested] 64+ messages in thread
* Re: [patch 14/19] Filesystem: Ext4 filesystem defrag
2008-08-03 1:54 ` Theodore Tso
@ 2008-08-13 7:26 ` Pekka Enberg
0 siblings, 0 replies; 64+ messages in thread
From: Pekka Enberg @ 2008-08-13 7:26 UTC (permalink / raw)
To: Theodore Tso, Christoph Lameter, Pekka Enberg, akpm,
Christoph Lameter, lin
Theodore Tso wrote:
> On Fri, May 09, 2008 at 07:21:15PM -0700, Christoph Lameter wrote:
>> Support defragmentation for extX filesystem inodes
>
> You forgot to change "extX" to "ext4". :-)
Fixed that up now.
>> Reviewed-by: Rik van Riel <riel@redhat.com>
>> Signed-off-by: Christoph Lameter <clameter@sgi.com>
>> Signed-off-by: Christoph Lameter <cl@linux-foundation.org>
> Acked-by: "Theodore Ts'o" <tytso@mit.edu>
Thanks!
^ permalink raw reply [flat|nested] 64+ messages in thread
* Re: No, really, stop trying to delete slab until you've finished making slub perform as well
2008-08-05 14:59 ` Christoph Lameter
2008-08-06 12:36 ` KOSAKI Motohiro
@ 2008-08-13 10:46 ` KOSAKI Motohiro
2008-08-13 13:10 ` Christoph Lameter
1 sibling, 1 reply; 64+ messages in thread
From: KOSAKI Motohiro @ 2008-08-13 10:46 UTC (permalink / raw)
To: Christoph Lameter
Cc: kosaki.motohiro, KOSAKI Motohiro, Matthew Wilcox, Pekka Enberg,
akpm, linux-kernel, linux-fsdevel, Mel Gorman, andi, Rik van Riel
> Well not sure if that tells us much. Please show us the output of
> /proc/meminfo after each run. The slab counters indicate how much memory is
> used by the slabs.
>
> It would also be interesting to see the output of the slabinfo command after
> the slub run?
Sorry for the late response.
SLAB uses 123M vs. SLUB's 1.5G.
Thoughts?
<slab>
% cat /proc/meminfo
MemTotal: 7701760 kB
MemFree: 5940096 kB
Buffers: 6400 kB
Cached: 27712 kB
SwapCached: 52544 kB
Active: 51520 kB
Inactive: 53248 kB
Active(anon): 26752 kB
Inactive(anon): 41792 kB
Active(file): 24768 kB
Inactive(file): 11456 kB
Unevictable: 0 kB
Mlocked: 0 kB
SwapTotal: 2031488 kB
SwapFree: 1958400 kB
Dirty: 192 kB
Writeback: 0 kB
AnonPages: 38400 kB
Mapped: 23232 kB
Slab: 123840 kB
SReclaimable: 30272 kB
SUnreclaim: 93568 kB
PageTables: 10688 kB
NFS_Unstable: 0 kB
Bounce: 0 kB
WritebackTmp: 0 kB
CommitLimit: 5882368 kB
Committed_AS: 397568 kB
VmallocTotal: 17592177655808 kB
VmallocUsed: 29184 kB
VmallocChunk: 17592177626240 kB
HugePages_Total: 0
HugePages_Free: 0
HugePages_Rsvd: 0
HugePages_Surp: 0
Hugepagesize: 262144 kB
% cat /proc/slabinfo
slabinfo - version: 2.1
# name <active_objs> <num_objs> <objsize> <objperslab> <pagesperslab> : tunables <limit> <batchcount> <sharedfactor> : slabdata <active_slabs> <num_slabs> <sharedavail>
dm_mpath_io 0 0 40 1488 1 : tunables 120 60 8 : slabdata 0 0 0
dm_snap_tracked_chunk 0 0 24 2338 1 : tunables 120 60 8 : slabdata 0 0 0
dm_snap_pending_exception 0 0 112 564 1 : tunables 120 60 8 : slabdata 0 0 0
dm_snap_exception 0 0 32 1818 1 : tunables 120 60 8 : slabdata 0 0 0
kcopyd_job 0 0 408 158 1 : tunables 54 27 8 : slabdata 0 0 0
dm_target_io 515 2338 24 2338 1 : tunables 120 60 8 : slabdata 1 1 0
dm_io 515 1818 32 1818 1 : tunables 120 60 8 : slabdata 1 1 0
scsi_sense_cache 26 496 128 496 1 : tunables 120 60 8 : slabdata 1 1 0
scsi_cmd_cache 26 168 384 168 1 : tunables 54 27 8 : slabdata 1 1 0
uhci_urb_priv 0 0 56 1091 1 : tunables 120 60 8 : slabdata 0 0 0
flow_cache 0 0 96 654 1 : tunables 120 60 8 : slabdata 0 0 0
cfq_io_context 48 760 168 380 1 : tunables 120 60 8 : slabdata 2 2 0
cfq_queue 41 934 136 467 1 : tunables 120 60 8 : slabdata 2 2 0
mqueue_inode_cache 1 56 1152 56 1 : tunables 24 12 8 : slabdata 1 1 0
fat_inode_cache 1 77 840 77 1 : tunables 54 27 8 : slabdata 1 1 0
fat_cache 0 0 32 1818 1 : tunables 120 60 8 : slabdata 0 0 0
hugetlbfs_inode_cache 1 83 776 83 1 : tunables 54 27 8 : slabdata 1 1 0
ext2_inode_cache 0 0 1024 63 1 : tunables 54 27 8 : slabdata 0 0 0
ext2_xattr 0 0 88 711 1 : tunables 120 60 8 : slabdata 0 0 0
jbd2_journal_handle 0 0 24 2338 1 : tunables 120 60 8 : slabdata 0 0 0
jbd2_journal_head 0 0 96 654 1 : tunables 120 60 8 : slabdata 0 0 0
jbd2_revoke_table 0 0 16 3274 1 : tunables 120 60 8 : slabdata 0 0 0
jbd2_revoke_record 0 0 32 1818 1 : tunables 120 60 8 : slabdata 0 0 0
journal_handle 48 4676 24 2338 1 : tunables 120 60 8 : slabdata 2 2 0
journal_head 41 1308 96 654 1 : tunables 120 60 8 : slabdata 2 2 0
revoke_table 4 3274 16 3274 1 : tunables 120 60 8 : slabdata 1 1 0
revoke_record 0 0 32 1818 1 : tunables 120 60 8 : slabdata 0 0 0
ext4_inode_cache 0 0 1192 54 1 : tunables 24 12 8 : slabdata 0 0 0
ext4_xattr 0 0 88 711 1 : tunables 120 60 8 : slabdata 0 0 0
ext4_alloc_context 0 0 168 380 1 : tunables 120 60 8 : slabdata 0 0 0
ext4_prealloc_space 0 0 120 528 1 : tunables 120 60 8 : slabdata 0 0 0
ext3_inode_cache 367 5696 1016 64 1 : tunables 54 27 8 : slabdata 89 89 0
ext3_xattr 99 1422 88 711 1 : tunables 120 60 8 : slabdata 2 2 0
dnotify_cache 1 1488 40 1488 1 : tunables 120 60 8 : slabdata 1 1 0
kioctx 0 0 384 168 1 : tunables 54 27 8 : slabdata 0 0 0
kiocb 0 0 256 251 1 : tunables 120 60 8 : slabdata 0 0 0
inotify_event_cache 0 0 40 1488 1 : tunables 120 60 8 : slabdata 0 0 0
inotify_watch_cache 1 861 72 861 1 : tunables 120 60 8 : slabdata 1 1 0
fasync_cache 0 0 24 2338 1 : tunables 120 60 8 : slabdata 0 0 0
shmem_inode_cache 864 1105 1000 65 1 : tunables 54 27 8 : slabdata 17 17 0
pid_namespace 0 0 184 348 1 : tunables 120 60 8 : slabdata 0 0 0
nsproxy 0 0 56 1091 1 : tunables 120 60 8 : slabdata 0 0 0
posix_timers_cache 0 0 184 348 1 : tunables 120 60 8 : slabdata 0 0 0
uid_cache 6 502 256 251 1 : tunables 120 60 8 : slabdata 2 2 0
ia64_partial_page_cache 0 0 48 1259 1 : tunables 120 60 8 : slabdata 0 0 0
UNIX 32 126 1024 63 1 : tunables 54 27 8 : slabdata 2 2 0
UDP-Lite 0 0 1024 63 1 : tunables 54 27 8 : slabdata 0 0 0
tcp_bind_bucket 4 1924 64 962 1 : tunables 120 60 8 : slabdata 2 2 0
inet_peer_cache 0 0 64 962 1 : tunables 120 60 8 : slabdata 0 0 0
secpath_cache 0 0 64 962 1 : tunables 120 60 8 : slabdata 0 0 0
xfrm_dst_cache 0 0 384 168 1 : tunables 54 27 8 : slabdata 0 0 0
ip_fib_alias 3 1818 32 1818 1 : tunables 120 60 8 : slabdata 1 1 0
ip_fib_hash 15 1722 72 861 1 : tunables 120 60 8 : slabdata 2 2 0
ip_dst_cache 50 336 384 168 1 : tunables 54 27 8 : slabdata 2 2 0
arp_cache 1 251 256 251 1 : tunables 120 60 8 : slabdata 1 1 0
RAW 129 216 896 72 1 : tunables 54 27 8 : slabdata 3 3 0
UDP 9 126 1024 63 1 : tunables 54 27 8 : slabdata 2 2 0
tw_sock_TCP 0 0 256 251 1 : tunables 120 60 8 : slabdata 0 0 0
request_sock_TCP 0 0 128 496 1 : tunables 120 60 8 : slabdata 0 0 0
TCP 5 72 1792 36 1 : tunables 24 12 8 : slabdata 2 2 0
eventpoll_pwq 0 0 72 861 1 : tunables 120 60 8 : slabdata 0 0 0
eventpoll_epi 0 0 128 496 1 : tunables 120 60 8 : slabdata 0 0 0
sgpool-128 2 30 4096 15 1 : tunables 24 12 8 : slabdata 2 2 0
sgpool-64 2 62 2048 31 1 : tunables 24 12 8 : slabdata 2 2 0
sgpool-32 2 126 1024 63 1 : tunables 54 27 8 : slabdata 2 2 0
sgpool-16 2 252 512 126 1 : tunables 54 27 8 : slabdata 2 2 0
sgpool-8 18 502 256 251 1 : tunables 120 60 8 : slabdata 2 2 0
scsi_data_buffer 0 0 24 2338 1 : tunables 120 60 8 : slabdata 0 0 0
scsi_io_context 0 0 112 564 1 : tunables 120 60 8 : slabdata 0 0 0
blkdev_queue 26 70 1864 35 1 : tunables 24 12 8 : slabdata 2 2 0
blkdev_requests 44 212 304 212 1 : tunables 54 27 8 : slabdata 1 1 0
blkdev_ioc 38 1308 96 654 1 : tunables 120 60 8 : slabdata 2 2 0
biovec-256 34 60 4096 15 1 : tunables 24 12 8 : slabdata 4 4 0
biovec-128 34 93 2048 31 1 : tunables 24 12 8 : slabdata 3 3 0
biovec-64 34 126 1024 63 1 : tunables 54 27 8 : slabdata 2 2 0
biovec-16 34 502 256 251 1 : tunables 120 60 8 : slabdata 2 2 0
biovec-4 34 1924 64 962 1 : tunables 120 60 8 : slabdata 2 2 0
biovec-1 37 6548 16 3274 1 : tunables 120 60 8 : slabdata 2 2 0
bio 37 992 128 496 1 : tunables 120 60 8 : slabdata 2 2 0
sock_inode_cache 188 288 896 72 1 : tunables 54 27 8 : slabdata 4 4 0
skbuff_fclone_cache 16 126 512 126 1 : tunables 54 27 8 : slabdata 1 1 0
skbuff_head_cache 1812 11546 256 251 1 : tunables 120 60 8 : slabdata 46 46 0
file_lock_cache 4 668 192 334 1 : tunables 120 60 8 : slabdata 2 2 0
Acpi-Operand 24947 26691 72 861 1 : tunables 120 60 8 : slabdata 31 31 0
Acpi-ParseExt 0 0 72 861 1 : tunables 120 60 8 : slabdata 0 0 0
Acpi-Parse 0 0 48 1259 1 : tunables 120 60 8 : slabdata 0 0 0
Acpi-State 0 0 80 779 1 : tunables 120 60 8 : slabdata 0 0 0
Acpi-Namespace 18877 21816 32 1818 1 : tunables 120 60 8 : slabdata 12 12 0
page_cgroup 1183 142848 40 1488 1 : tunables 120 60 8 : slabdata 96 96 0
proc_inode_cache 197 902 792 82 1 : tunables 54 27 8 : slabdata 11 11 0
sigqueue 0 0 160 399 1 : tunables 120 60 8 : slabdata 0 0 0
radix_tree_node 719 7254 552 117 1 : tunables 54 27 8 : slabdata 62 62 0
bdev_cache 30 126 1024 63 1 : tunables 54 27 8 : slabdata 2 2 0
sysfs_dir_cache 11089 12464 80 779 1 : tunables 120 60 8 : slabdata 16 16 0
mnt_cache 24 502 256 251 1 : tunables 120 60 8 : slabdata 2 2 0
inode_cache 54 696 744 87 1 : tunables 54 27 8 : slabdata 8 8 0
dentry 1577 17794 224 287 1 : tunables 120 60 8 : slabdata 62 62 0
filp 706 3765 256 251 1 : tunables 120 60 8 : slabdata 15 15 0
names_cache 46 105 4096 15 1 : tunables 24 12 8 : slabdata 7 7 0
buffer_head 3557 125442 104 606 1 : tunables 120 60 8 : slabdata 207 207 0
mm_struct 76 288 896 72 1 : tunables 54 27 8 : slabdata 4 4 0
vm_area_struct 1340 2178 176 363 1 : tunables 120 60 8 : slabdata 6 6 36
fs_cache 61 992 128 496 1 : tunables 120 60 8 : slabdata 2 2 0
files_cache 62 336 768 84 1 : tunables 54 27 8 : slabdata 4 4 0
signal_cache 161 588 768 84 1 : tunables 54 27 8 : slabdata 7 7 0
sighand_cache 157 390 1664 39 1 : tunables 24 12 8 : slabdata 10 10 0
anon_vma 657 2976 40 1488 1 : tunables 120 60 8 : slabdata 2 2 0
pid 160 992 128 496 1 : tunables 120 60 8 : slabdata 2 2 0
shared_policy_node 0 0 48 1259 1 : tunables 120 60 8 : slabdata 0 0 0
numa_policy 7 244 264 244 1 : tunables 54 27 8 : slabdata 1 1 0
idr_layer_cache 150 476 544 119 1 : tunables 54 27 8 : slabdata 4 4 0
size-33554432(DMA) 0 0 33554432 1 512 : tunables 1 1 0 : slabdata 0 0 0
size-33554432 0 0 33554432 1 512 : tunables 1 1 0 : slabdata 0 0 0
size-16777216(DMA) 0 0 16777216 1 256 : tunables 1 1 0 : slabdata 0 0 0
size-16777216 0 0 16777216 1 256 : tunables 1 1 0 : slabdata 0 0 0
size-8388608(DMA) 0 0 8388608 1 128 : tunables 1 1 0 : slabdata 0 0 0
size-8388608 0 0 8388608 1 128 : tunables 1 1 0 : slabdata 0 0 0
size-4194304(DMA) 0 0 4194304 1 64 : tunables 1 1 0 : slabdata 0 0 0
size-4194304 0 0 4194304 1 64 : tunables 1 1 0 : slabdata 0 0 0
size-2097152(DMA) 0 0 2097152 1 32 : tunables 1 1 0 : slabdata 0 0 0
size-2097152 0 0 2097152 1 32 : tunables 1 1 0 : slabdata 0 0 0
size-1048576(DMA) 0 0 1048576 1 16 : tunables 1 1 0 : slabdata 0 0 0
size-1048576 0 0 1048576 1 16 : tunables 1 1 0 : slabdata 0 0 0
size-524288(DMA) 0 0 524288 1 8 : tunables 1 1 0 : slabdata 0 0 0
size-524288 0 0 524288 1 8 : tunables 1 1 0 : slabdata 0 0 0
size-262144(DMA) 0 0 262144 1 4 : tunables 1 1 0 : slabdata 0 0 0
size-262144 0 0 262144 1 4 : tunables 1 1 0 : slabdata 0 0 0
size-131072(DMA) 0 0 131072 1 2 : tunables 8 4 0 : slabdata 0 0 0
size-131072 1 1 131072 1 2 : tunables 8 4 0 : slabdata 1 1 0
size-65536(DMA) 0 0 65536 1 1 : tunables 24 12 8 : slabdata 0 0 0
size-65536 4 4 65536 1 1 : tunables 24 12 8 : slabdata 4 4 0
size-32768(DMA) 0 0 32768 2 1 : tunables 24 12 8 : slabdata 0 0 0
size-32768 12 14 32768 2 1 : tunables 24 12 8 : slabdata 7 7 0
size-16384(DMA) 0 0 16384 4 1 : tunables 24 12 8 : slabdata 0 0 0
size-16384 15 28 16384 4 1 : tunables 24 12 8 : slabdata 7 7 0
size-8192(DMA) 0 0 8192 8 1 : tunables 24 12 8 : slabdata 0 0 0
size-8192 2455 2472 8192 8 1 : tunables 24 12 8 : slabdata 309 309 0
size-4096(DMA) 0 0 4096 15 1 : tunables 24 12 8 : slabdata 0 0 0
size-4096 1607 1665 4096 15 1 : tunables 24 12 8 : slabdata 111 111 0
size-2048(DMA) 0 0 2048 31 1 : tunables 24 12 8 : slabdata 0 0 0
size-2048 2706 2914 2048 31 1 : tunables 24 12 8 : slabdata 94 94 0
size-1024(DMA) 0 0 1024 63 1 : tunables 54 27 8 : slabdata 0 0 0
size-1024 2414 2583 1024 63 1 : tunables 54 27 8 : slabdata 41 41 0
size-512(DMA) 0 0 512 126 1 : tunables 54 27 8 : slabdata 0 0 0
size-512 1805 2142 512 126 1 : tunables 54 27 8 : slabdata 17 17 0
size-256(DMA) 0 0 256 251 1 : tunables 120 60 8 : slabdata 0 0 0
size-256 44889 48945 256 251 1 : tunables 120 60 8 : slabdata 195 195 0
size-128(DMA) 0 0 128 496 1 : tunables 120 60 8 : slabdata 0 0 0
size-64(DMA) 0 0 64 962 1 : tunables 120 60 8 : slabdata 0 0 0
size-128 28119 30256 128 496 1 : tunables 120 60 8 : slabdata 61 61 0
size-64 14597 22126 64 962 1 : tunables 120 60 8 : slabdata 23 23 0
kmem_cache 151 155 12416 5 1 : tunables 24 12 8 : slabdata 31 31 0
<SLUB>
% cat /proc/meminfo
MemTotal: 7701376 kB
MemFree: 4740928 kB
Buffers: 4544 kB
Cached: 35584 kB
SwapCached: 0 kB
Active: 119104 kB
Inactive: 9920 kB
Active(anon): 90240 kB
Inactive(anon): 0 kB
Active(file): 28864 kB
Inactive(file): 9920 kB
Unevictable: 0 kB
Mlocked: 0 kB
SwapTotal: 2031488 kB
SwapFree: 2031488 kB
Dirty: 64 kB
Writeback: 0 kB
AnonPages: 89152 kB
Mapped: 31232 kB
Slab: 1591680 kB
SReclaimable: 12608 kB
SUnreclaim: 1579072 kB
PageTables: 11904 kB
NFS_Unstable: 0 kB
Bounce: 0 kB
WritebackTmp: 0 kB
CommitLimit: 5882176 kB
Committed_AS: 446848 kB
VmallocTotal: 17592177655808 kB
VmallocUsed: 29056 kB
VmallocChunk: 17592177626432 kB
HugePages_Total: 0
HugePages_Free: 0
HugePages_Rsvd: 0
HugePages_Surp: 0
Hugepagesize: 262144 kB
% cat /proc/slabinfo
slabinfo - version: 2.1
# name <active_objs> <num_objs> <objsize> <objperslab> <pagesperslab> : tunables <limit> <batchcount> <sharedfactor> : slabdata <active_slabs> <num_slabs> <sharedavail>
kcopyd_job 0 0 408 160 1 : tunables 0 0 0 : slabdata 0 0 0
cfq_io_context 3120 3120 168 390 1 : tunables 0 0 0 : slabdata 8 8 0
cfq_queue 3848 3848 136 481 1 : tunables 0 0 0 : slabdata 8 8 0
mqueue_inode_cache 56 56 1152 56 1 : tunables 0 0 0 : slabdata 1 1 0
fat_inode_cache 77 77 848 77 1 : tunables 0 0 0 : slabdata 1 1 0
fat_cache 0 0 40 1638 1 : tunables 0 0 0 : slabdata 0 0 0
hugetlbfs_inode_cache 83 83 784 83 1 : tunables 0 0 0 : slabdata 1 1 0
ext2_inode_cache 0 0 1032 63 1 : tunables 0 0 0 : slabdata 0 0 0
journal_handle 21840 21840 24 2730 1 : tunables 0 0 0 : slabdata 8 8 0
journal_head 4774 4774 96 682 1 : tunables 0 0 0 : slabdata 7 7 0
revoke_table 4096 4096 16 4096 1 : tunables 0 0 0 : slabdata 1 1 0
revoke_record 2048 2048 32 2048 1 : tunables 0 0 0 : slabdata 1 1 0
ext4_inode_cache 0 0 1200 54 1 : tunables 0 0 0 : slabdata 0 0 0
ext4_alloc_context 0 0 168 390 1 : tunables 0 0 0 : slabdata 0 0 0
ext4_prealloc_space 0 0 120 546 1 : tunables 0 0 0 : slabdata 0 0 0
ext3_inode_cache 750 2624 1024 64 1 : tunables 0 0 0 : slabdata 41 41 0
ext3_xattr 4464 4464 88 744 1 : tunables 0 0 0 : slabdata 6 6 0
shmem_inode_cache 1256 1365 1008 65 1 : tunables 0 0 0 : slabdata 21 21 0
nsproxy 0 0 56 1170 1 : tunables 0 0 0 : slabdata 0 0 0
posix_timers_cache 0 0 184 356 1 : tunables 0 0 0 : slabdata 0 0 0
ip_dst_cache 1360 1360 384 170 1 : tunables 0 0 0 : slabdata 8 8 0
TCP 180 180 1792 36 1 : tunables 0 0 0 : slabdata 5 5 0
scsi_data_buffer 21840 21840 24 2730 1 : tunables 0 0 0 : slabdata 8 8 0
scsi_io_context 0 0 112 585 1 : tunables 0 0 0 : slabdata 0 0 0
blkdev_queue 140 140 1864 70 2 : tunables 0 0 0 : slabdata 2 2 0
blkdev_requests 1720 1720 304 215 1 : tunables 0 0 0 : slabdata 8 8 0
sock_inode_cache 758 949 896 73 1 : tunables 0 0 0 : slabdata 13 13 0
file_lock_cache 2289 2289 200 327 1 : tunables 0 0 0 : slabdata 7 7 0
Acpi-ParseExt 29117 29120 72 910 1 : tunables 0 0 0 : slabdata 32 32 0
page_cgroup 14660 24570 40 1638 1 : tunables 0 0 0 : slabdata 15 15 0
proc_inode_cache 732 810 800 81 1 : tunables 0 0 0 : slabdata 10 10 0
sigqueue 3272 3272 160 409 1 : tunables 0 0 0 : slabdata 8 8 0
radix_tree_node 1200 1755 560 117 1 : tunables 0 0 0 : slabdata 15 15 0
bdev_cache 256 256 1024 64 1 : tunables 0 0 0 : slabdata 4 4 0
sysfs_dir_cache 16376 16380 80 819 1 : tunables 0 0 0 : slabdata 20 20 0
inode_cache 707 957 752 87 1 : tunables 0 0 0 : slabdata 11 11 0
dentry 3503 11096 224 292 1 : tunables 0 0 0 : slabdata 38 38 0
buffer_head 6920 23985 112 585 1 : tunables 0 0 0 : slabdata 41 41 0
mm_struct 741 1022 896 73 1 : tunables 0 0 0 : slabdata 14 14 0
vm_area_struct 4015 5208 176 372 1 : tunables 0 0 0 : slabdata 14 14 0
signal_cache 801 1020 768 85 1 : tunables 0 0 0 : slabdata 12 12 0
sighand_cache 433 546 1664 39 1 : tunables 0 0 0 : slabdata 14 14 0
anon_vma 10920 10920 48 1365 1 : tunables 0 0 0 : slabdata 8 8 0
shared_policy_node 5460 5460 48 1365 1 : tunables 0 0 0 : slabdata 4 4 0
numa_policy 248 248 264 248 1 : tunables 0 0 0 : slabdata 1 1 0
idr_layer_cache 944 944 552 118 1 : tunables 0 0 0 : slabdata 8 8 0
kmalloc-65536 32 32 65536 4 4 : tunables 0 0 0 : slabdata 8 8 0
kmalloc-32768 128 128 32768 16 8 : tunables 0 0 0 : slabdata 8 8 0
kmalloc-16384 160 160 16384 32 8 : tunables 0 0 0 : slabdata 5 5 0
kmalloc-8192 448 448 8192 64 8 : tunables 0 0 0 : slabdata 7 7 0
kmalloc-4096 819 14336 4096 64 4 : tunables 0 0 0 : slabdata 224 224 0
kmalloc-2048 2409 8384 2048 64 2 : tunables 0 0 0 : slabdata 131 131 0
kmalloc-1024 1848 14912 1024 64 1 : tunables 0 0 0 : slabdata 233 233 0
kmalloc-512 2306 2432 512 128 1 : tunables 0 0 0 : slabdata 19 19 0
kmalloc-256 13919 123904 256 256 1 : tunables 0 0 0 : slabdata 484 484 0
kmalloc-128 28739 10747904 128 512 1 : tunables 0 0 0 : slabdata 20992 20992 0
kmalloc-64 10224 10240 64 1024 1 : tunables 0 0 0 : slabdata 10 10 0
kmalloc-32 34806 34816 32 2048 1 : tunables 0 0 0 : slabdata 17 17 0
kmalloc-16 32768 32768 16 4096 1 : tunables 0 0 0 : slabdata 8 8 0
kmalloc-8 65536 65536 8 8192 1 : tunables 0 0 0 : slabdata 8 8 0
kmalloc-192 4609 447051 192 341 1 : tunables 0 0 0 : slabdata 1311 1311 0
kmalloc-96 5456 5456 96 682 1 : tunables 0 0 0 : slabdata 8 8 0
kmem_cache_node 3276 3276 80 819 1 : tunables 0 0 0 : slabdata 4 4 0
% slabinfo
Name Objects Objsize Space Slabs/Part/Cpu O/S O %Fr %Ef Flg
:at-0000016 4096 16 65.5K 0/0/1 4096 0 0 100 *a
:at-0000024 21840 24 524.2K 0/0/8 2730 0 0 99 *a
:at-0000032 2048 32 65.5K 0/0/1 2048 0 0 100 *Aa
:at-0000088 4464 88 393.2K 0/0/6 744 0 0 99 *a
:at-0000096 4774 96 458.7K 0/0/7 682 0 0 99 *a
:t-0000016 32768 16 524.2K 0/0/8 4096 0 0 100 *
:t-0000024 21840 24 524.2K 0/0/8 2730 0 0 99 *
:t-0000032 34806 32 1.1M 9/1/8 2048 0 5 99 *
:t-0000040 14660 40 983.0K 7/7/8 1638 0 46 59 *
:t-0000048 5460 48 262.1K 0/0/4 1365 0 0 99 *
:t-0000064 10224 64 655.3K 2/1/8 1024 0 10 99 *
:t-0000072 29117 72 2.0M 26/2/6 910 0 6 99 *
:t-0000080 16376 80 1.3M 12/1/8 819 0 5 99 *
:t-0000096 5456 96 524.2K 0/0/8 682 0 0 99 *
:t-0000128 28739 128 1.3G 20984/20984/8 512 0 99 0 *
:t-0000256 15285 256 31.7M 476/438/8 256 0 90 12 *
:t-0000384 1360 352 524.2K 0/0/8 170 0 0 91 *A
:t-0000512 2306 512 1.2M 11/3/8 128 0 15 94 *
:t-0000768 801 768 786.4K 4/4/8 85 0 33 78 *A
:t-0000896 741 880 917.5K 6/5/8 73 0 35 71 *A
:t-0001024 1848 1024 15.2M 225/214/8 64 0 91 12 *
:t-0002048 2406 2048 17.1M 123/115/8 64 1 87 28 *
:t-0004096 819 4096 58.7M 216/216/8 64 2 96 5 *
anon_vma 10920 40 524.2K 0/0/8 1365 0 0 83
bdev_cache 256 1008 262.1K 0/0/4 64 0 0 98 Aa
blkdev_queue 140 1864 262.1K 0/0/2 70 1 0 99
blkdev_requests 1720 304 524.2K 0/0/8 215 0 0 99
buffer_head 7493 104 2.6M 33/32/8 585 0 78 29 a
cfq_io_context 3120 168 524.2K 0/0/8 390 0 0 99
cfq_queue 3848 136 524.2K 0/0/8 481 0 0 99
dentry 3793 224 2.4M 30/29/8 292 0 76 34 a
ext3_inode_cache 750 1016 2.6M 33/33/8 64 0 80 28 a
fat_inode_cache 77 840 65.5K 0/0/1 77 0 0 98 a
file_lock_cache 2289 192 458.7K 0/0/7 327 0 0 95
hugetlbfs_inode_cache 83 776 65.5K 0/0/1 83 0 0 98
idr_layer_cache 944 544 524.2K 0/0/8 118 0 0 97
inode_cache 1044 744 786.4K 4/0/8 87 0 0 98 a
kmalloc-16384 160 16384 2.6M 0/0/5 32 3 0 100
kmalloc-192 4609 192 85.9M 1303/1303/8 341 0 99 1
kmalloc-32768 128 32768 4.1M 0/0/8 16 3 0 100
kmalloc-65536 32 65536 2.0M 0/0/8 4 2 0 100
kmalloc-8 65536 8 524.2K 0/0/8 8192 0 0 100
kmalloc-8192 448 8192 3.6M 0/0/7 64 3 0 100
kmem_cache_node 3276 80 262.1K 0/0/4 819 0 0 99 *
mqueue_inode_cache 56 1064 65.5K 0/0/1 56 0 0 90 A
numa_policy 248 264 65.5K 0/0/1 248 0 0 99
proc_inode_cache 732 792 655.3K 2/1/8 81 0 10 88 a
radix_tree_node 1200 552 983.0K 7/7/8 117 0 46 67 a
shmem_inode_cache 1256 1000 1.3M 13/4/8 65 0 19 91
sighand_cache 433 1608 917.5K 6/4/8 39 0 28 75 A
sigqueue 3272 160 524.2K 0/0/8 409 0 0 99
sock_inode_cache 758 832 851.9K 5/4/8 73 0 30 74 Aa
TCP 180 1712 327.6K 0/0/5 36 0 0 94 A
vm_area_struct 4015 176 917.5K 6/6/8 372 0 42 77
^ permalink raw reply [flat|nested] 64+ messages in thread
* Re: No, really, stop trying to delete slab until you've finished making slub perform as well
2008-08-13 10:46 ` KOSAKI Motohiro
@ 2008-08-13 13:10 ` Christoph Lameter
2008-08-13 14:14 ` KOSAKI Motohiro
2008-08-14 7:15 ` Pekka Enberg
0 siblings, 2 replies; 64+ messages in thread
From: Christoph Lameter @ 2008-08-13 13:10 UTC (permalink / raw)
To: KOSAKI Motohiro
Cc: KOSAKI Motohiro, Matthew Wilcox, Pekka Enberg, akpm, linux-kernel,
linux-fsdevel, Mel Gorman, andi, Rik van Riel
KOSAKI Motohiro wrote:
> <SLUB>
>
> % cat /proc/meminfo
>
> Slab: 1591680 kB
> SReclaimable: 12608 kB
> SUnreclaim: 1579072 kB
Unreclaimable grew very big.
> :t-0000128 28739 128 1.3G 20984/20984/8 512 0 99 0 *
Argh. Most slabs contain a single object. Probably due to the conflict resolution.
> kmalloc-192 4609 192 85.9M 1303/1303/8 341 0 99 1
And a similar but not so severe issue here.
The obvious fix is to avoid allocating another slab on conflict but how will
this impact performance?
Index: linux-2.6/mm/slub.c
===================================================================
--- linux-2.6.orig/mm/slub.c 2008-08-13 08:06:00.000000000 -0500
+++ linux-2.6/mm/slub.c 2008-08-13 08:07:59.000000000 -0500
@@ -1253,13 +1253,11 @@
static inline int lock_and_freeze_slab(struct kmem_cache_node *n,
struct page *page)
{
- if (slab_trylock(page)) {
- list_del(&page->lru);
- n->nr_partial--;
- __SetPageSlubFrozen(page);
- return 1;
- }
- return 0;
+ slab_lock(page);
+ list_del(&page->lru);
+ n->nr_partial--;
+ __SetPageSlubFrozen(page);
+ return 1;
}
^ permalink raw reply [flat|nested] 64+ messages in thread
* Re: No, really, stop trying to delete slab until you've finished making slub perform as well
2008-08-13 13:10 ` Christoph Lameter
@ 2008-08-13 14:14 ` KOSAKI Motohiro
2008-08-13 14:16 ` Pekka Enberg
2008-08-13 14:31 ` Christoph Lameter
2008-08-14 7:15 ` Pekka Enberg
1 sibling, 2 replies; 64+ messages in thread
From: KOSAKI Motohiro @ 2008-08-13 14:14 UTC (permalink / raw)
To: Christoph Lameter
Cc: Matthew Wilcox, Pekka Enberg, akpm, linux-kernel, linux-fsdevel,
Mel Gorman, andi, Rik van Riel
>> :t-0000128 28739 128 1.3G 20984/20984/8 512 0 99 0 *
>
> Argh. Most slabs contain a single object. Probably due to the conflict resolution.
Agreed, the issue exists in the lock contention code.
> The obvious fix is to avoid allocating another slab on conflict but how will
> this impact performance?
>
>
> Index: linux-2.6/mm/slub.c
> ===================================================================
> --- linux-2.6.orig/mm/slub.c 2008-08-13 08:06:00.000000000 -0500
> +++ linux-2.6/mm/slub.c 2008-08-13 08:07:59.000000000 -0500
> @@ -1253,13 +1253,11 @@
> static inline int lock_and_freeze_slab(struct kmem_cache_node *n,
> struct page *page)
> {
> - if (slab_trylock(page)) {
> - list_del(&page->lru);
> - n->nr_partial--;
> - __SetPageSlubFrozen(page);
> - return 1;
> - }
> - return 0;
> + slab_lock(page);
> + list_del(&page->lru);
> + n->nr_partial--;
> + __SetPageSlubFrozen(page);
> + return 1;
> }
I haven't measured it yet. I don't like this patch;
it may hurt other typical benchmarks.
So I think a better way is:
1. slab_trylock(), if success goto 10.
2. check fragmentation ratio, if low goto 10
3. slab_lock()
10. return func
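Roughly, as a sketch against lock_and_freeze_slab() (the sparseness check and its 25% threshold are made up here; only the control flow matters):

/* Hypothetical sparseness check: is the slab mostly empty?
 * Racy unlocked read, which is fine for a heuristic. */
static inline int partial_is_sparse(struct page *page)
{
	return page->inuse * 4 < page->objects;	/* made-up threshold */
}

static inline int lock_and_freeze_slab(struct kmem_cache_node *n,
					struct page *page)
{
	if (!slab_trylock(page)) {
		/* Contended: only wait for badly fragmented slabs,
		 * otherwise let the caller look at the next one. */
		if (!partial_is_sparse(page))
			return 0;
		slab_lock(page);
	}
	list_del(&page->lru);
	n->nr_partial--;
	__SetPageSlubFrozen(page);
	return 1;
}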
I think this way doesn't cause a performance regression,
because high fragmentation causes defragmentation and compaction later on.
So preventing fragmentation often increases performance.
Thoughts?
^ permalink raw reply [flat|nested] 64+ messages in thread
* Re: No, really, stop trying to delete slab until you've finished making slub perform as well
2008-08-13 14:14 ` KOSAKI Motohiro
@ 2008-08-13 14:16 ` Pekka Enberg
2008-08-13 14:31 ` Christoph Lameter
1 sibling, 0 replies; 64+ messages in thread
From: Pekka Enberg @ 2008-08-13 14:16 UTC (permalink / raw)
To: KOSAKI Motohiro
Cc: Christoph Lameter, Matthew Wilcox, akpm, linux-kernel,
linux-fsdevel, Mel Gorman, andi, Rik van Riel
On Wed, 2008-08-13 at 23:14 +0900, KOSAKI Motohiro wrote:
> >> :t-0000128 28739 128 1.3G 20984/20984/8 512 0 99 0 *
> >
> > Argh. Most slabs contain a single object. Probably due to the conflict resolution.
>
> agreed with the issue exist in lock contention code.
>
>
> > The obvious fix is to avoid allocating another slab on conflict but how will
> > this impact performance?
> >
> >
> > Index: linux-2.6/mm/slub.c
> > ===================================================================
> > --- linux-2.6.orig/mm/slub.c 2008-08-13 08:06:00.000000000 -0500
> > +++ linux-2.6/mm/slub.c 2008-08-13 08:07:59.000000000 -0500
> > @@ -1253,13 +1253,11 @@
> > static inline int lock_and_freeze_slab(struct kmem_cache_node *n,
> > struct page *page)
> > {
> > - if (slab_trylock(page)) {
> > - list_del(&page->lru);
> > - n->nr_partial--;
> > - __SetPageSlubFrozen(page);
> > - return 1;
> > - }
> > - return 0;
> > + slab_lock(page);
> > + list_del(&page->lru);
> > + n->nr_partial--;
> > + __SetPageSlubFrozen(page);
> > + return 1;
> > }
>
> I haven't measured it yet. I don't like this patch;
> it may hurt other typical benchmarks.
>
> So I think a better way is:
>
> 1. slab_trylock(), if success goto 10.
> 2. check fragmentation ratio, if low goto 10
> 3. slab_lock()
> 10. return func
>
> I think this way doesn't cause a performance regression,
> because high fragmentation causes defragmentation and compaction later on.
> So preventing fragmentation often increases performance.
>
> Thoughts?
I guess that would work. But how exactly would you quantify
"fragmentation ratio?"
^ permalink raw reply [flat|nested] 64+ messages in thread
* Re: No, really, stop trying to delete slab until you've finished making slub perform as well
2008-08-13 14:14 ` KOSAKI Motohiro
2008-08-13 14:16 ` Pekka Enberg
@ 2008-08-13 14:31 ` Christoph Lameter
2008-08-13 15:05 ` KOSAKI Motohiro
1 sibling, 1 reply; 64+ messages in thread
From: Christoph Lameter @ 2008-08-13 14:31 UTC (permalink / raw)
To: KOSAKI Motohiro
Cc: Matthew Wilcox, Pekka Enberg, akpm, linux-kernel, linux-fsdevel,
Mel Gorman, andi, Rik van Riel
KOSAKI Motohiro wrote:
>
> I haven't measured it yet. I don't like this patch;
> it may hurt other typical benchmarks.
Yes but running with this patch would allow us to verify that we understand
what is causing the problem. There are other solutions like skipping to the
next partial slab on the list that could fix performance issues that the patch
may cause. A test will give us:
1. Confirmation that the memory use is caused by the trylock.
2. Some performance numbers. If these show a regression then we have some
markers that we can measure other solutions against.
^ permalink raw reply [flat|nested] 64+ messages in thread
* Re: No, really, stop trying to delete slab until you've finished making slub perform as well
2008-08-13 14:31 ` Christoph Lameter
@ 2008-08-13 15:05 ` KOSAKI Motohiro
2008-08-14 19:44 ` Christoph Lameter
0 siblings, 1 reply; 64+ messages in thread
From: KOSAKI Motohiro @ 2008-08-13 15:05 UTC (permalink / raw)
To: Christoph Lameter
Cc: Matthew Wilcox, Pekka Enberg, akpm, linux-kernel, linux-fsdevel,
Mel Gorman, andi, Rik van Riel
> Yes but running with this patch would allow us to verify that we understand
> what is causing the problem. There are other solutions like skipping to the
> next partial slab on the list that could fix performance issues that the patch
> may cause. A test will give us:
>
> 1. Confirmation that the memory use is caused by the trylock.
>
> 2. Some performance numbers. If these show a regression then we have some
> markers that we can measure other solutions against.
Okay.
I will verify the patch next week.
(Unfortunately, my company is closed for the rest of this week.)
Thanks.
^ permalink raw reply [flat|nested] 64+ messages in thread
* Re: No, really, stop trying to delete slab until you've finished making slub perform as well
2008-08-13 13:10 ` Christoph Lameter
2008-08-13 14:14 ` KOSAKI Motohiro
@ 2008-08-14 7:15 ` Pekka Enberg
2008-08-14 14:45 ` Christoph Lameter
2008-08-14 15:06 ` Christoph Lameter
1 sibling, 2 replies; 64+ messages in thread
From: Pekka Enberg @ 2008-08-14 7:15 UTC (permalink / raw)
To: Christoph Lameter
Cc: KOSAKI Motohiro, KOSAKI Motohiro, Matthew Wilcox, akpm,
linux-kernel, linux-fsdevel, Mel Gorman, andi, Rik van Riel
Hi Christoph,
Christoph Lameter wrote:
> The obvious fix is to avoid allocating another slab on conflict but how will
> this impact performance?
>
>
> Index: linux-2.6/mm/slub.c
> ===================================================================
> --- linux-2.6.orig/mm/slub.c 2008-08-13 08:06:00.000000000 -0500
> +++ linux-2.6/mm/slub.c 2008-08-13 08:07:59.000000000 -0500
> @@ -1253,13 +1253,11 @@
> static inline int lock_and_freeze_slab(struct kmem_cache_node *n,
> struct page *page)
> {
> - if (slab_trylock(page)) {
> - list_del(&page->lru);
> - n->nr_partial--;
> - __SetPageSlubFrozen(page);
> - return 1;
> - }
> - return 0;
> + slab_lock(page);
> + list_del(&page->lru);
> + n->nr_partial--;
> + __SetPageSlubFrozen(page);
> + return 1;
> }
This patch hard locks on my 2-way 64-bit x86 machine (sysrq doesn't
respond) when I run hackbench.
^ permalink raw reply [flat|nested] 64+ messages in thread
* Re: No, really, stop trying to delete slab until you've finished making slub perform as well
2008-08-14 7:15 ` Pekka Enberg
@ 2008-08-14 14:45 ` Christoph Lameter
2008-08-14 15:06 ` Christoph Lameter
1 sibling, 0 replies; 64+ messages in thread
From: Christoph Lameter @ 2008-08-14 14:45 UTC (permalink / raw)
To: Pekka Enberg
Cc: KOSAKI Motohiro, KOSAKI Motohiro, Matthew Wilcox, akpm,
linux-kernel, linux-fsdevel, Mel Gorman, andi, Rik van Riel
Pekka Enberg wrote:
> This patch hard locks on my 2-way 64-bit x86 machine (sysrq doesn't
> respond) when I run hackbench.
Hmmm.. Then the issue may be different from what we thought. The lock may be
taken recursively in some situations.
Can you enable lockdep?
^ permalink raw reply [flat|nested] 64+ messages in thread
* Re: No, really, stop trying to delete slab until you've finished making slub perform as well
2008-08-14 7:15 ` Pekka Enberg
2008-08-14 14:45 ` Christoph Lameter
@ 2008-08-14 15:06 ` Christoph Lameter
1 sibling, 0 replies; 64+ messages in thread
From: Christoph Lameter @ 2008-08-14 15:06 UTC (permalink / raw)
To: Pekka Enberg
Cc: KOSAKI Motohiro, KOSAKI Motohiro, Matthew Wilcox, akpm,
linux-kernel, linux-fsdevel, Mel Gorman, andi, Rik van Riel
Pekka Enberg wrote:
>
> This patch hard locks on my 2-way 64-bit x86 machine (sysrq doesn't
> respond) when I run hackbench.
At that point we take the list_lock and then the slab lock, which is a
lock inversion if we do not use a trylock here. Crap.
Hmmm.. The code already goes to the next slab if an earlier one is
already locked. So I do not see how the large partial lists could be
generated.
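For reference, the two orderings that collide look roughly like this (a simplified comment sketch of the mm/slub.c paths of that time):

/*
 * Simplified sketch of the inversion:
 *
 *   CPU A: get_partial_node()          CPU B: deactivate_slab()
 *     spin_lock(&n->list_lock);          slab_lock(page);
 *     slab_lock(page);    <- waits       unfreeze_slab() -> add_partial():
 *                                          spin_lock(&n->list_lock); <- waits
 *
 * Each side holds one lock and waits for the other. The slab_trylock()
 * in lock_and_freeze_slab() breaks the cycle by giving up instead of
 * blocking on the slab lock.
 */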
^ permalink raw reply [flat|nested] 64+ messages in thread
* Re: No, really, stop trying to delete slab until you've finished making slub perform as well
2008-08-13 15:05 ` KOSAKI Motohiro
@ 2008-08-14 19:44 ` Christoph Lameter
2008-08-15 16:44 ` KOSAKI Motohiro
0 siblings, 1 reply; 64+ messages in thread
From: Christoph Lameter @ 2008-08-14 19:44 UTC (permalink / raw)
To: KOSAKI Motohiro
Cc: Matthew Wilcox, Pekka Enberg, akpm, linux-kernel, linux-fsdevel,
Mel Gorman, andi, Rik van Riel
This is a NUMA system, right? Then we have another mechanism that will avoid
off-node memory references by allocating new slabs. Can you set the
node_defrag parameter to 0? (Noted by Adrian.)
^ permalink raw reply [flat|nested] 64+ messages in thread
* Re: No, really, stop trying to delete slab until you've finished making slub perform as well
2008-08-14 19:44 ` Christoph Lameter
@ 2008-08-15 16:44 ` KOSAKI Motohiro
2008-08-15 18:24 ` Christoph Lameter
0 siblings, 1 reply; 64+ messages in thread
From: KOSAKI Motohiro @ 2008-08-15 16:44 UTC (permalink / raw)
To: Christoph Lameter
Cc: Matthew Wilcox, Pekka Enberg, akpm, linux-kernel, linux-fsdevel,
Mel Gorman, andi, Rik van Riel
> This is a NUMA system, right?
True.
My system is:
CPU: ia64 x8
MEM: 8G (4G x 2node)
> Then we have another mechanism that will avoid
> off node memory references by allocating new slabs. Can you set the
> node_defrag parameter to 0? (Noted by Adrian).
Please let me know how to do that.
^ permalink raw reply [flat|nested] 64+ messages in thread
* Re: No, really, stop trying to delete slab until you've finished making slub perform as well
2008-08-15 16:44 ` KOSAKI Motohiro
@ 2008-08-15 18:24 ` Christoph Lameter
2008-08-15 19:42 ` Christoph Lameter
0 siblings, 1 reply; 64+ messages in thread
From: Christoph Lameter @ 2008-08-15 18:24 UTC (permalink / raw)
To: KOSAKI Motohiro
Cc: Matthew Wilcox, Pekka Enberg, akpm, linux-kernel, linux-fsdevel,
Mel Gorman, andi, Rik van Riel
KOSAKI Motohiro wrote:
>> Then we have another mechanism that will avoid
>> off node memory references by allocating new slabs. Can you set the
>> node_defrag parameter to 0? (Noted by Adrian).
>
> Please let me know how to do that.
The control over the preference for node-local vs. remote defrag occurs
via /sys/kernel/slab/<slabcache>/remote_node_defrag_ratio. The default is 10%.
Comments in get_any_partial() explain the operations.
The default setting means that in 9 out of 10 cases SLUB will prefer creating
a new slab over taking one from the remote node (so the memory is node
local, probably not important in your 2-node case). It will therefore waste
memory, because local memory may be more efficient to use.
Setting remote_node_defrag_ratio to 100 will make slub always take the remote
slab instead of allocating a new one.
/*
* The defrag ratio allows a configuration of the tradeoffs between
* inter node defragmentation and node local allocations. A lower
* defrag_ratio increases the tendency to do local allocations
* instead of attempting to obtain partial slabs from other nodes.
*
* If the defrag_ratio is set to 0 then kmalloc() always
* returns node local objects. If the ratio is higher then kmalloc()
* may return off node objects because partial slabs are obtained
* from other nodes and filled up.
*
* If /sys/kernel/slab/xx/defrag_ratio is set to 100 (which makes
* defrag_ratio = 1000) then every (well almost) allocation will
* first attempt to defrag slab caches on other nodes. This means
* scanning over all nodes to look for partial slabs which may be
* expensive if we do it every time we are trying to find a slab
* with available objects.
*/
if (!s->remote_node_defrag_ratio ||
get_cycles() % 1024 > s->remote_node_defrag_ratio)
return NULL;
^ permalink raw reply [flat|nested] 64+ messages in thread
* Re: No, really, stop trying to delete slab until you've finished making slub perform as well
2008-08-15 18:24 ` Christoph Lameter
@ 2008-08-15 19:42 ` Christoph Lameter
2008-08-18 10:08 ` KOSAKI Motohiro
0 siblings, 1 reply; 64+ messages in thread
From: Christoph Lameter @ 2008-08-15 19:42 UTC (permalink / raw)
To: KOSAKI Motohiro
Cc: Matthew Wilcox, Pekka Enberg, akpm, linux-kernel, linux-fsdevel,
Mel Gorman, andi, Rik van Riel
Christoph Lameter wrote:
> Setting remote_node_defrag_ratio to 100 will make slub always take the remote
> slab instead of allocating a new one.
As pointed out by Adrian D. off list:
The max remote_node_defrag_ratio is 99.
Maybe we need to change the comparison in remote_node_defrag_ratio_store() to
allow 100 to switch off any node local allocs?
^ permalink raw reply [flat|nested] 64+ messages in thread
* Re: No, really, stop trying to delete slab until you've finished making slub perform as well
2008-08-15 19:42 ` Christoph Lameter
@ 2008-08-18 10:08 ` KOSAKI Motohiro
2008-08-18 10:34 ` KOSAKI Motohiro
0 siblings, 1 reply; 64+ messages in thread
From: KOSAKI Motohiro @ 2008-08-18 10:08 UTC (permalink / raw)
To: Christoph Lameter
Cc: kosaki.motohiro, Matthew Wilcox, Pekka Enberg, akpm, linux-kernel,
linux-fsdevel, Mel Gorman, andi, Rik van Riel
> Christoph Lameter wrote:
>
> > Setting remote_node_defrag_ratio to 100 will make slub always take the remote
> > slab instead of allocating a new one.
>
> As pointed out by Adrian D. off list:
>
> The max remote_node_defrag_ratio is 99.
>
> Maybe we need to change the comparison in remote_node_defrag_ratio_store() to
> allow 100 to switch off any node local allocs?
Hmmm,
it doesn't change the behavior.
Here is what I did:
1. slub code change (see below)
Index: b/mm/slub.c
===================================================================
--- a/mm/slub.c
+++ b/mm/slub.c
@@ -4056,7 +4056,7 @@ static ssize_t remote_node_defrag_ratio_
if (err)
return err;
- if (ratio < 100)
+ if (ratio <= 100)
s->remote_node_defrag_ratio = ratio * 10;
return length;
2. change remote defrag ratio
# echo 100 > /sys/kernel/slab/:t-0000128/remote_node_defrag_ratio
# cat /sys/kernel/slab/:t-0000128/remote_node_defrag_ratio
100
3. ran hackbench
4. ./slabinfo
Name Objects Objsize Space Slabs/Part/Cpu O/S O %Fr %Ef Flg
:at-0000016 4096 16 65.5K 0/0/1 4096 0 0 100 *a
:at-0000024 21840 24 524.2K 0/0/8 2730 0 0 99 *a
:at-0000032 2048 32 65.5K 0/0/1 2048 0 0 100 *Aa
:at-0000088 4464 88 393.2K 0/0/6 744 0 0 99 *a
:at-0000096 5456 96 524.2K 0/0/8 682 0 0 99 *a
:t-0000016 32768 16 524.2K 0/0/8 4096 0 0 100 *
:t-0000024 21840 24 524.2K 0/0/8 2730 0 0 99 *
:t-0000032 34806 32 1.1M 9/1/8 2048 0 5 99 *
:t-0000040 14417 40 917.5K 6/6/8 1638 0 42 62 *
:t-0000048 5460 48 262.1K 0/0/4 1365 0 0 99 *
:t-0000064 10224 64 655.3K 2/1/8 1024 0 10 99 *
:t-0000072 29120 72 2.0M 26/0/6 910 0 0 99 *
:t-0000080 16376 80 1.3M 12/1/8 819 0 5 99 *
:t-0000096 5456 96 524.2K 0/0/8 682 0 0 99 *
:t-0000128 28917 128 1.3G 21041/21041/8 512 0 99 0 *
:t-0000256 15280 256 31.4M 472/436/8 256 0 90 12 *
:t-0000384 1360 352 524.2K 0/0/8 170 0 0 91 *A
:t-0000512 2388 512 1.3M 12/4/8 128 0 20 93 *
:t-0000768 851 768 851.9K 5/5/8 85 0 38 76 *A
:t-0000896 742 880 851.9K 5/4/8 73 0 30 76 *A
:t-0001024 1819 1024 15.1M 223/211/8 64 0 91 12 *
:t-0002048 2641 2048 17.9M 129/116/8 64 1 84 30 *
:t-0004096 817 4096 57.1M 210/210/8 64 2 96 5 *
anon_vma 10920 40 524.2K 0/0/8 1365 0 0 83
bdev_cache 256 1008 262.1K 0/0/4 64 0 0 98 Aa
blkdev_queue 140 1864 262.1K 0/0/2 70 1 0 99
blkdev_requests 1720 304 524.2K 0/0/8 215 0 0 99
buffer_head 7284 104 2.5M 31/30/8 585 0 76 29 a
cfq_io_context 3120 168 524.2K 0/0/8 390 0 0 99
cfq_queue 3848 136 524.2K 0/0/8 481 0 0 99
dentry 3775 224 2.5M 31/29/8 292 0 74 33 a
ext3_inode_cache 740 1016 2.4M 30/30/8 64 0 78 30 a
fat_inode_cache 77 840 65.5K 0/0/1 77 0 0 98 a
file_lock_cache 2616 192 524.2K 0/0/8 327 0 0 95
hugetlbfs_inode_cache 83 776 65.5K 0/0/1 83 0 0 98
idr_layer_cache 944 544 524.2K 0/0/8 118 0 0 97
inode_cache 1050 744 851.9K 5/1/8 87 0 7 91 a
kmalloc-16384 160 16384 2.6M 0/0/5 32 3 0 100
kmalloc-192 4578 192 87.5M 1328/1328/8 341 0 99 1
kmalloc-32768 128 32768 4.1M 0/0/8 16 3 0 100
kmalloc-65536 32 65536 2.0M 0/0/8 4 2 0 100
kmalloc-8 65536 8 524.2K 0/0/8 8192 0 0 100
kmalloc-8192 512 8192 4.1M 0/0/8 64 3 0 100
kmem_cache_node 3276 80 262.1K 0/0/4 819 0 0 99 *
mqueue_inode_cache 56 1064 65.5K 0/0/1 56 0 0 90 A
numa_policy 248 264 65.5K 0/0/1 248 0 0 99
proc_inode_cache 655 792 720.8K 3/3/8 81 0 27 71 a
radix_tree_node 1142 552 917.5K 6/6/8 117 0 42 68 a
shmem_inode_cache 1230 1000 1.3M 12/3/8 65 0 15 93
sighand_cache 434 1608 917.5K 6/4/8 39 0 28 76 A
sigqueue 3272 160 524.2K 0/0/8 409 0 0 99
sock_inode_cache 774 832 851.9K 5/3/8 73 0 23 75 Aa
TCP 144 1712 262.1K 0/0/4 36 0 0 94 A
vm_area_struct 4034 176 851.9K 5/5/8 372 0 38 83
^ permalink raw reply [flat|nested] 64+ messages in thread
* Re: No, really, stop trying to delete slab until you've finished making slub perform as well
2008-08-18 10:08 ` KOSAKI Motohiro
@ 2008-08-18 10:34 ` KOSAKI Motohiro
2008-08-18 14:08 ` Christoph Lameter
0 siblings, 1 reply; 64+ messages in thread
From: KOSAKI Motohiro @ 2008-08-18 10:34 UTC (permalink / raw)
To: Christoph Lameter
Cc: kosaki.motohiro, Matthew Wilcox, Pekka Enberg, akpm, linux-kernel,
linux-fsdevel, Mel Gorman, andi, Rik van Riel
> > Christoph Lameter wrote:
> >
> > > Setting remote_node_defrag_ratio to 100 will make slub always take the remote
> > > slab instead of allocating a new one.
> >
> > As pointed out by Adrian D. off list:
> >
> > The max remote_node_defrag_ratio is 99.
> >
> > Maybe we need to change the comparison in remote_node_defrag_ratio_store() to
> > allow 100 to switch off any node local allocs?
>
> Hmmm,
> it doesn't change the behavior at all.
Ah, OK.
I made a mistake.
The new patch is below.
Index: b/mm/slub.c
===================================================================
--- a/mm/slub.c
+++ b/mm/slub.c
@@ -1326,9 +1326,11 @@ static struct page *get_any_partial(stru
* expensive if we do it every time we are trying to find a slab
* with available objects.
*/
+#if 0
if (!s->remote_node_defrag_ratio ||
get_cycles() % 1024 > s->remote_node_defrag_ratio)
return NULL;
+#endif
zonelist = node_zonelist(slab_node(current->mempolicy), flags);
for_each_zone_zonelist(zone, z, zonelist, high_zoneidx) {
The new results are below.
% cat /proc/meminfo
MemTotal: 7701504 kB
MemFree: 5986432 kB
Buffers: 7872 kB
Cached: 38208 kB
SwapCached: 0 kB
Active: 120256 kB
Inactive: 14656 kB
Active(anon): 90304 kB
Inactive(anon): 0 kB
Active(file): 29952 kB
Inactive(file): 14656 kB
Unevictable: 0 kB
Mlocked: 0 kB
SwapTotal: 2031488 kB
SwapFree: 2031488 kB
Dirty: 448 kB
Writeback: 0 kB
AnonPages: 89088 kB
Mapped: 31360 kB
Slab: 69952 kB
SReclaimable: 13376 kB
SUnreclaim: 56576 kB
PageTables: 11648 kB
NFS_Unstable: 0 kB
Bounce: 0 kB
WritebackTmp: 0 kB
CommitLimit: 5882240 kB
Committed_AS: 453440 kB
VmallocTotal: 17592177655808 kB
VmallocUsed: 29312 kB
VmallocChunk: 17592177626112 kB
HugePages_Total: 0
HugePages_Free: 0
HugePages_Rsvd: 0
HugePages_Surp: 0
Hugepagesize: 262144 kB
% slabinfo
Name Objects Objsize Space Slabs/Part/Cpu O/S O %Fr %Ef Flg
:at-0000016 4096 16 65.5K 0/0/1 4096 0 0 100 *a
:at-0000024 21840 24 524.2K 0/0/8 2730 0 0 99 *a
:at-0000032 2048 32 65.5K 0/0/1 2048 0 0 100 *Aa
:at-0000088 2976 88 262.1K 0/0/4 744 0 0 99 *a
:at-0000096 4774 96 458.7K 0/0/7 682 0 0 99 *a
:t-0000016 32768 16 524.2K 0/0/8 4096 0 0 100 *
:t-0000024 21840 24 524.2K 0/0/8 2730 0 0 99 *
:t-0000032 34806 32 1.1M 9/1/8 2048 0 5 99 *
:t-0000040 14279 40 851.9K 5/5/8 1638 0 38 67 *
:t-0000048 5460 48 262.1K 0/0/4 1365 0 0 99 *
:t-0000064 10224 64 655.3K 2/1/8 1024 0 10 99 *
:t-0000072 29109 72 2.0M 26/4/6 910 0 12 99 *
:t-0000080 16379 80 1.3M 12/1/8 819 0 5 99 *
:t-0000096 5456 96 524.2K 0/0/8 682 0 0 99 *
:t-0000128 27831 128 3.6M 48/8/8 512 0 14 97 *
:t-0000256 15401 256 9.8M 143/96/8 256 0 63 39 *
:t-0000384 1360 352 524.2K 0/0/8 170 0 0 91 *A
:t-0000512 2307 512 1.2M 11/3/8 128 0 15 94 *
:t-0000768 755 768 720.8K 3/3/8 85 0 27 80 *A
:t-0000896 728 880 851.9K 5/4/8 73 0 30 75 *A
:t-0001024 1810 1024 1.9M 21/4/8 64 0 13 97 *
:t-0002048 2621 2048 5.5M 34/15/8 64 1 35 97 *
:t-0004096 775 4096 3.4M 5/2/8 64 2 15 93 *
anon_vma 10920 40 524.2K 0/0/8 1365 0 0 83
bdev_cache 192 1008 196.6K 0/0/3 64 0 0 98 Aa
blkdev_queue 140 1864 262.1K 0/0/2 70 1 0 99
blkdev_requests 1720 304 524.2K 0/0/8 215 0 0 99
buffer_head 8020 104 2.7M 34/32/8 585 0 76 30 a
cfq_io_context 3120 168 524.2K 0/0/8 390 0 0 99
cfq_queue 3848 136 524.2K 0/0/8 481 0 0 99
dentry 3798 224 2.5M 31/30/8 292 0 76 33 a
ext3_inode_cache 1127 1016 2.7M 34/34/8 64 0 80 41 a
fat_inode_cache 77 840 65.5K 0/0/1 77 0 0 98 a
file_lock_cache 2289 192 458.7K 0/0/7 327 0 0 95
hugetlbfs_inode_cache 83 776 65.5K 0/0/1 83 0 0 98
idr_layer_cache 944 544 524.2K 0/0/8 118 0 0 97
inode_cache 1044 744 786.4K 4/0/8 87 0 0 98 a
kmalloc-16384 160 16384 2.6M 0/0/5 32 3 0 100
kmalloc-192 3883 192 1.0M 8/8/8 341 0 50 71
kmalloc-32768 128 32768 4.1M 0/0/8 16 3 0 100
kmalloc-65536 32 65536 2.0M 0/0/8 4 2 0 100
kmalloc-8 65536 8 524.2K 0/0/8 8192 0 0 100
kmalloc-8192 512 8192 4.1M 0/0/8 64 3 0 100
kmem_cache_node 3276 80 262.1K 0/0/4 819 0 0 99 *
mqueue_inode_cache 56 1064 65.5K 0/0/1 56 0 0 90 A
numa_policy 248 264 65.5K 0/0/1 248 0 0 99
proc_inode_cache 653 792 655.3K 2/2/8 81 0 20 78 a
radix_tree_node 1221 552 983.0K 7/7/8 117 0 46 68 a
shmem_inode_cache 1218 1000 1.3M 12/3/8 65 0 15 92
sighand_cache 416 1608 851.9K 5/3/8 39 0 23 78 A
sigqueue 3272 160 524.2K 0/0/8 409 0 0 99
sock_inode_cache 758 832 786.4K 4/3/8 73 0 25 80 Aa
TCP 180 1712 327.6K 0/0/5 36 0 0 94 A
vm_area_struct 4054 176 851.9K 5/5/8 372 0 38 83
* Re: No, really, stop trying to delete slab until you've finished making slub perform as well
2008-08-18 10:34 ` KOSAKI Motohiro
@ 2008-08-18 14:08 ` Christoph Lameter
2008-08-19 10:34 ` KOSAKI Motohiro
0 siblings, 1 reply; 64+ messages in thread
From: Christoph Lameter @ 2008-08-18 14:08 UTC (permalink / raw)
To: KOSAKI Motohiro
Cc: Matthew Wilcox, Pekka Enberg, akpm, linux-kernel, linux-fsdevel,
Mel Gorman, andi, Rik van Riel
KOSAKI Motohiro wrote:
> The new patch is below.
>
> Index: b/mm/slub.c
> ===================================================================
> --- a/mm/slub.c
> +++ b/mm/slub.c
> @@ -1326,9 +1326,11 @@ static struct page *get_any_partial(stru
> * expensive if we do it every time we are trying to find a slab
> * with available objects.
> */
> +#if 0
> if (!s->remote_node_defrag_ratio ||
> get_cycles() % 1024 > s->remote_node_defrag_ratio)
> return NULL;
> +#endif
>
> zonelist = node_zonelist(slab_node(current->mempolicy), flags);
> for_each_zone_zonelist(zone, z, zonelist, high_zoneidx) {
Hmmm... so always taking from the partial lists works? That is the same effect
that setting remote_node_defrag_ratio to 100 should have had (it is multiplied
by 10 when storing it).
So it's a NUMA-only phenomenon. How is performance affected?
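For illustration, the arithmetic behind this equivalence can be sketched in a few
lines of user-space C. This is only a sketch of the throttle quoted above, not
kernel code: rand() stands in for get_cycles(), and the helper names below are
made up. It shows that a sysfs write of 100 (stored internally as 1000) lets
almost every get_any_partial() call go on to scan remote partial lists, which is
effectively what the #if 0 patch does unconditionally.

/*
 * Standalone sketch (not kernel code): approximate how often the
 * get_any_partial() throttle would allow a remote partial list scan
 * for a given remote_node_defrag_ratio sysfs setting.  The kernel
 * bails out when get_cycles() % 1024 exceeds the stored ratio;
 * rand() is used here as a stand-in for get_cycles().
 */
#include <stdio.h>
#include <stdlib.h>

static unsigned int store_ratio(unsigned int sysfs_value)
{
	/* Mirrors remote_node_defrag_ratio_store(): 0..100 -> 0..1000. */
	return sysfs_value * 10;
}

static double remote_scan_fraction(unsigned int stored_ratio)
{
	unsigned long hits = 0, trials = 1000000, i;

	for (i = 0; i < trials; i++)
		if (stored_ratio && (unsigned int)(rand() % 1024) <= stored_ratio)
			hits++;		/* would fall through to the zonelist scan */
	return (double)hits / trials;
}

int main(void)
{
	unsigned int sysfs_values[] = { 0, 10, 50, 99, 100 };
	unsigned int i;

	for (i = 0; i < sizeof(sysfs_values) / sizeof(sysfs_values[0]); i++) {
		unsigned int stored = store_ratio(sysfs_values[i]);

		printf("sysfs %3u -> internal %4u -> remote partial scan ~%4.1f%%\n",
		       sysfs_values[i], stored,
		       100.0 * remote_scan_fraction(stored));
	}
	return 0;
}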
* Re: No, really, stop trying to delete slab until you've finished making slub perform as well
2008-08-18 14:08 ` Christoph Lameter
@ 2008-08-19 10:34 ` KOSAKI Motohiro
2008-08-19 13:51 ` Christoph Lameter
0 siblings, 1 reply; 64+ messages in thread
From: KOSAKI Motohiro @ 2008-08-19 10:34 UTC (permalink / raw)
To: Christoph Lameter
Cc: kosaki.motohiro, Matthew Wilcox, Pekka Enberg, akpm, linux-kernel,
linux-fsdevel, Mel Gorman, andi, Rik van Riel
> > +#if 0
> > if (!s->remote_node_defrag_ratio ||
> > get_cycles() % 1024 > s->remote_node_defrag_ratio)
> > return NULL;
> > +#endif
> >
> > zonelist = node_zonelist(slab_node(current->mempolicy), flags);
> > for_each_zone_zonelist(zone, z, zonelist, high_zoneidx) {
>
> Hmmm... so always taking from the partial lists works? That is the same effect
> that setting remote_node_defrag_ratio to 100 should have had (it is multiplied
> by 10 when storing it).
Sorry, I don't know the reason yet.
OK, I'll dig into it more.
> So it's a NUMA-only phenomenon. How is performance affected?
Unfortunately, I can't measure it, because:
- Fujitsu servers can access remote nodes faster than a typical NUMA server,
  so my performance numbers often aren't representative.
- My box (4GB x 2 nodes) is a very small NUMA machine, while this mechanism
  mainly benefits large servers.
IOW, my box didn't show a performance regression, but I don't think that is
typical.
* Re: No, really, stop trying to delete slab until you've finished making slub perform as well
2008-08-19 10:34 ` KOSAKI Motohiro
@ 2008-08-19 13:51 ` Christoph Lameter
2008-08-20 11:46 ` KOSAKI Motohiro
0 siblings, 1 reply; 64+ messages in thread
From: Christoph Lameter @ 2008-08-19 13:51 UTC (permalink / raw)
To: KOSAKI Motohiro
Cc: Matthew Wilcox, Pekka Enberg, akpm, linux-kernel, linux-fsdevel,
Mel Gorman, andi, Rik van Riel
KOSAKI Motohiro wrote:
> IOW, my box didn't show a performance regression, but I don't think that is
> typical.
Well, that is typical for a small NUMA system. Maybe this patch will fix it
for now? Large systems can be tuned by setting the ratio lower.
Subject: slub/NUMA: Disable remote node defragmentation by default
Switch remote node defragmentation off by default. The current settings can
cause excessive node local allocations with hackbench. (Note that this feature
is not related to slab defragmentation).
Signed-off-by: Christoph Lameter <cl@linux-foundation.org>
---
mm/slub.c | 4 ++--
1 file changed, 2 insertions(+), 2 deletions(-)
Index: linux-2.6/mm/slub.c
===================================================================
--- linux-2.6.orig/mm/slub.c 2008-08-19 06:45:54.732348449 -0700
+++ linux-2.6/mm/slub.c 2008-08-19 06:46:12.442348249 -0700
@@ -2312,7 +2312,7 @@ static int kmem_cache_open(struct kmem_c
s->refcount = 1;
#ifdef CONFIG_NUMA
- s->remote_node_defrag_ratio = 100;
+ s->remote_node_defrag_ratio = 1000;
#endif
if (!init_kmem_cache_nodes(s, gfpflags & ~SLUB_DMA))
goto error;
@@ -4058,7 +4058,7 @@ static ssize_t remote_node_defrag_ratio_
if (err)
return err;
- if (ratio < 100)
+ if (ratio <= 100)
s->remote_node_defrag_ratio = ratio * 10;
return length;
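To make the combined effect of the two hunks concrete, the following is a
minimal user-space model, not kernel code; struct toy_cache and the toy_*
functions are made-up stand-ins. After cache creation the internal ratio now
defaults to 1000, and a sysfs write of 0..100 stores ten times that value, so
writing 100 merely restores the default while smaller values make a large
system fall back to node-local slab allocation more often.

/* Toy model of the patched defaults -- made-up names, not kernel code. */
#include <stdio.h>

struct toy_cache {
	unsigned int remote_node_defrag_ratio;	/* internal value, 0..1000 */
};

static void toy_cache_open(struct toy_cache *s)
{
	/* Patched kmem_cache_open() default: was 100, now 1000. */
	s->remote_node_defrag_ratio = 1000;
}

static void toy_store_ratio(struct toy_cache *s, unsigned int sysfs_value)
{
	/* Patched store: "<= 100" now accepts 100; larger writes are ignored. */
	if (sysfs_value <= 100)
		s->remote_node_defrag_ratio = sysfs_value * 10;
}

int main(void)
{
	struct toy_cache s;

	toy_cache_open(&s);
	printf("default internal ratio: %u\n", s.remote_node_defrag_ratio);

	toy_store_ratio(&s, 100);	/* echo 100 > .../remote_node_defrag_ratio */
	printf("after writing 100:      %u\n", s.remote_node_defrag_ratio);

	toy_store_ratio(&s, 10);	/* a large system choosing a lower ratio */
	printf("after writing 10:       %u\n", s.remote_node_defrag_ratio);
	return 0;
}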
* Re: No, really, stop trying to delete slab until you've finished making slub perform as well
2008-08-19 13:51 ` Christoph Lameter
@ 2008-08-20 11:46 ` KOSAKI Motohiro
0 siblings, 0 replies; 64+ messages in thread
From: KOSAKI Motohiro @ 2008-08-20 11:46 UTC (permalink / raw)
To: Christoph Lameter
Cc: kosaki.motohiro, Matthew Wilcox, Pekka Enberg, akpm, linux-kernel,
linux-fsdevel, Mel Gorman, andi, Rik van Riel
> KOSAKI Motohiro wrote:
>
> > IOW, my box didn't show a performance regression, but I don't think that is
> > typical.
>
> Well, that is typical for a small NUMA system. Maybe this patch will fix it
> for now? Large systems can be tuned by setting the ratio lower.
>
>
> Subject: slub/NUMA: Disable remote node defragmentation by default
>
> Switch remote node defragmentation off by default. The current settings can
> cause excessive node local allocations with hackbench. (Note that this feature
> is not related to slab defragmentation).
OK.
I confirmed this patch works well.
Tested-by: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
>
> Signed-off-by: Christoph Lameter <cl@linux-foundation.org>
>
> ---
> mm/slub.c | 4 ++--
> 1 file changed, 2 insertions(+), 2 deletions(-)
>
> Index: linux-2.6/mm/slub.c
> ===================================================================
> --- linux-2.6.orig/mm/slub.c 2008-08-19 06:45:54.732348449 -0700
> +++ linux-2.6/mm/slub.c 2008-08-19 06:46:12.442348249 -0700
> @@ -2312,7 +2312,7 @@ static int kmem_cache_open(struct kmem_c
>
> s->refcount = 1;
> #ifdef CONFIG_NUMA
> - s->remote_node_defrag_ratio = 100;
> + s->remote_node_defrag_ratio = 1000;
> #endif
> if (!init_kmem_cache_nodes(s, gfpflags & ~SLUB_DMA))
> goto error;
> @@ -4058,7 +4058,7 @@ static ssize_t remote_node_defrag_ratio_
> if (err)
> return err;
>
> - if (ratio < 100)
> + if (ratio <= 100)
> s->remote_node_defrag_ratio = ratio * 10;
>
> return length;
end of thread, newest message: 2008-08-20 11:47 UTC
Thread overview: 64+ messages
2008-05-10 2:21 [patch 00/19] Slab Fragmentation Reduction V13 Christoph Lameter
2008-05-10 2:21 ` [patch 01/19] slub: Add defrag_ratio field and sysfs support Christoph Lameter
2008-05-10 2:21 ` [patch 02/19] slub: Replace ctor field with ops field in /sys/slab/* Christoph Lameter
2008-05-10 2:21 ` [patch 03/19] slub: Add get() and kick() methods Christoph Lameter
2008-05-10 2:21 ` [patch 04/19] slub: Sort slab cache list and establish maximum objects for defrag slabs Christoph Lameter
2008-05-10 2:21 ` [patch 05/19] slub: Slab defrag core Christoph Lameter
2008-05-10 2:21 ` [patch 06/19] slub: Add KICKABLE to avoid repeated kick() attempts Christoph Lameter
2008-05-10 2:21 ` [patch 07/19] slub: Extend slabinfo to support -D and -F options Christoph Lameter
2008-05-10 2:21 ` [patch 08/19] slub/slabinfo: add defrag statistics Christoph Lameter
2008-05-10 2:21 ` [patch 09/19] slub: Trigger defragmentation from memory reclaim Christoph Lameter
2008-05-10 2:21 ` [patch 10/19] buffer heads: Support slab defrag Christoph Lameter
2008-05-10 2:21 ` [patch 11/19] inodes: Support generic defragmentation Christoph Lameter
2008-05-10 2:21 ` [patch 12/19] Filesystem: Ext2 filesystem defrag Christoph Lameter
2008-05-10 2:21 ` [patch 13/19] Filesystem: Ext3 " Christoph Lameter
2008-05-10 2:21 ` [patch 14/19] Filesystem: Ext4 " Christoph Lameter
2008-08-03 1:54 ` Theodore Tso
2008-08-13 7:26 ` Pekka Enberg
2008-05-10 2:21 ` [patch 15/19] Filesystem: XFS slab defragmentation Christoph Lameter
2008-08-03 1:42 ` Dave Chinner
2008-08-04 13:36 ` Christoph Lameter
2008-05-10 2:21 ` [patch 16/19] Filesystem: /proc filesystem support for slab defrag Christoph Lameter
2008-05-10 2:21 ` [patch 17/19] Filesystem: Slab defrag: Reiserfs support Christoph Lameter
2008-05-10 2:21 ` [patch 18/19] dentries: Add constructor Christoph Lameter
2008-05-10 2:21 ` [patch 19/19] dentries: dentry defragmentation Christoph Lameter
2008-08-03 1:58 ` No, really, stop trying to delete slab until you've finished making slub perform as well Matthew Wilcox
2008-08-03 21:25 ` Pekka Enberg
2008-08-04 2:37 ` Rene Herman
2008-08-04 21:22 ` Pekka Enberg
2008-08-04 21:41 ` Christoph Lameter
2008-08-04 23:09 ` Rene Herman
2008-08-04 13:43 ` Christoph Lameter
2008-08-04 14:48 ` Jamie Lokier
2008-08-04 15:21 ` Jamie Lokier
2008-08-04 16:35 ` Christoph Lameter
2008-08-04 15:11 ` Rik van Riel
2008-08-04 16:02 ` Christoph Lameter
2008-08-04 16:47 ` KOSAKI Motohiro
2008-08-04 17:13 ` Christoph Lameter
2008-08-04 17:20 ` Pekka Enberg
2008-08-05 12:06 ` KOSAKI Motohiro
2008-08-05 14:59 ` Christoph Lameter
2008-08-06 12:36 ` KOSAKI Motohiro
2008-08-06 14:24 ` Christoph Lameter
2008-08-13 10:46 ` KOSAKI Motohiro
2008-08-13 13:10 ` Christoph Lameter
2008-08-13 14:14 ` KOSAKI Motohiro
2008-08-13 14:16 ` Pekka Enberg
2008-08-13 14:31 ` Christoph Lameter
2008-08-13 15:05 ` KOSAKI Motohiro
2008-08-14 19:44 ` Christoph Lameter
2008-08-15 16:44 ` KOSAKI Motohiro
2008-08-15 18:24 ` Christoph Lameter
2008-08-15 19:42 ` Christoph Lameter
2008-08-18 10:08 ` KOSAKI Motohiro
2008-08-18 10:34 ` KOSAKI Motohiro
2008-08-18 14:08 ` Christoph Lameter
2008-08-19 10:34 ` KOSAKI Motohiro
2008-08-19 13:51 ` Christoph Lameter
2008-08-20 11:46 ` KOSAKI Motohiro
2008-08-14 7:15 ` Pekka Enberg
2008-08-14 14:45 ` Christoph Lameter
2008-08-14 15:06 ` Christoph Lameter
2008-08-04 17:19 ` Christoph Lameter
-- strict thread matches above, loose matches on Subject: below --
2008-08-11 15:06 [patch 00/19] Slab Fragmentation Reduction V14 Christoph Lameter
2008-08-11 15:06 ` [patch 11/19] inodes: Support generic defragmentation Christoph Lameter