* slub: Add defrag_ratio field and sysfs support.
2010-01-29 20:49 Slab Fragmentation Reduction V15 Christoph Lameter
@ 2010-01-29 20:49 ` Christoph Lameter
2010-01-29 20:49 ` slub: Replace ctor field with ops field in /sys/slab/* Christoph Lameter
` (18 subsequent siblings)
19 siblings, 0 replies; 56+ messages in thread
From: Christoph Lameter @ 2010-01-29 20:49 UTC (permalink / raw)
To: Andi Kleen
Cc: Dave Chinner, Christoph Lameter, Pekka Enberg, Rik van Riel, akpm,
Miklos Szeredi, Nick Piggin, Hugh Dickins, linux-kernel
[-- Attachment #1: slub_add_defrag_ratio --]
[-- Type: text/plain, Size: 3945 bytes --]
The defrag_ratio sets the threshold at which defragmentation
should be attempted on a slab page.
The allocation ratio is measured as the percentage of the available
object slots that are in use.
Add a defrag_ratio field and set it to 30% by default. A limit of 30% specifies
that slab defragmentation only runs if less than 3 out of 10 available object
slots are in use.
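The check this enables later in the series is plain integer arithmetic: a slab page is a candidate when `inuse * 100 < defrag_ratio * objects`. A minimal userspace sketch of that comparison (the helper name `slab_needs_defrag` is illustrative, not from the patch):

```c
/* Returns nonzero when the percentage of in-use object slots is
 * below the cache's defrag_ratio, i.e. the slab is sparse enough
 * that a defragmentation attempt is worthwhile. */
static int slab_needs_defrag(int inuse, int objects, int defrag_ratio)
{
	/* same integer arithmetic the patch uses; no floating point */
	return inuse * 100 < defrag_ratio * objects;
}
```

With the default ratio of 30, a slab holding 3 of 10 objects is not a candidate (exactly at the threshold), while one holding 2 of 10 is.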
Reviewed-by: Rik van Riel <riel@redhat.com>
Signed-off-by: Christoph Lameter <clameter@sgi.com>
Signed-off-by: Pekka Enberg <penberg@cs.helsinki.fi>
Signed-off-by: Christoph Lameter <cl@linux-foundation.org>
---
Documentation/ABI/testing/sysfs-kernel-slab | 13 +++++++++++++
include/linux/slub_def.h | 6 ++++++
mm/slub.c | 23 +++++++++++++++++++++++
3 files changed, 42 insertions(+)
Index: linux-2.6/include/linux/slub_def.h
===================================================================
--- linux-2.6.orig/include/linux/slub_def.h 2010-01-29 10:37:01.000000000 -0600
+++ linux-2.6/include/linux/slub_def.h 2010-01-29 10:42:43.000000000 -0600
@@ -91,6 +91,12 @@ struct kmem_cache {
int inuse; /* Offset to metadata */
int align; /* Alignment */
unsigned long min_partial;
+ int defrag_ratio; /*
+ * Ratio used to check the percentage of
+ * objects allocated in a slab page.
+ * If less than this ratio is allocated
+ * then reclaim attempts are made.
+ */
const char *name; /* Name (only for display!) */
struct list_head list; /* List of slab caches */
#ifdef CONFIG_SLUB_DEBUG
Index: linux-2.6/mm/slub.c
===================================================================
--- linux-2.6.orig/mm/slub.c 2010-01-29 10:37:01.000000000 -0600
+++ linux-2.6/mm/slub.c 2010-01-29 10:42:44.000000000 -0600
@@ -2494,6 +2494,7 @@ static int kmem_cache_open(struct kmem_c
*/
set_min_partial(s, ilog2(s->size));
s->refcount = 1;
+ s->defrag_ratio = 30;
#ifdef CONFIG_NUMA
s->remote_node_defrag_ratio = 1000;
#endif
@@ -4317,6 +4318,27 @@ static ssize_t free_calls_show(struct km
}
SLAB_ATTR_RO(free_calls);
+static ssize_t defrag_ratio_show(struct kmem_cache *s, char *buf)
+{
+ return sprintf(buf, "%d\n", s->defrag_ratio);
+}
+
+static ssize_t defrag_ratio_store(struct kmem_cache *s,
+ const char *buf, size_t length)
+{
+ unsigned long ratio;
+ int err;
+
+ err = strict_strtoul(buf, 10, &ratio);
+ if (err)
+ return err;
+
+ if (ratio < 100)
+ s->defrag_ratio = ratio;
+ return length;
+}
+SLAB_ATTR(defrag_ratio);
+
#ifdef CONFIG_NUMA
static ssize_t remote_node_defrag_ratio_show(struct kmem_cache *s, char *buf)
{
@@ -4441,6 +4463,7 @@ static struct attribute *slab_attrs[] =
&shrink_attr.attr,
&alloc_calls_attr.attr,
&free_calls_attr.attr,
+ &defrag_ratio_attr.attr,
#ifdef CONFIG_ZONE_DMA
&cache_dma_attr.attr,
#endif
Index: linux-2.6/Documentation/ABI/testing/sysfs-kernel-slab
===================================================================
--- linux-2.6.orig/Documentation/ABI/testing/sysfs-kernel-slab 2010-01-29 10:43:21.000000000 -0600
+++ linux-2.6/Documentation/ABI/testing/sysfs-kernel-slab 2010-01-29 10:47:19.000000000 -0600
@@ -180,6 +180,19 @@ Description:
list. It can be written to clear the current count.
Available when CONFIG_SLUB_STATS is enabled.
+What: /sys/kernel/slab/cache/defrag_ratio
+Date: February 2010
+KernelVersion: 2.6.34
+Contact: Christoph Lameter <cl@linux-foundation.org>
+ Pekka Enberg <penberg@cs.helsinki.fi>,
+Description:
+ The defrag_ratio file allows control of how aggressively
+ slab fragmentation reduction works at reclaiming objects from
+ sparsely populated slabs. This is a percentage. If a slab
+ contains less than this percentage of objects then reclaim
+ will attempt to reclaim objects so that the whole slab
+ page can be freed. The default is 30%.
+
What: /sys/kernel/slab/cache/deactivate_to_tail
Date: February 2008
KernelVersion: 2.6.25
--
* slub: Replace ctor field with ops field in /sys/slab/*
From: Christoph Lameter @ 2010-01-29 20:49 UTC (permalink / raw)
To: Andi Kleen
Cc: Dave Chinner, Christoph Lameter, Pekka Enberg, Rik van Riel, akpm,
Miklos Szeredi, Nick Piggin, Hugh Dickins, linux-kernel
[-- Attachment #1: slub_replace_ctor_field --]
[-- Type: text/plain, Size: 1551 bytes --]
Create an ops file, /sys/slab/*/ops, that shows all the operations
defined on a slab cache. It will be used to display the additional
operations that subsequent patches define.
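The rewritten show routine accumulates an offset while appending each registered operation to the buffer. A userspace sketch of the same pattern, with plain `sprintf()` standing in for the kernel's `sprint_symbol()` and an illustrative `ops_show_len()` wrapper:

```c
#include <stdio.h>

/* Append a "ctor : <symbol>" line to buf, accumulating the offset x
 * exactly as the patched ops_show() does; returns bytes written,
 * which is what a sysfs show() routine must return. */
static int ops_show(char *buf, const char *ctor_name)
{
	int x = 0;

	if (ctor_name) {
		x += sprintf(buf + x, "ctor : ");
		x += sprintf(buf + x, "%s", ctor_name); /* sprint_symbol() stand-in */
		x += sprintf(buf + x, "\n");
	}
	return x;
}

/* convenience wrapper for testing: returns the output length */
static int ops_show_len(const char *ctor_name)
{
	char buf[128] = "";

	return ops_show(buf, ctor_name);
}
```

A cache without a constructor produces an empty file (length 0), which is why the file can later grow extra lines for get() and kick() without changing its format.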
Reviewed-by: Rik van Riel <riel@redhat.com>
Signed-off-by: Christoph Lameter <clameter@sgi.com>
Signed-off-by: Pekka Enberg <penberg@cs.helsinki.fi>
Signed-off-by: Christoph Lameter <cl@linux-foundation.org>
---
mm/slub.c | 16 +++++++++-------
1 file changed, 9 insertions(+), 7 deletions(-)
Index: linux-2.6/mm/slub.c
===================================================================
--- linux-2.6.orig/mm/slub.c 2010-01-29 10:27:09.000000000 -0600
+++ linux-2.6/mm/slub.c 2010-01-29 10:27:14.000000000 -0600
@@ -4089,16 +4089,18 @@ static ssize_t min_partial_store(struct
}
SLAB_ATTR(min_partial);
-static ssize_t ctor_show(struct kmem_cache *s, char *buf)
+static ssize_t ops_show(struct kmem_cache *s, char *buf)
{
- if (s->ctor) {
- int n = sprint_symbol(buf, (unsigned long)s->ctor);
+ int x = 0;
- return n + sprintf(buf + n, "\n");
+ if (s->ctor) {
+ x += sprintf(buf + x, "ctor : ");
+ x += sprint_symbol(buf + x, (unsigned long)s->ctor);
+ x += sprintf(buf + x, "\n");
}
- return 0;
+ return x;
}
-SLAB_ATTR_RO(ctor);
+SLAB_ATTR_RO(ops);
static ssize_t aliases_show(struct kmem_cache *s, char *buf)
{
@@ -4448,7 +4450,7 @@ static struct attribute *slab_attrs[] =
&slabs_attr.attr,
&partial_attr.attr,
&cpu_slabs_attr.attr,
- &ctor_attr.attr,
+ &ops_attr.attr,
&aliases_attr.attr,
&align_attr.attr,
&sanity_checks_attr.attr,
--
* slub: Add get() and kick() methods
From: Christoph Lameter @ 2010-01-29 20:49 UTC (permalink / raw)
To: Andi Kleen
Cc: Dave Chinner, Christoph Lameter, Pekka Enberg, Rik van Riel, akpm,
Miklos Szeredi, Nick Piggin, Hugh Dickins, linux-kernel
[-- Attachment #1: slub_add_get_and_kick --]
[-- Type: text/plain, Size: 5619 bytes --]
Add the two methods needed for defragmentation, and display them via
the sysfs ops file.
Add documentation explaining the use of these methods, and add the
prototypes to slab.h. Add functions to set up the defrag methods for a
slab cache.
Add empty stubs for SLAB/SLOB. The API is generic, so it could in
theory be implemented for either allocator.
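As a rough model of the registration API, the sketch below mirrors the struct fields and setup function from the patch in userspace; an error return stands in for the kernel's BUG_ON(!s->ctor), and the callback bodies are illustrative stubs:

```c
#include <stddef.h>

/* cut-down model of the kmem_cache fields this patch adds */
struct kmem_cache {
	void (*ctor)(void *);
	void *(*get)(struct kmem_cache *, int, void **);
	void (*kick)(struct kmem_cache *, int, void **, void *);
};

/* Defragmentable caches must have a ctor, otherwise objects would be
 * in an undetermined state after allocation; refuse registration
 * (the kernel version BUG()s instead of returning an error). */
static int kmem_cache_setup_defrag(struct kmem_cache *s,
		void *(*get)(struct kmem_cache *, int, void **),
		void (*kick)(struct kmem_cache *, int, void **, void *))
{
	if (!s->ctor)
		return -1;
	s->get = get;
	s->kick = kick;
	return 0;
}

/* illustrative no-op callbacks */
static void my_ctor(void *obj) { (void)obj; }
static void *my_get(struct kmem_cache *s, int nr, void **v)
{ (void)s; (void)nr; (void)v; return NULL; }
static void my_kick(struct kmem_cache *s, int nr, void **v, void *p)
{ (void)s; (void)nr; (void)v; (void)p; }

/* returns 0 when registration succeeded */
static int try_setup(int with_ctor)
{
	struct kmem_cache s = { 0 };

	if (with_ctor)
		s.ctor = my_ctor;
	return kmem_cache_setup_defrag(&s, my_get, my_kick);
}
```

This also shows why caches with get()/kick() become unmergeable: two caches sharing storage could not carry distinct callback pointers.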
Reviewed-by: Rik van Riel <riel@redhat.com>
Signed-off-by: Christoph Lameter <clameter@sgi.com>
Signed-off-by: Pekka Enberg <penberg@cs.helsinki.fi>
Signed-off-by: Christoph Lameter <cl@linux-foundation.org>
---
include/linux/slab.h | 50 +++++++++++++++++++++++++++++++++++++++++++++++
include/linux/slub_def.h | 3 ++
mm/slub.c | 29 ++++++++++++++++++++++++++-
3 files changed, 81 insertions(+), 1 deletion(-)
Index: linux-2.6/include/linux/slub_def.h
===================================================================
--- linux-2.6.orig/include/linux/slub_def.h 2010-01-29 10:27:09.000000000 -0600
+++ linux-2.6/include/linux/slub_def.h 2010-01-29 10:27:17.000000000 -0600
@@ -88,6 +88,9 @@ struct kmem_cache {
gfp_t allocflags; /* gfp flags to use on each alloc */
int refcount; /* Refcount for slab cache destroy */
void (*ctor)(void *);
+ kmem_defrag_get_func *get;
+ kmem_defrag_kick_func *kick;
+
int inuse; /* Offset to metadata */
int align; /* Alignment */
unsigned long min_partial;
Index: linux-2.6/mm/slub.c
===================================================================
--- linux-2.6.orig/mm/slub.c 2010-01-29 10:27:14.000000000 -0600
+++ linux-2.6/mm/slub.c 2010-01-29 10:27:17.000000000 -0600
@@ -2976,6 +2976,19 @@ void kfree(const void *x)
}
EXPORT_SYMBOL(kfree);
+void kmem_cache_setup_defrag(struct kmem_cache *s,
+ kmem_defrag_get_func get, kmem_defrag_kick_func kick)
+{
+ /*
+ * Defragmentable slabs must have a ctor otherwise objects may be
+ * in an undetermined state after they are allocated.
+ */
+ BUG_ON(!s->ctor);
+ s->get = get;
+ s->kick = kick;
+}
+EXPORT_SYMBOL(kmem_cache_setup_defrag);
+
/*
* kmem_cache_shrink removes empty slabs from the partial lists and sorts
* the remaining slabs by the number of items in use. The slabs with the
@@ -3288,7 +3301,7 @@ static int slab_unmergeable(struct kmem_
if (slub_nomerge || (s->flags & SLUB_NEVER_MERGE))
return 1;
- if (s->ctor)
+ if (s->ctor || s->kick || s->get)
return 1;
/*
@@ -4098,6 +4111,20 @@ static ssize_t ops_show(struct kmem_cach
x += sprint_symbol(buf + x, (unsigned long)s->ctor);
x += sprintf(buf + x, "\n");
}
+
+ if (s->get) {
+ x += sprintf(buf + x, "get : ");
+ x += sprint_symbol(buf + x,
+ (unsigned long)s->get);
+ x += sprintf(buf + x, "\n");
+ }
+
+ if (s->kick) {
+ x += sprintf(buf + x, "kick : ");
+ x += sprint_symbol(buf + x,
+ (unsigned long)s->kick);
+ x += sprintf(buf + x, "\n");
+ }
return x;
}
SLAB_ATTR_RO(ops);
Index: linux-2.6/include/linux/slab.h
===================================================================
--- linux-2.6.orig/include/linux/slab.h 2009-11-13 09:34:39.000000000 -0600
+++ linux-2.6/include/linux/slab.h 2010-01-29 10:27:17.000000000 -0600
@@ -140,6 +140,56 @@ void kzfree(const void *);
size_t ksize(const void *);
/*
+ * Function prototypes passed to kmem_cache_defrag() to enable defragmentation
+ * and targeted reclaim in slab caches.
+ */
+
+/*
+ * kmem_cache_defrag_get_func() is called with locks held so that the slab
+ * objects cannot be freed. We are in an atomic context and no slab
+ * operations may be performed. The purpose of kmem_cache_defrag_get_func()
+ * is to obtain a stable refcount on the objects, so that they cannot be
+ * removed until kmem_cache_kick_func() has handled them.
+ *
+ * Parameters passed are the number of objects to process and an array of
+ * pointers to objects for which we need references.
+ *
+ * Returns a pointer that is passed to the kick function. If any objects
+ * cannot be moved then the pointer may indicate a failure and
+ * then kick can simply remove the references that were already obtained.
+ *
+ * The object pointer array passed is also passed to kmem_cache_defrag_kick().
+ * The function may remove objects from the array by setting pointers to
+ * NULL. This is useful if we can determine that an object is already about
+ * to be removed. In that case it is often impossible to obtain the necessary
+ * refcount.
+ */
+typedef void *kmem_defrag_get_func(struct kmem_cache *, int, void **);
+
+/*
+ * kmem_cache_defrag_kick_func is called with no locks held and interrupts
+ * enabled. Sleeping is possible. Any operation may be performed in kick().
+ * kmem_cache_defrag should free all the objects in the pointer array.
+ *
+ * Parameters passed are the number of objects in the array, the array of
+ * pointers to the objects and the pointer returned by kmem_cache_defrag_get().
+ *
+ * Success is checked by examining the number of remaining objects in the slab.
+ */
+typedef void kmem_defrag_kick_func(struct kmem_cache *, int, void **, void *);
+
+/*
+ * kmem_cache_setup_defrag() is used to setup callbacks for a slab cache.
+ */
+#ifdef CONFIG_SLUB
+void kmem_cache_setup_defrag(struct kmem_cache *, kmem_defrag_get_func,
+ kmem_defrag_kick_func);
+#else
+static inline void kmem_cache_setup_defrag(struct kmem_cache *s,
+ kmem_defrag_get_func get, kmem_defrag_kick_func kick) {}
+#endif
+
+/*
* Allocator specific definitions. These are mainly used to establish optimized
* ways to convert kmalloc() calls to kmem_cache_alloc() invocations by
* selecting the appropriate general cache at compile time.
--
* slub: Sort slab cache list and establish maximum objects for defrag slabs
From: Christoph Lameter @ 2010-01-29 20:49 UTC (permalink / raw)
To: Andi Kleen
Cc: Dave Chinner, Christoph Lameter, Pekka Enberg, Rik van Riel, akpm,
Miklos Szeredi, Nick Piggin, Hugh Dickins, linux-kernel
[-- Attachment #1: slub_sort_slab_cache_list --]
[-- Type: text/plain, Size: 2781 bytes --]
When defragmenting slabs, it is advantageous to have all
defragmentable slab caches together at the beginning of the list so that
there is no need to scan the complete list. Put defragmentable caches first
when adding a slab cache and others last.
Determine the maximum number of objects in defragmentable slabs. This allows
sizing the arrays that will later hold references to these objects.
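alloc_scratch() in this patch sizes the buffer as one pointer per object plus a bitmap rounded up to whole longs. A userspace sketch of that arithmetic, with BITS_TO_LONGS reimplemented for illustration:

```c
#include <stddef.h>
#include <limits.h>	/* CHAR_BIT */

#define BITS_PER_LONG	((int)(sizeof(unsigned long) * CHAR_BIT))
#define BITS_TO_LONGS(n) (((n) + BITS_PER_LONG - 1) / BITS_PER_LONG)

/* Mirrors alloc_scratch() sizing: room for max_objects object
 * pointers, plus a bitmap with one bit per object rounded up to
 * whole unsigned longs. */
static size_t scratch_size(unsigned int max_objects)
{
	return max_objects * sizeof(void *) +
	       BITS_TO_LONGS(max_objects) * sizeof(unsigned long);
}
```

Because max_defrag_slab_objects only ever grows (it tracks the largest oo_objects(s->max) seen), a single scratch allocation of this size is sufficient for any defragmentable cache.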
Reviewed-by: Rik van Riel <riel@redhat.com>
Signed-off-by: Christoph Lameter <clameter@sgi.com>
Signed-off-by: Pekka Enberg <penberg@cs.helsinki.fi>
Signed-off-by: Christoph Lameter <cl@linux-foundation.org>
---
mm/slub.c | 26 ++++++++++++++++++++++++--
1 file changed, 24 insertions(+), 2 deletions(-)
Index: linux-2.6/mm/slub.c
===================================================================
--- linux-2.6.orig/mm/slub.c 2010-01-29 10:27:17.000000000 -0600
+++ linux-2.6/mm/slub.c 2010-01-29 10:27:21.000000000 -0600
@@ -189,6 +189,9 @@ static enum {
static DECLARE_RWSEM(slub_lock);
static LIST_HEAD(slab_caches);
+/* Maximum objects in defragmentable slabs */
+static unsigned int max_defrag_slab_objects;
+
/*
* Tracking user of a slab.
*/
@@ -2707,7 +2710,7 @@ static struct kmem_cache *create_kmalloc
flags, NULL))
goto panic;
- list_add(&s->list, &slab_caches);
+ list_add_tail(&s->list, &slab_caches);
if (sysfs_slab_add(s))
goto panic;
@@ -2976,9 +2979,23 @@ void kfree(const void *x)
}
EXPORT_SYMBOL(kfree);
+/*
+ * Allocate a slab scratch space that is sufficient to keep at least
+ * max_defrag_slab_objects pointers to individual objects and also a bitmap
+ * for max_defrag_slab_objects.
+ */
+static inline void *alloc_scratch(void)
+{
+ return kmalloc(max_defrag_slab_objects * sizeof(void *) +
+ BITS_TO_LONGS(max_defrag_slab_objects) * sizeof(unsigned long),
+ GFP_KERNEL);
+}
+
void kmem_cache_setup_defrag(struct kmem_cache *s,
kmem_defrag_get_func get, kmem_defrag_kick_func kick)
{
+ int max_objects = oo_objects(s->max);
+
/*
* Defragmentable slabs must have a ctor otherwise objects may be
* in an undetermined state after they are allocated.
@@ -2986,6 +3003,11 @@ void kmem_cache_setup_defrag(struct kmem
BUG_ON(!s->ctor);
s->get = get;
s->kick = kick;
+ down_write(&slub_lock);
+ list_move(&s->list, &slab_caches);
+ if (max_objects > max_defrag_slab_objects)
+ max_defrag_slab_objects = max_objects;
+ up_write(&slub_lock);
}
EXPORT_SYMBOL(kmem_cache_setup_defrag);
@@ -3397,7 +3419,7 @@ struct kmem_cache *kmem_cache_create(con
if (s) {
if (kmem_cache_open(s, GFP_KERNEL, name,
size, align, flags, ctor)) {
- list_add(&s->list, &slab_caches);
+ list_add_tail(&s->list, &slab_caches);
up_write(&slub_lock);
if (sysfs_slab_add(s)) {
down_write(&slub_lock);
--
* slub: Slab defrag core
From: Christoph Lameter @ 2010-01-29 20:49 UTC (permalink / raw)
To: Andi Kleen
Cc: Dave Chinner, Christoph Lameter, Pekka Enberg, Rik van Riel, akpm,
Miklos Szeredi, Nick Piggin, Hugh Dickins, linux-kernel
[-- Attachment #1: slub_defrag_core --]
[-- Type: text/plain, Size: 13052 bytes --]
Slab defragmentation may occur:
1. Unconditionally when kmem_cache_shrink() is called on a slab cache.
2. Through the use of the slabinfo command.
3. Conditionally per node when kmem_cache_defrag(<node>) is called
(can be called from reclaim code with a later patch).
Defragmentation is only performed if the fragmentation of the slab
is lower than the specified percentage. Fragmentation ratios are measured
by calculating the percentage of objects in use compared to the total
number of objects that the slab page can accommodate.
The scanning of slab caches is optimized because the
defragmentable slabs come first on the list. Thus we can terminate scans
on the first slab encountered that does not support defragmentation.
kmem_cache_defrag() takes a node parameter. This can either be -1 if
defragmentation should be performed on all nodes, or a node number.
A couple of functions must be set up via a call to kmem_cache_setup_defrag()
in order for a slab cache to support defragmentation. These are
kmem_defrag_get_func (void *get(struct kmem_cache *s, int nr, void **objects))
Must obtain a reference to the listed objects. SLUB guarantees that
the objects are still allocated. However, other threads may be blocked
in slab_free() attempting to free objects in the slab. These may succeed
as soon as get() returns to the slab allocator. The function must
be able to detect such situations and void the attempts to free such
objects (by for example voiding the corresponding entry in the objects
array).
No slab operations may be performed in get(). Interrupts
are disabled. What can be done is very limited. The slab lock
for the page that contains the object is taken. Any attempt to perform
a slab operation may lead to a deadlock.
kmem_defrag_get_func returns a private pointer that is passed to
kmem_defrag_kick_func(). Should we be unable to obtain all references,
that pointer may indicate to the kick() function that it should not
attempt any object removal or move, but simply drop the references
that were already obtained.
kmem_defrag_kick_func (void kick(struct kmem_cache *, int nr, void **objects,
void *get_result))
After SLUB has established references to the objects in a
slab it will then drop all locks and use kick() to move objects out
of the slab. The existence of the object is guaranteed by virtue of
the earlier obtained references via kmem_defrag_get_func(). The
callback may perform any slab operation since no locks are held at
the time of call.
The callback should remove the object from the slab in some way. This
may be accomplished by reclaiming the object and then running
kmem_cache_free() or reallocating it and then running
kmem_cache_free(). Reallocation is advantageous because the partial
list was just sorted to put the slabs with the most objects first.
Reallocation is likely to fill up a slab in addition to freeing up
one slab. A filled-up slab can also be removed from the partial
list, so there can be a double effect.
kmem_defrag_kick_func() does not return a result. SLUB will check
the number of remaining objects in the slab. If all objects were
removed then the slab is freed and we have reduced the overall
fragmentation of the slab cache.
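The bookkeeping in the middle of kmem_cache_vacate() fills a bitmap, clears the bits of free objects, then collects pointers to the remaining in-use objects into the vector handed to get() and kick(). A self-contained userspace sketch of that logic (a byte array stands in for the kernel bitmap, and the free-slot pattern is fixed for illustration):

```c
#include <string.h>

#define OBJECTS 8

/* Returns the number of in-use objects collected into the vector,
 * or -1 if the collection order is wrong. Models the bitmap_fill /
 * __clear_bit / for_each_object sequence in kmem_cache_vacate(). */
static int demo_vacate_count(void)
{
	char objects[OBJECTS][16];	/* stands in for the slab page */
	void *vector[OBJECTS];		/* scratch: pointers for get()/kick() */
	unsigned char map[OBJECTS];	/* stands in for the scratch bitmap */
	int is_free[OBJECTS] = { 0, 1, 0, 1, 1, 0, 1, 1 }; /* 3 in use */
	int i, count = 0;

	memset(map, 1, sizeof(map));	/* bitmap_fill(): assume all in use */
	for (i = 0; i < OBJECTS; i++)	/* for_each_free_object() */
		if (is_free[i])
			map[i] = 0;	/* __clear_bit() per free object */

	memset(vector, 0, sizeof(vector));
	for (i = 0; i < OBJECTS; i++)	/* for_each_object() */
		if (map[i])
			vector[count++] = objects[i];

	/* the vector passed to get()/kick() holds slots 0, 2 and 5 */
	if (vector[0] != objects[0] || vector[1] != objects[2] ||
	    vector[2] != objects[5])
		return -1;
	return count;
}
```

Only after this vector is built does the real code drop the slab lock, which is why get() must pin the objects before kick() runs without locks.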
Reviewed-by: Rik van Riel <riel@redhat.com>
Signed-off-by: Christoph Lameter <clameter@sgi.com>
Signed-off-by: Pekka Enberg <penberg@cs.helsinki.fi>
Signed-off-by: Christoph Lameter <cl@linux-foundation.org>
---
include/linux/slab.h | 3
mm/slub.c | 265 ++++++++++++++++++++++++++++++++++++++++-----------
2 files changed, 215 insertions(+), 53 deletions(-)
Index: linux-2.6/mm/slub.c
===================================================================
--- linux-2.6.orig/mm/slub.c 2010-01-29 10:27:21.000000000 -0600
+++ linux-2.6/mm/slub.c 2010-01-29 10:27:24.000000000 -0600
@@ -132,10 +132,10 @@
/*
* Maximum number of desirable partial slabs.
- * The existence of more partial slabs makes kmem_cache_shrink
- * sort the partial list by the number of objects in the.
+ * More slabs cause kmem_cache_shrink to sort the slabs by objects
+ * and trigger slab defragmentation.
*/
-#define MAX_PARTIAL 10
+#define MAX_PARTIAL 20
#define DEBUG_DEFAULT_FLAGS (SLAB_DEBUG_FREE | SLAB_RED_ZONE | \
SLAB_POISON | SLAB_STORE_USER)
@@ -3012,76 +3012,235 @@ void kmem_cache_setup_defrag(struct kmem
EXPORT_SYMBOL(kmem_cache_setup_defrag);
/*
- * kmem_cache_shrink removes empty slabs from the partial lists and sorts
- * the remaining slabs by the number of items in use. The slabs with the
- * most items in use come first. New allocations will then fill those up
- * and thus they can be removed from the partial lists.
+ * Vacate all objects in the given slab.
*
- * The slabs with the least items are placed last. This results in them
- * being allocated from last increasing the chance that the last objects
- * are freed in them.
+ * The scratch area passed to this function is sufficient to hold
+ * void ** times objects per slab, plus a bitmap with one bit
+ * for each object.
*/
-int kmem_cache_shrink(struct kmem_cache *s)
+static int kmem_cache_vacate(struct page *page, void *scratch)
{
- int node;
- int i;
- struct kmem_cache_node *n;
- struct page *page;
- struct page *t;
- int objects = oo_objects(s->max);
- struct list_head *slabs_by_inuse =
- kmalloc(sizeof(struct list_head) * objects, GFP_KERNEL);
+ void **vector = scratch;
+ void *p;
+ void *addr = page_address(page);
+ struct kmem_cache *s;
+ unsigned long *map;
+ int leftover;
+ int count;
+ void *private;
unsigned long flags;
+ unsigned long objects;
- if (!slabs_by_inuse)
- return -ENOMEM;
+ local_irq_save(flags);
+ slab_lock(page);
- flush_all(s);
- for_each_node_state(node, N_NORMAL_MEMORY) {
- n = get_node(s, node);
+ BUG_ON(!PageSlab(page)); /* Must be a slab page */
+ BUG_ON(!SlabFrozen(page)); /* Slab must have been frozen earlier */
+
+ s = page->slab;
+ objects = page->objects;
+ map = scratch + objects * sizeof(void **);
+ if (!page->inuse || !s->kick)
+ goto out;
+
+ /* Determine used objects */
+ bitmap_fill(map, objects);
+ for_each_free_object(p, s, page->freelist)
+ __clear_bit(slab_index(p, s, addr), map);
+
+ /* Build vector of pointers to objects */
+ count = 0;
+ memset(vector, 0, objects * sizeof(void **));
+ for_each_object(p, s, addr, objects)
+ if (test_bit(slab_index(p, s, addr), map))
+ vector[count++] = p;
+
+ private = s->get(s, count, vector);
+
+ /*
+ * Got references. Now we can drop the slab lock. The slab
+ * is frozen so it cannot vanish from under us nor will
+ * allocations be performed on the slab. However, unlocking the
+ * slab will allow concurrent slab_frees to proceed.
+ */
+ slab_unlock(page);
+ local_irq_restore(flags);
+
+ /*
+ * Perform the KICK callbacks to remove the objects.
+ */
+ s->kick(s, count, vector, private);
+
+ local_irq_save(flags);
+ slab_lock(page);
+out:
+ /*
+ * Check the result and unfreeze the slab
+ */
+ leftover = page->inuse;
+ unfreeze_slab(s, page, leftover > 0);
+ local_irq_restore(flags);
+ return leftover;
+}
+
+/*
+ * Remove objects from a list of slab pages that have been gathered.
+ * Must be called with slabs that have been isolated before.
+ *
+ * kmem_cache_reclaim() is never called from an atomic context. It
+ * allocates memory for temporary storage. We are holding the
+ * slub_lock semaphore which prevents another call into
+ * the defrag logic.
+ */
+int kmem_cache_reclaim(struct list_head *zaplist)
+{
+ int freed = 0;
+ void **scratch;
+ struct page *page;
+ struct page *page2;
+
+ if (list_empty(zaplist))
+ return 0;
+
+ scratch = alloc_scratch();
+ if (!scratch)
+ return 0;
+
+ list_for_each_entry_safe(page, page2, zaplist, lru) {
+ list_del(&page->lru);
+ if (kmem_cache_vacate(page, scratch) == 0)
+ freed++;
+ }
+ kfree(scratch);
+ return freed;
+}
+
+/*
+ * Shrink the slab cache on a particular node of the cache
+ * by releasing slabs with zero objects and trying to reclaim
+ * slabs with less than the configured percentage of objects allocated.
+ */
+static unsigned long __kmem_cache_shrink(struct kmem_cache *s, int node,
+ unsigned long limit)
+{
+ unsigned long flags;
+ struct page *page, *page2;
+ LIST_HEAD(zaplist);
+ int freed = 0;
+ struct kmem_cache_node *n = get_node(s, node);
- if (!n->nr_partial)
+ if (n->nr_partial <= limit)
+ return 0;
+
+ spin_lock_irqsave(&n->list_lock, flags);
+ list_for_each_entry_safe(page, page2, &n->partial, lru) {
+ if (!slab_trylock(page))
+ /* Busy slab. Get out of the way */
continue;
- for (i = 0; i < objects; i++)
- INIT_LIST_HEAD(slabs_by_inuse + i);
+ if (page->inuse) {
+ if (page->inuse * 100 >=
+ s->defrag_ratio * page->objects) {
+ slab_unlock(page);
+ /* Slab contains enough objects */
+ continue;
+ }
- spin_lock_irqsave(&n->list_lock, flags);
+ list_move(&page->lru, &zaplist);
+ if (s->kick) {
+ n->nr_partial--;
+ SetSlabFrozen(page);
+ }
+ slab_unlock(page);
+ } else {
+ /* Empty slab page */
+ list_del(&page->lru);
+ n->nr_partial--;
+ slab_unlock(page);
+ discard_slab(s, page);
+ freed++;
+ }
+ }
+ if (!s->kick)
/*
- * Build lists indexed by the items in use in each slab.
+ * No defrag methods. By simply putting the zaplist at the
+ * end of the partial list we can let them simmer longer
+ * and thus increase the chance of all objects being
+ * reclaimed.
*
- * Note that concurrent frees may occur while we hold the
- * list_lock. page->inuse here is the upper limit.
+ * We have effectively sorted the partial list and put
+ * the slabs with more objects first. As soon as they
+ * are allocated they are going to be removed from the
+ * partial list.
*/
- list_for_each_entry_safe(page, t, &n->partial, lru) {
- if (!page->inuse && slab_trylock(page)) {
- /*
- * Must hold slab lock here because slab_free
- * may have freed the last object and be
- * waiting to release the slab.
- */
- list_del(&page->lru);
- n->nr_partial--;
- slab_unlock(page);
- discard_slab(s, page);
- } else {
- list_move(&page->lru,
- slabs_by_inuse + page->inuse);
- }
- }
+ list_splice(&zaplist, n->partial.prev);
+
+
+ spin_unlock_irqrestore(&n->list_lock, flags);
+
+ if (s->kick)
+ freed += kmem_cache_reclaim(&zaplist);
+
+ return freed;
+}
+
+/*
+ * Defrag slabs conditional on the amount of fragmentation in a page.
+ */
+int kmem_cache_defrag(int node)
+{
+ struct kmem_cache *s;
+ unsigned long slabs = 0;
+
+ /*
+ * kmem_cache_defrag may be called from the reclaim path which may be
+ * called for any page allocator alloc. So there is the danger that we
+ * get called in a situation where slub already acquired the slub_lock
+ * for other purposes.
+ */
+ if (!down_read_trylock(&slub_lock))
+ return 0;
+
+ list_for_each_entry(s, &slab_caches, list) {
+ unsigned long reclaimed = 0;
/*
- * Rebuild the partial list with the slabs filled up most
- * first and the least used slabs at the end.
+ * Defragmentable caches come first. If the slab cache is not
+ * defragmentable then we can stop traversing the list.
*/
- for (i = objects - 1; i >= 0; i--)
- list_splice(slabs_by_inuse + i, n->partial.prev);
+ if (!s->kick)
+ break;
- spin_unlock_irqrestore(&n->list_lock, flags);
+ if (node == -1) {
+ int nid;
+
+ for_each_node_state(nid, N_NORMAL_MEMORY)
+ reclaimed += __kmem_cache_shrink(s, nid,
+ MAX_PARTIAL);
+ } else
+ reclaimed = __kmem_cache_shrink(s, node, MAX_PARTIAL);
+
+ slabs += reclaimed;
}
+ up_read(&slub_lock);
+ return slabs;
+}
+EXPORT_SYMBOL(kmem_cache_defrag);
+
+/*
+ * kmem_cache_shrink removes empty slabs from the partial lists.
+ * If the slab cache supports defragmentation then objects are
+ * reclaimed.
+ */
+int kmem_cache_shrink(struct kmem_cache *s)
+{
+ int node;
+
+ flush_all(s);
+ for_each_node_state(node, N_NORMAL_MEMORY)
+ __kmem_cache_shrink(s, node, 0);
- kfree(slabs_by_inuse);
return 0;
}
EXPORT_SYMBOL(kmem_cache_shrink);
Index: linux-2.6/include/linux/slab.h
===================================================================
--- linux-2.6.orig/include/linux/slab.h 2010-01-29 10:27:17.000000000 -0600
+++ linux-2.6/include/linux/slab.h 2010-01-29 10:27:24.000000000 -0600
@@ -180,13 +180,16 @@ typedef void kmem_defrag_kick_func(struc
/*
* kmem_cache_setup_defrag() is used to setup callbacks for a slab cache.
+ * kmem_cache_defrag() performs the actual defragmentation.
*/
#ifdef CONFIG_SLUB
void kmem_cache_setup_defrag(struct kmem_cache *, kmem_defrag_get_func,
kmem_defrag_kick_func);
+int kmem_cache_defrag(int node);
#else
static inline void kmem_cache_setup_defrag(struct kmem_cache *s,
kmem_defrag_get_func get, kmem_defrag_kick_func kick) {}
+static inline int kmem_cache_defrag(int node) { return 0; }
#endif
/*
--
* slub: Add KICKABLE to avoid repeated kick() attempts
From: Christoph Lameter @ 2010-01-29 20:49 UTC (permalink / raw)
To: Andi Kleen
Cc: Dave Chinner, Christoph Lameter, Pekka Enberg, Rik van Riel, akpm,
Miklos Szeredi, Nick Piggin, Hugh Dickins, linux-kernel
[-- Attachment #1: slub_add_kickable --]
[-- Type: text/plain, Size: 3829 bytes --]
Add a KICKABLE page flag that is set on slabs of caches with a
defragmentation method. Clear the flag if a kick action does not
succeed in reducing the number of objects in a slab. This avoids
future attempts to kick objects out.
The KICKABLE flag is set again when all objects of the slab have been
allocated (this occurs during removal of a slab from the partial lists).
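The flag's lifecycle can be modeled in a few lines (a userspace sketch; the real code uses atomic page-flag bitops on struct page, and the names below are illustrative):

```c
#define PG_SLUB_KICKABLE 0x1u

/* Flags of a fresh slab page after one vacate attempt: has_kick sets
 * KICKABLE at slab creation, and a nonzero leftover (objects still in
 * the slab after kick() ran) clears it so the slab is skipped by
 * future defrag passes. */
static unsigned int kickable_after(int has_kick, int leftover)
{
	unsigned int flags = has_kick ? PG_SLUB_KICKABLE : 0u;

	if (leftover)			/* unsuccessful reclaim */
		flags &= ~PG_SLUB_KICKABLE;
	return flags;
}
```

A successful vacate (leftover == 0) frees the page anyway, so the flag only matters on the failure path, where clearing it prevents repeated futile kick() attempts on the same slab.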
Reviewed-by: Rik van Riel <riel@redhat.com>
Signed-off-by: Christoph Lameter <clameter@sgi.com>
Signed-off-by: Pekka Enberg <penberg@cs.helsinki.fi>
Signed-off-by: Christoph Lameter <cl@linux-foundation.org>
---
include/linux/page-flags.h | 2 ++
mm/slub.c | 23 ++++++++++++++++++-----
2 files changed, 20 insertions(+), 5 deletions(-)
Index: slab-2.6/mm/slub.c
===================================================================
--- slab-2.6.orig/mm/slub.c 2010-01-22 15:47:48.000000000 -0600
+++ slab-2.6/mm/slub.c 2010-01-22 15:49:30.000000000 -0600
@@ -1168,6 +1168,9 @@ static struct page *new_slab(struct kmem
SLAB_STORE_USER | SLAB_TRACE))
__SetPageSlubDebug(page);
+ if (s->kick)
+ __SetPageSlubKickable(page);
+
start = page_address(page);
if (unlikely(s->flags & SLAB_POISON))
@@ -1210,6 +1213,7 @@ static void __free_slab(struct kmem_cach
NR_SLAB_RECLAIMABLE : NR_SLAB_UNRECLAIMABLE,
-pages);
+ __ClearPageSlubKickable(page);
__ClearPageSlab(page);
reset_page_mapcount(page);
if (current->reclaim_state)
@@ -1421,6 +1425,8 @@ static void unfreeze_slab(struct kmem_ca
if (SLABDEBUG && PageSlubDebug(page) &&
(s->flags & SLAB_STORE_USER))
add_full(n, page);
+ if (s->kick)
+ __SetPageSlubKickable(page);
}
slab_unlock(page);
} else {
@@ -2905,12 +2911,12 @@ static int kmem_cache_vacate(struct page
slab_lock(page);
BUG_ON(!PageSlab(page)); /* Must be s slab page */
- BUG_ON(!SlabFrozen(page)); /* Slab must have been frozen earlier */
+ BUG_ON(!PageSlubFrozen(page)); /* Slab must have been frozen earlier */
s = page->slab;
objects = page->objects;
map = scratch + objects * sizeof(void **);
- if (!page->inuse || !s->kick)
+ if (!page->inuse || !s->kick || !PageSlubKickable(page))
goto out;
/* Determine used objects */
@@ -2948,6 +2954,9 @@ out:
* Check the result and unfreeze the slab
*/
leftover = page->inuse;
+ if (leftover)
+ /* Unsuccessful reclaim. Avoid future reclaim attempts. */
+ __ClearPageSlubKickable(page);
unfreeze_slab(s, page, leftover > 0);
local_irq_restore(flags);
return leftover;
@@ -3009,17 +3018,21 @@ static unsigned long __kmem_cache_shrink
continue;
if (page->inuse) {
- if (page->inuse * 100 >=
+ if (!PageSlubKickable(page) || page->inuse * 100 >=
s->defrag_ratio * page->objects) {
slab_unlock(page);
- /* Slab contains enough objects */
+ /*
+ * Slab contains enough objects
+ * or we already tried reclaim before and
+ * it failed. Skip this one.
+ */
continue;
}
list_move(&page->lru, &zaplist);
if (s->kick) {
n->nr_partial--;
- SetSlabFrozen(page);
+ __SetPageSlubFrozen(page);
}
slab_unlock(page);
} else {
Index: slab-2.6/include/linux/page-flags.h
===================================================================
--- slab-2.6.orig/include/linux/page-flags.h 2010-01-22 15:09:43.000000000 -0600
+++ slab-2.6/include/linux/page-flags.h 2010-01-22 15:49:30.000000000 -0600
@@ -129,6 +129,7 @@ enum pageflags {
/* SLUB */
PG_slub_frozen = PG_active,
PG_slub_debug = PG_error,
+ PG_slub_kickable = PG_dirty,
};
#ifndef __GENERATING_BOUNDS_H
@@ -216,6 +217,7 @@ __PAGEFLAG(SlobFree, slob_free)
__PAGEFLAG(SlubFrozen, slub_frozen)
__PAGEFLAG(SlubDebug, slub_debug)
+__PAGEFLAG(SlubKickable, slub_kickable)
/*
* Private page markings that may be used by the filesystem that owns the page
--
* slub: Extend slabinfo to support -D and -F options
2010-01-29 20:49 Slab Fragmentation Reduction V15 Christoph Lameter
` (5 preceding siblings ...)
2010-01-29 20:49 ` slub: Add KICKABLE to avoid repeated kick() attempts Christoph Lameter
@ 2010-01-29 20:49 ` Christoph Lameter
2010-01-29 20:49 ` slub/slabinfo: add defrag statistics Christoph Lameter
` (12 subsequent siblings)
19 siblings, 0 replies; 56+ messages in thread
From: Christoph Lameter @ 2010-01-29 20:49 UTC (permalink / raw)
To: Andi Kleen
Cc: Dave Chinner, Christoph Lameter, Pekka Enberg, Rik van Riel, akpm,
Miklos Szeredi, Nick Piggin, Hugh Dickins, linux-kernel
[-- Attachment #1: slub_extend_slabinfo --]
[-- Type: text/plain, Size: 6185 bytes --]
-F lists caches that support defragmentation.
-C lists caches that use a ctor.
Change field names for defrag_ratio and remote_node_defrag_ratio.
Add determination of the allocation ratio for a slab. The allocation ratio
is the percentage of available object slots that are in use.
Reviewed-by: Rik van Riel <riel@redhat.com>
Signed-off-by: Christoph Lameter <clameter@sgi.com>
Signed-off-by: Pekka Enberg <penberg@cs.helsinki.fi>
Signed-off-by: Christoph Lameter <cl@linux-foundation.org>
---
Documentation/vm/slabinfo.c | 48 +++++++++++++++++++++++++++++++++++++++-----
1 file changed, 43 insertions(+), 5 deletions(-)
Index: slab-2.6/Documentation/vm/slabinfo.c
===================================================================
--- slab-2.6.orig/Documentation/vm/slabinfo.c 2010-01-22 15:09:38.000000000 -0600
+++ slab-2.6/Documentation/vm/slabinfo.c 2010-01-22 15:53:03.000000000 -0600
@@ -31,6 +31,8 @@ struct slabinfo {
int hwcache_align, object_size, objs_per_slab;
int sanity_checks, slab_size, store_user, trace;
int order, poison, reclaim_account, red_zone;
+ int defrag, ctor;
+ int defrag_ratio, remote_node_defrag_ratio;
unsigned long partial, objects, slabs, objects_partial, objects_total;
unsigned long alloc_fastpath, alloc_slowpath;
unsigned long free_fastpath, free_slowpath;
@@ -64,6 +66,8 @@ int show_slab = 0;
int skip_zero = 1;
int show_numa = 0;
int show_track = 0;
+int show_defrag = 0;
+int show_ctor = 0;
int show_first_alias = 0;
int validate = 0;
int shrink = 0;
@@ -100,13 +104,15 @@ static void fatal(const char *x, ...)
static void usage(void)
{
printf("slabinfo 5/7/2007. (c) 2007 sgi.\n\n"
- "slabinfo [-ahnpvtsz] [-d debugopts] [slab-regexp]\n"
+ "slabinfo [-aCdDefFhnpvtsz] [-d debugopts] [slab-regexp]\n"
"-a|--aliases Show aliases\n"
"-A|--activity Most active slabs first\n"
"-d<options>|--debug=<options> Set/Clear Debug options\n"
+ "-C|--ctor Show slabs with ctors\n"
"-D|--display-active Switch line format to activity\n"
"-e|--empty Show empty slabs\n"
"-f|--first-alias Show first alias\n"
+ "-F|--defrag Show defragmentable caches\n"
"-h|--help Show usage information\n"
"-i|--inverted Inverted list\n"
"-l|--slabs Show slabs\n"
@@ -296,7 +302,7 @@ static void first_line(void)
printf("Name Objects Alloc Free %%Fast Fallb O\n");
else
printf("Name Objects Objsize Space "
- "Slabs/Part/Cpu O/S O %%Fr %%Ef Flg\n");
+ "Slabs/Part/Cpu O/S O %%Ra %%Ef Flg\n");
}
/*
@@ -345,7 +351,7 @@ static void slab_numa(struct slabinfo *s
return;
if (!line) {
- printf("\n%-21s:", mode ? "NUMA nodes" : "Slab");
+ printf("\n%-21s: Rto ", mode ? "NUMA nodes" : "Slab");
for(node = 0; node <= highest_node; node++)
printf(" %4d", node);
printf("\n----------------------");
@@ -354,6 +360,7 @@ static void slab_numa(struct slabinfo *s
printf("\n");
}
printf("%-21s ", mode ? "All slabs" : s->name);
+ printf("%3d ", s->remote_node_defrag_ratio);
for(node = 0; node <= highest_node; node++) {
char b[20];
@@ -492,6 +499,8 @@ static void report(struct slabinfo *s)
printf("** Slabs are destroyed via RCU\n");
if (s->reclaim_account)
printf("** Reclaim accounting active\n");
+ if (s->defrag)
+ printf("** Defragmentation at %d%%\n", s->defrag_ratio);
printf("\nSizes (bytes) Slabs Debug Memory\n");
printf("------------------------------------------------------------------------\n");
@@ -539,6 +548,12 @@ static void slabcache(struct slabinfo *s
if (show_empty && s->slabs)
return;
+ if (show_defrag && !s->defrag)
+ return;
+
+ if (show_ctor && !s->ctor)
+ return;
+
store_size(size_str, slab_size(s));
snprintf(dist_str, 40, "%lu/%lu/%d", s->slabs - s->cpu_slabs,
s->partial, s->cpu_slabs);
@@ -550,6 +565,10 @@ static void slabcache(struct slabinfo *s
*p++ = '*';
if (s->cache_dma)
*p++ = 'd';
+ if (s->defrag)
+ *p++ = 'F';
+ if (s->ctor)
+ *p++ = 'C';
if (s->hwcache_align)
*p++ = 'A';
if (s->poison)
@@ -584,7 +603,8 @@ static void slabcache(struct slabinfo *s
printf("%-21s %8ld %7d %8s %14s %4d %1d %3ld %3ld %s\n",
s->name, s->objects, s->object_size, size_str, dist_str,
s->objs_per_slab, s->order,
- s->slabs ? (s->partial * 100) / s->slabs : 100,
+ s->slabs ? (s->partial * 100) /
+ (s->slabs * s->objs_per_slab) : 100,
s->slabs ? (s->objects * s->object_size * 100) /
(s->slabs * (page_size << s->order)) : 100,
flags);
@@ -1190,7 +1210,17 @@ static void read_slab_dir(void)
slab->deactivate_to_tail = get_obj("deactivate_to_tail");
slab->deactivate_remote_frees = get_obj("deactivate_remote_frees");
slab->order_fallback = get_obj("order_fallback");
+ slab->defrag_ratio = get_obj("defrag_ratio");
+ slab->remote_node_defrag_ratio =
+ get_obj("remote_node_defrag_ratio");
chdir("..");
+ if (read_slab_obj(slab, "ops")) {
+ if (strstr(buffer, "ctor :"))
+ slab->ctor = 1;
+ if (strstr(buffer, "kick :"))
+ slab->defrag = 1;
+ }
+
if (slab->name[0] == ':')
alias_targets++;
slab++;
@@ -1241,10 +1271,12 @@ static void output_slabs(void)
struct option opts[] = {
{ "aliases", 0, NULL, 'a' },
{ "activity", 0, NULL, 'A' },
+ { "ctor", 0, NULL, 'C' },
{ "debug", 2, NULL, 'd' },
{ "display-activity", 0, NULL, 'D' },
{ "empty", 0, NULL, 'e' },
{ "first-alias", 0, NULL, 'f' },
+ { "defrag", 0, NULL, 'F' },
{ "help", 0, NULL, 'h' },
{ "inverted", 0, NULL, 'i'},
{ "numa", 0, NULL, 'n' },
@@ -1267,7 +1299,7 @@ int main(int argc, char *argv[])
page_size = getpagesize();
- while ((c = getopt_long(argc, argv, "aAd::Defhil1noprstvzTS",
+ while ((c = getopt_long(argc, argv, "aACd::DefFhil1noprstvzTS",
opts, NULL)) != -1)
switch (c) {
case '1':
@@ -1323,6 +1355,12 @@ int main(int argc, char *argv[])
case 'z':
skip_zero = 0;
break;
+ case 'C':
+ show_ctor = 1;
+ break;
+ case 'F':
+ show_defrag = 1;
+ break;
case 'T':
show_totals = 1;
break;
--
* slub/slabinfo: add defrag statistics
2010-01-29 20:49 Slab Fragmentation Reduction V15 Christoph Lameter
` (6 preceding siblings ...)
2010-01-29 20:49 ` slub: Extend slabinfo to support -D and -F options Christoph Lameter
@ 2010-01-29 20:49 ` Christoph Lameter
2010-01-29 20:49 ` slub: Trigger defragmentation from memory reclaim Christoph Lameter
` (11 subsequent siblings)
19 siblings, 0 replies; 56+ messages in thread
From: Christoph Lameter @ 2010-01-29 20:49 UTC (permalink / raw)
To: Andi Kleen
Cc: Dave Chinner, Christoph Lameter, Pekka Enberg, Rik van Riel, akpm,
Miklos Szeredi, Nick Piggin, Hugh Dickins, linux-kernel
[-- Attachment #1: slub_add_defrag_stats --]
[-- Type: text/plain, Size: 9359 bytes --]
Add statistics counters for slab defragmentation.
Signed-off-by: Christoph Lameter <clameter@sgi.com>
Signed-off-by: Pekka Enberg <penberg@cs.helsinki.fi>
Signed-off-by: Christoph Lameter <cl@linux-foundation.org>
---
Documentation/vm/slabinfo.c | 45 ++++++++++++++++++++++++++++++++++++--------
include/linux/slub_def.h | 6 +++++
mm/slub.c | 24 ++++++++++++++++++++++-
3 files changed, 66 insertions(+), 9 deletions(-)
Index: slab-2.6/Documentation/vm/slabinfo.c
===================================================================
--- slab-2.6.orig/Documentation/vm/slabinfo.c 2010-01-22 15:53:03.000000000 -0600
+++ slab-2.6/Documentation/vm/slabinfo.c 2010-01-22 15:53:21.000000000 -0600
@@ -41,6 +41,9 @@ struct slabinfo {
unsigned long cpuslab_flush, deactivate_full, deactivate_empty;
unsigned long deactivate_to_head, deactivate_to_tail;
unsigned long deactivate_remote_frees, order_fallback;
+ unsigned long shrink_calls, shrink_attempt_defrag, shrink_empty_slab;
+ unsigned long shrink_slab_skipped, shrink_slab_reclaimed;
+ unsigned long shrink_object_reclaim_failed;
int numa[MAX_NODES];
int numa_partial[MAX_NODES];
} slabinfo[MAX_SLABS];
@@ -79,6 +82,7 @@ int sort_active = 0;
int set_debug = 0;
int show_ops = 0;
int show_activity = 0;
+int show_defragcount = 0;
/* Debug options */
int sanity = 0;
@@ -113,6 +117,7 @@ static void usage(void)
"-e|--empty Show empty slabs\n"
"-f|--first-alias Show first alias\n"
"-F|--defrag Show defragmentable caches\n"
+ "-G|--display-defrag Display defrag counters\n"
"-h|--help Show usage information\n"
"-i|--inverted Inverted list\n"
"-l|--slabs Show slabs\n"
@@ -300,6 +305,8 @@ static void first_line(void)
{
if (show_activity)
printf("Name Objects Alloc Free %%Fast Fallb O\n");
+ else if (show_defragcount)
+ printf("Name Objects DefragRQ Slabs Success Empty Skipped Failed\n");
else
printf("Name Objects Objsize Space "
"Slabs/Part/Cpu O/S O %%Ra %%Ef Flg\n");
@@ -466,22 +473,28 @@ static void slab_stats(struct slabinfo *
printf("Total %8lu %8lu\n\n", total_alloc, total_free);
- if (s->cpuslab_flush)
- printf("Flushes %8lu\n", s->cpuslab_flush);
-
- if (s->alloc_refill)
- printf("Refill %8lu\n", s->alloc_refill);
+ if (s->cpuslab_flush || s->alloc_refill)
+ printf("CPU Slab : Flushes=%lu Refills=%lu\n",
+ s->cpuslab_flush, s->alloc_refill);
total = s->deactivate_full + s->deactivate_empty +
s->deactivate_to_head + s->deactivate_to_tail;
if (total)
- printf("Deactivate Full=%lu(%lu%%) Empty=%lu(%lu%%) "
+ printf("Deactivate: Full=%lu(%lu%%) Empty=%lu(%lu%%) "
"ToHead=%lu(%lu%%) ToTail=%lu(%lu%%)\n",
s->deactivate_full, (s->deactivate_full * 100) / total,
s->deactivate_empty, (s->deactivate_empty * 100) / total,
s->deactivate_to_head, (s->deactivate_to_head * 100) / total,
s->deactivate_to_tail, (s->deactivate_to_tail * 100) / total);
+
+ if (s->shrink_calls)
+ printf("Shrink : Calls=%lu Attempts=%lu Empty=%lu Successful=%lu\n",
+ s->shrink_calls, s->shrink_attempt_defrag,
+ s->shrink_empty_slab, s->shrink_slab_reclaimed);
+ if (s->shrink_slab_skipped || s->shrink_object_reclaim_failed)
+ printf("Defrag : Slabs skipped=%lu Object reclaim failed=%lu\n",
+ s->shrink_slab_skipped, s->shrink_object_reclaim_failed);
}
static void report(struct slabinfo *s)
@@ -598,7 +611,12 @@ static void slabcache(struct slabinfo *s
total_alloc ? (s->alloc_fastpath * 100 / total_alloc) : 0,
total_free ? (s->free_fastpath * 100 / total_free) : 0,
s->order_fallback, s->order);
- }
+ } else
+ if (show_defragcount)
+ printf("%-21s %8ld %7d %7d %7d %7d %7d %7d\n",
+ s->name, s->objects, s->shrink_calls, s->shrink_attempt_defrag,
+ s->shrink_slab_reclaimed, s->shrink_empty_slab,
+ s->shrink_slab_skipped, s->shrink_object_reclaim_failed);
else
printf("%-21s %8ld %7d %8s %14s %4d %1d %3ld %3ld %s\n",
s->name, s->objects, s->object_size, size_str, dist_str,
@@ -1210,6 +1228,13 @@ static void read_slab_dir(void)
slab->deactivate_to_tail = get_obj("deactivate_to_tail");
slab->deactivate_remote_frees = get_obj("deactivate_remote_frees");
slab->order_fallback = get_obj("order_fallback");
+ slab->shrink_calls = get_obj("shrink_calls");
+ slab->shrink_attempt_defrag = get_obj("shrink_attempt_defrag");
+ slab->shrink_empty_slab = get_obj("shrink_empty_slab");
+ slab->shrink_slab_skipped = get_obj("shrink_slab_skipped");
+ slab->shrink_slab_reclaimed = get_obj("shrink_slab_reclaimed");
+ slab->shrink_object_reclaim_failed =
+ get_obj("shrink_object_reclaim_failed");
slab->defrag_ratio = get_obj("defrag_ratio");
slab->remote_node_defrag_ratio =
get_obj("remote_node_defrag_ratio");
@@ -1274,6 +1299,7 @@ struct option opts[] = {
{ "ctor", 0, NULL, 'C' },
{ "debug", 2, NULL, 'd' },
{ "display-activity", 0, NULL, 'D' },
+ { "display-defrag", 0, NULL, 'G' },
{ "empty", 0, NULL, 'e' },
{ "first-alias", 0, NULL, 'f' },
{ "defrag", 0, NULL, 'F' },
@@ -1299,7 +1325,7 @@ int main(int argc, char *argv[])
page_size = getpagesize();
- while ((c = getopt_long(argc, argv, "aACd::DefFhil1noprstvzTS",
+ while ((c = getopt_long(argc, argv, "aACd::DefFGhil1noprstvzTS",
opts, NULL)) != -1)
switch (c) {
case '1':
@@ -1325,6 +1351,9 @@ int main(int argc, char *argv[])
case 'f':
show_first_alias = 1;
break;
+ case 'G':
+ show_defragcount = 1;
+ break;
case 'h':
usage();
return 0;
Index: slab-2.6/include/linux/slub_def.h
===================================================================
--- slab-2.6.orig/include/linux/slub_def.h 2010-01-22 15:41:35.000000000 -0600
+++ slab-2.6/include/linux/slub_def.h 2010-01-22 15:53:21.000000000 -0600
@@ -32,6 +32,12 @@ enum stat_item {
DEACTIVATE_TO_TAIL, /* Cpu slab was moved to the tail of partials */
DEACTIVATE_REMOTE_FREES,/* Slab contained remotely freed objects */
ORDER_FALLBACK, /* Number of times fallback was necessary */
+ SHRINK_CALLS, /* Number of invocations of kmem_cache_shrink */
+ SHRINK_ATTEMPT_DEFRAG, /* Slabs that were attempted to be reclaimed */
+ SHRINK_EMPTY_SLAB, /* Shrink encountered and freed empty slab */
+ SHRINK_SLAB_SKIPPED, /* Slab reclaim skipped a slab (busy etc) */
+ SHRINK_SLAB_RECLAIMED, /* Successfully reclaimed slabs */
+ SHRINK_OBJECT_RECLAIM_FAILED, /* Callbacks signaled busy objects */
NR_SLUB_STAT_ITEMS };
struct kmem_cache_cpu {
Index: slab-2.6/mm/slub.c
===================================================================
--- slab-2.6.orig/mm/slub.c 2010-01-22 15:51:32.000000000 -0600
+++ slab-2.6/mm/slub.c 2010-01-22 15:53:21.000000000 -0600
@@ -2906,6 +2906,7 @@ static int kmem_cache_vacate(struct page
void *private;
unsigned long flags;
unsigned long objects;
+ struct kmem_cache_cpu *c;
local_irq_save(flags);
slab_lock(page);
@@ -2954,9 +2955,13 @@ out:
* Check the result and unfreeze the slab
*/
leftover = page->inuse;
- if (leftover)
+ c = get_cpu_slab(s, smp_processor_id());
+ if (leftover) {
/* Unsuccessful reclaim. Avoid future reclaim attempts. */
+ stat(c, SHRINK_OBJECT_RECLAIM_FAILED);
__ClearPageSlubKickable(page);
+ } else
+ stat(c, SHRINK_SLAB_RECLAIMED);
unfreeze_slab(s, page, leftover > 0);
local_irq_restore(flags);
return leftover;
@@ -3007,11 +3012,14 @@ static unsigned long __kmem_cache_shrink
LIST_HEAD(zaplist);
int freed = 0;
struct kmem_cache_node *n = get_node(s, node);
+ struct kmem_cache_cpu *c;
if (n->nr_partial <= limit)
return 0;
spin_lock_irqsave(&n->list_lock, flags);
+ c = get_cpu_slab(s, smp_processor_id());
+ stat(c, SHRINK_CALLS);
list_for_each_entry_safe(page, page2, &n->partial, lru) {
if (!slab_trylock(page))
/* Busy slab. Get out of the way */
@@ -3031,12 +3039,14 @@ static unsigned long __kmem_cache_shrink
list_move(&page->lru, &zaplist);
if (s->kick) {
+ stat(c, SHRINK_ATTEMPT_DEFRAG);
n->nr_partial--;
__SetPageSlubFrozen(page);
}
slab_unlock(page);
} else {
/* Empty slab page */
+ stat(c, SHRINK_EMPTY_SLAB);
list_del(&page->lru);
n->nr_partial--;
slab_unlock(page);
@@ -4503,6 +4513,12 @@ STAT_ATTR(DEACTIVATE_TO_HEAD, deactivate
STAT_ATTR(DEACTIVATE_TO_TAIL, deactivate_to_tail);
STAT_ATTR(DEACTIVATE_REMOTE_FREES, deactivate_remote_frees);
STAT_ATTR(ORDER_FALLBACK, order_fallback);
+STAT_ATTR(SHRINK_CALLS, shrink_calls);
+STAT_ATTR(SHRINK_ATTEMPT_DEFRAG, shrink_attempt_defrag);
+STAT_ATTR(SHRINK_EMPTY_SLAB, shrink_empty_slab);
+STAT_ATTR(SHRINK_SLAB_SKIPPED, shrink_slab_skipped);
+STAT_ATTR(SHRINK_SLAB_RECLAIMED, shrink_slab_reclaimed);
+STAT_ATTR(SHRINK_OBJECT_RECLAIM_FAILED, shrink_object_reclaim_failed);
#endif
static struct attribute *slab_attrs[] = {
@@ -4558,6 +4574,12 @@ static struct attribute *slab_attrs[] =
&deactivate_to_tail_attr.attr,
&deactivate_remote_frees_attr.attr,
&order_fallback_attr.attr,
+ &shrink_calls_attr.attr,
+ &shrink_attempt_defrag_attr.attr,
+ &shrink_empty_slab_attr.attr,
+ &shrink_slab_skipped_attr.attr,
+ &shrink_slab_reclaimed_attr.attr,
+ &shrink_object_reclaim_failed_attr.attr,
#endif
NULL
};
--
* slub: Trigger defragmentation from memory reclaim
2010-01-29 20:49 Slab Fragmentation Reduction V15 Christoph Lameter
` (7 preceding siblings ...)
2010-01-29 20:49 ` slub/slabinfo: add defrag statistics Christoph Lameter
@ 2010-01-29 20:49 ` Christoph Lameter
2010-01-29 20:49 ` buffer heads: Support slab defrag Christoph Lameter
` (10 subsequent siblings)
19 siblings, 0 replies; 56+ messages in thread
From: Christoph Lameter @ 2010-01-29 20:49 UTC (permalink / raw)
To: Andi Kleen
Cc: Dave Chinner, Christoph Lameter, Pekka Enberg, Rik van Riel, akpm,
Miklos Szeredi, Nick Piggin, Hugh Dickins, linux-kernel
[-- Attachment #1: slub_vmscan_trigger --]
[-- Type: text/plain, Size: 9408 bytes --]
This patch triggers slab defragmentation from memory reclaim. The logical
point for this is after slab shrinking has been performed in vmscan.c. At that
point the fragmentation of slab pages may have increased because objects were
freed via the LRU lists maintained for various slab caches.
So we call kmem_cache_defrag() from there.
shrink_slab() is called in some contexts to do global shrinking
of slabs and in others to do shrinking for a particular zone. Pass the zone to
shrink_slab(), so that it can call kmem_cache_defrag() and restrict
the defragmentation to the node that is under memory pressure.
The callback frequency into slab reclaim can be controlled by a new field
/proc/sys/vm/slab_defrag_limit.
Reviewed-by: Rik van Riel <riel@redhat.com>
Signed-off-by: Christoph Lameter <clameter@sgi.com>
Signed-off-by: Pekka Enberg <penberg@cs.helsinki.fi>
Signed-off-by: Christoph Lameter <cl@linux-foundation.org>
---
Documentation/sysctl/vm.txt | 10 +++++++
fs/drop_caches.c | 2 -
include/linux/mm.h | 3 --
include/linux/mmzone.h | 1
include/linux/swap.h | 3 ++
kernel/sysctl.c | 20 +++++++++++++++
mm/vmscan.c | 58 ++++++++++++++++++++++++++++++++++++++++----
7 files changed, 90 insertions(+), 7 deletions(-)
Index: linux-2.6/fs/drop_caches.c
===================================================================
--- linux-2.6.orig/fs/drop_caches.c 2009-11-13 09:34:25.000000000 -0600
+++ linux-2.6/fs/drop_caches.c 2010-01-29 10:27:32.000000000 -0600
@@ -58,7 +58,7 @@ static void drop_slab(void)
int nr_objects;
do {
- nr_objects = shrink_slab(1000, GFP_KERNEL, 1000);
+ nr_objects = shrink_slab(1000, GFP_KERNEL, 1000, NULL);
} while (nr_objects > 10);
}
Index: linux-2.6/include/linux/mm.h
===================================================================
--- linux-2.6.orig/include/linux/mm.h 2010-01-20 11:39:58.000000000 -0600
+++ linux-2.6/include/linux/mm.h 2010-01-29 10:27:32.000000000 -0600
@@ -1308,8 +1308,7 @@ int in_gate_area_no_task(unsigned long a
int drop_caches_sysctl_handler(struct ctl_table *, int,
void __user *, size_t *, loff_t *);
unsigned long shrink_slab(unsigned long scanned, gfp_t gfp_mask,
- unsigned long lru_pages);
-
+ unsigned long lru_pages, struct zone *z);
#ifndef CONFIG_MMU
#define randomize_va_space 0
#else
Index: linux-2.6/mm/vmscan.c
===================================================================
--- linux-2.6.orig/mm/vmscan.c 2010-01-19 12:38:15.000000000 -0600
+++ linux-2.6/mm/vmscan.c 2010-01-29 10:27:32.000000000 -0600
@@ -181,6 +181,14 @@ void unregister_shrinker(struct shrinker
EXPORT_SYMBOL(unregister_shrinker);
#define SHRINK_BATCH 128
+
+/*
+ * Trigger a call into slab defrag if the sum of the returns from
+ * shrinkers cross this value.
+ */
+int slab_defrag_limit = 1000;
+int slab_defrag_counter;
+
/*
* Call the shrink functions to age shrinkable caches
*
@@ -198,10 +206,18 @@ EXPORT_SYMBOL(unregister_shrinker);
* are eligible for the caller's allocation attempt. It is used for balancing
* slab reclaim versus page reclaim.
*
+ * zone is the zone for which we are shrinking the slabs. If the intent
+ * is to do a global shrink then zone may be NULL. Specification of a
+ * zone is currently only used to limit slab defragmentation to a NUMA node.
+ * The performance of shrink_slab would be better (in particular under NUMA)
+ * if it could be targeted as a whole to the zone that is under memory
+ * pressure but the VFS infrastructure does not allow that at the present
+ * time.
+ *
* Returns the number of slab objects which we shrunk.
*/
unsigned long shrink_slab(unsigned long scanned, gfp_t gfp_mask,
- unsigned long lru_pages)
+ unsigned long lru_pages, struct zone *zone)
{
struct shrinker *shrinker;
unsigned long ret = 0;
@@ -259,6 +275,39 @@ unsigned long shrink_slab(unsigned long
shrinker->nr += total_scan;
}
up_read(&shrinker_rwsem);
+
+
+ /* Avoid dirtying cachelines */
+ if (!ret)
+ return 0;
+
+ /*
+ * "ret" doesn't really contain the freed object count. The shrinkers
+ * fake it. Gotta go with what we are getting though.
+ *
+ * Handling of the defrag_counter is also racy. If we get the
+ * wrong counts then we may unnecessarily do a defrag pass or defer
+ * one. "ret" is already faked. So this is just increasing
+ * the already existing fuzziness to get some notion as to when
+ * to initiate slab defrag which will hopefully be okay.
+ */
+ if (zone) {
+ /* balance_pgdat running on a zone so we only scan one node */
+ zone->slab_defrag_counter += ret;
+ if (zone->slab_defrag_counter > slab_defrag_limit &&
+ (gfp_mask & __GFP_FS)) {
+ zone->slab_defrag_counter = 0;
+ kmem_cache_defrag(zone_to_nid(zone));
+ }
+ } else {
+ /* Direct (and thus global) reclaim. Scan all nodes */
+ slab_defrag_counter += ret;
+ if (slab_defrag_counter > slab_defrag_limit &&
+ (gfp_mask & __GFP_FS)) {
+ slab_defrag_counter = 0;
+ kmem_cache_defrag(-1);
+ }
+ }
return ret;
}
@@ -1768,7 +1817,7 @@ static unsigned long do_try_to_free_page
* over limit cgroups
*/
if (scanning_global_lru(sc)) {
- shrink_slab(sc->nr_scanned, sc->gfp_mask, lru_pages);
+ shrink_slab(sc->nr_scanned, sc->gfp_mask, lru_pages, NULL);
if (reclaim_state) {
sc->nr_reclaimed += reclaim_state->reclaimed_slab;
reclaim_state->reclaimed_slab = 0;
@@ -2084,7 +2133,7 @@ loop_again:
shrink_zone(priority, zone, &sc);
reclaim_state->reclaimed_slab = 0;
nr_slab = shrink_slab(sc.nr_scanned, GFP_KERNEL,
- lru_pages);
+ lru_pages, zone);
sc.nr_reclaimed += reclaim_state->reclaimed_slab;
total_scanned += sc.nr_scanned;
if (zone_is_all_unreclaimable(zone))
@@ -2578,7 +2627,8 @@ static int __zone_reclaim(struct zone *z
* Note that shrink_slab will free memory on all zones and may
* take a long time.
*/
- while (shrink_slab(sc.nr_scanned, gfp_mask, order) &&
+ while (shrink_slab(sc.nr_scanned, gfp_mask, order,
+ zone) &&
zone_page_state(zone, NR_SLAB_RECLAIMABLE) >
slab_reclaimable - nr_pages)
;
Index: linux-2.6/include/linux/mmzone.h
===================================================================
--- linux-2.6.orig/include/linux/mmzone.h 2010-01-20 11:39:58.000000000 -0600
+++ linux-2.6/include/linux/mmzone.h 2010-01-29 10:27:32.000000000 -0600
@@ -340,6 +340,7 @@ struct zone {
struct zone_reclaim_stat reclaim_stat;
unsigned long pages_scanned; /* since last reclaim */
+ unsigned long slab_defrag_counter; /* since last defrag */
unsigned long flags; /* zone flags, see below */
/* Zone statistics */
Index: linux-2.6/include/linux/swap.h
===================================================================
--- linux-2.6.orig/include/linux/swap.h 2009-12-18 13:13:24.000000000 -0600
+++ linux-2.6/include/linux/swap.h 2010-01-29 10:27:32.000000000 -0600
@@ -252,6 +252,9 @@ extern unsigned long mem_cgroup_shrink_n
extern int __isolate_lru_page(struct page *page, int mode, int file);
extern unsigned long shrink_all_memory(unsigned long nr_pages);
extern int vm_swappiness;
+extern int slab_defrag_limit;
+extern int slab_defrag_counter;
+
extern int remove_mapping(struct address_space *mapping, struct page *page);
extern long vm_total_pages;
Index: linux-2.6/kernel/sysctl.c
===================================================================
--- linux-2.6.orig/kernel/sysctl.c 2009-12-18 13:13:24.000000000 -0600
+++ linux-2.6/kernel/sysctl.c 2010-01-29 10:27:32.000000000 -0600
@@ -1167,6 +1167,26 @@ static struct ctl_table vm_table[] = {
.proc_handler = proc_dointvec,
.extra1 = &zero,
},
+ {
+ .ctl_name = CTL_UNNUMBERED,
+ .procname = "slab_defrag_limit",
+ .data = &slab_defrag_limit,
+ .maxlen = sizeof(slab_defrag_limit),
+ .mode = 0644,
+ .proc_handler = &proc_dointvec,
+ .strategy = &sysctl_intvec,
+ .extra1 = &one_hundred,
+ },
+ {
+ .ctl_name = CTL_UNNUMBERED,
+ .procname = "slab_defrag_count",
+ .data = &slab_defrag_counter,
+ .maxlen = sizeof(slab_defrag_counter),
+ .mode = 0444,
+ .proc_handler = &proc_dointvec,
+ .strategy = &sysctl_intvec,
+ .extra1 = &zero,
+ },
#ifdef HAVE_ARCH_PICK_MMAP_LAYOUT
{
.procname = "legacy_va_layout",
Index: linux-2.6/Documentation/sysctl/vm.txt
===================================================================
--- linux-2.6.orig/Documentation/sysctl/vm.txt 2009-12-10 12:18:32.000000000 -0600
+++ linux-2.6/Documentation/sysctl/vm.txt 2010-01-29 10:27:32.000000000 -0600
@@ -50,6 +50,7 @@ Currently, these files are in /proc/sys/
- page-cluster
- panic_on_oom
- percpu_pagelist_fraction
+- slab_defrag_limit
- stat_interval
- swappiness
- vfs_cache_pressure
@@ -597,6 +598,15 @@ The initial value is zero. Kernel does
the high water marks for each per cpu page list.
==============================================================
+slab_defrag_limit
+
+Determines the frequency of calls from reclaim into slab defragmentation.
+Slab defrag reclaims objects from sparsely populated slab pages.
+The default is 1000. Increase if slab defragmentation occurs
+too frequently. Decrease if more slab defragmentation passes
+are needed. The slabinfo tool can report on the frequency of the callbacks.
+
+==============================================================
stat_interval
--
* buffer heads: Support slab defrag
2010-01-29 20:49 Slab Fragmentation Reduction V15 Christoph Lameter
` (8 preceding siblings ...)
2010-01-29 20:49 ` slub: Trigger defragmentation from memory reclaim Christoph Lameter
@ 2010-01-29 20:49 ` Christoph Lameter
2010-01-30 1:59 ` Dave Chinner
2010-02-01 6:39 ` Nick Piggin
2010-01-29 20:49 ` inodes: Support generic defragmentation Christoph Lameter
` (9 subsequent siblings)
19 siblings, 2 replies; 56+ messages in thread
From: Christoph Lameter @ 2010-01-29 20:49 UTC (permalink / raw)
To: Andi Kleen
Cc: Dave Chinner, Christoph Lameter, Rik van Riel, Pekka Enberg, akpm,
Miklos Szeredi, Nick Piggin, Hugh Dickins, linux-kernel
[-- Attachment #1: defrag_buffer_head --]
[-- Type: text/plain, Size: 3281 bytes --]
Defragmentation support for buffer heads. We convert the references to
buffers into struct page references and try to remove the buffers from
those pages. If the pages are dirty, we trigger writeout so that the
buffer heads can be removed later.
Reviewed-by: Rik van Riel <riel@redhat.com>
Signed-off-by: Christoph Lameter <clameter@sgi.com>
Signed-off-by: Christoph Lameter <cl@linux-foundation.org>
---
fs/buffer.c | 99 ++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
1 file changed, 99 insertions(+)
Index: slab-2.6/fs/buffer.c
===================================================================
--- slab-2.6.orig/fs/buffer.c 2010-01-22 15:09:43.000000000 -0600
+++ slab-2.6/fs/buffer.c 2010-01-22 16:17:27.000000000 -0600
@@ -3352,6 +3352,104 @@ int bh_submit_read(struct buffer_head *b
}
EXPORT_SYMBOL(bh_submit_read);
+/*
+ * Writeback a page to clean the dirty state
+ */
+static void trigger_write(struct page *page)
+{
+ struct address_space *mapping = page_mapping(page);
+ int rc;
+ struct writeback_control wbc = {
+ .sync_mode = WB_SYNC_NONE,
+ .nr_to_write = 1,
+ .range_start = 0,
+ .range_end = LLONG_MAX,
+ .nonblocking = 1,
+ .for_reclaim = 0
+ };
+
+ if (!mapping->a_ops->writepage)
+ /* No write method for the address space */
+ return;
+
+ if (!clear_page_dirty_for_io(page))
+ /* Someone else already triggered a write */
+ return;
+
+ rc = mapping->a_ops->writepage(page, &wbc);
+ if (rc < 0)
+ /* I/O Error writing */
+ return;
+
+ if (rc == AOP_WRITEPAGE_ACTIVATE)
+ unlock_page(page);
+}
+
+/*
+ * Get references on buffers.
+ *
+ * We obtain references on the page that uses the buffer. v[i] will point to
+ * the corresponding page after get_buffers() is through.
+ *
+ * We are safe from the underlying page being removed simply by doing
+ * a get_page_unless_zero. The buffer head removal may race at will.
+ * try_to_free_buffers will later take appropriate locks to remove the
+ * buffers if they are still there.
+ */
+static void *get_buffers(struct kmem_cache *s, int nr, void **v)
+{
+ struct page *page;
+ struct buffer_head *bh;
+ int i, j;
+ int n = 0;
+
+ for (i = 0; i < nr; i++) {
+ bh = v[i];
+ v[i] = NULL;
+
+ page = bh->b_page;
+
+ if (page && PagePrivate(page)) {
+ for (j = 0; j < n; j++)
+ if (page == v[j])
+ continue;
+ }
+
+ if (get_page_unless_zero(page))
+ v[n++] = page;
+ }
+ return NULL;
+}
+
+/*
+ * Despite its name: kick_buffers operates on a list of pointers to
+ * page structs that was set up by get_buffers().
+ */
+static void kick_buffers(struct kmem_cache *s, int nr, void **v,
+ void *private)
+{
+ struct page *page;
+ int i;
+
+ for (i = 0; i < nr; i++) {
+ page = v[i];
+
+ if (!page || PageWriteback(page))
+ continue;
+
+ if (trylock_page(page)) {
+ if (PageDirty(page))
+ trigger_write(page);
+ else {
+ if (PagePrivate(page))
+ try_to_free_buffers(page);
+ unlock_page(page);
+ }
+ }
+ put_page(page);
+ }
+}
+
static void
init_buffer_head(void *data)
{
@@ -3370,6 +3468,7 @@ void __init buffer_init(void)
(SLAB_RECLAIM_ACCOUNT|SLAB_PANIC|
SLAB_MEM_SPREAD),
init_buffer_head);
+ kmem_cache_setup_defrag(bh_cachep, get_buffers, kick_buffers);
/*
* Limit the bh occupancy to 10% of ZONE_NORMAL
--
^ permalink raw reply [flat|nested] 56+ messages in thread
* Re: buffer heads: Support slab defrag
2010-01-29 20:49 ` buffer heads: Support slab defrag Christoph Lameter
@ 2010-01-30 1:59 ` Dave Chinner
2010-02-01 6:39 ` Nick Piggin
1 sibling, 0 replies; 56+ messages in thread
From: Dave Chinner @ 2010-01-30 1:59 UTC (permalink / raw)
To: Christoph Lameter
Cc: Andi Kleen, Christoph Lameter, Rik van Riel, Pekka Enberg, akpm,
Miklos Szeredi, Nick Piggin, Hugh Dickins, linux-kernel
On Fri, Jan 29, 2010 at 02:49:41PM -0600, Christoph Lameter wrote:
> Defragmentation support for buffer heads. We convert the references to
> buffers to struct page references and try to remove the buffers from
> those pages. If the pages are dirty then trigger writeout so that the
> buffer heads can be removed later.
NACK.
We don't want another random single page writeback trigger into
the VM - it will only slow down cleaning of dirty pages by causing
disk thrashing (i.e. turns writeback into small random write
workload), and that will ultimately slow down the rate at which we can
reclaim buffer heads.
Hence I suggest that if the buffer head is dirty, then just ignore
it - it'll be cleaned soon enough by one of the other mechanisms we
have and then it can be reclaimed in a later pass.
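Concretely, skipping dirty pages would reduce kick_buffers() to something like the following minimal sketch (not code from this thread; it assumes the same helpers the posted patch uses):

```c
static void kick_buffers(struct kmem_cache *s, int nr, void **v,
							void *private)
{
	struct page *page;
	int i;

	for (i = 0; i < nr; i++) {
		page = v[i];
		if (!page)
			continue;

		/*
		 * Leave dirty and in-flight pages alone; regular
		 * writeback will clean them and a later defrag pass
		 * can then reclaim the buffer heads.
		 */
		if (!PageDirty(page) && !PageWriteback(page) &&
		    trylock_page(page)) {
			/* Re-check dirty state under the page lock */
			if (!PageDirty(page) && PagePrivate(page))
				try_to_free_buffers(page);
			unlock_page(page);
		}
		put_page(page);
	}
}
```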
Cheers,
Dave.
--
Dave Chinner
david@fromorbit.com
^ permalink raw reply [flat|nested] 56+ messages in thread
* Re: buffer heads: Support slab defrag
2010-01-29 20:49 ` buffer heads: Support slab defrag Christoph Lameter
2010-01-30 1:59 ` Dave Chinner
@ 2010-02-01 6:39 ` Nick Piggin
1 sibling, 0 replies; 56+ messages in thread
From: Nick Piggin @ 2010-02-01 6:39 UTC (permalink / raw)
To: Christoph Lameter
Cc: Andi Kleen, Dave Chinner, Christoph Lameter, Rik van Riel,
Pekka Enberg, akpm, Miklos Szeredi, Nick Piggin, Hugh Dickins,
linux-kernel
On Fri, Jan 29, 2010 at 02:49:41PM -0600, Christoph Lameter wrote:
> Defragmentation support for buffer heads. We convert the references to
> buffers to struct page references and try to remove the buffers from
> those pages. If the pages are dirty then trigger writeout so that the
> buffer heads can be removed later.
>
> Reviewed-by: Rik van Riel <riel@redhat.com>
> Signed-off-by: Christoph Lameter <clameter@sgi.com>
> Signed-off-by: Christoph Lameter <cl@linux-foundation.org>
>
> ---
> fs/buffer.c | 99 ++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
> 1 file changed, 99 insertions(+)
>
> Index: slab-2.6/fs/buffer.c
> ===================================================================
> --- slab-2.6.orig/fs/buffer.c 2010-01-22 15:09:43.000000000 -0600
> +++ slab-2.6/fs/buffer.c 2010-01-22 16:17:27.000000000 -0600
> @@ -3352,6 +3352,104 @@ int bh_submit_read(struct buffer_head *b
> }
> EXPORT_SYMBOL(bh_submit_read);
>
> +/*
> + * Writeback a page to clean the dirty state
> + */
> +static void trigger_write(struct page *page)
> +{
> + struct address_space *mapping = page_mapping(page);
> + int rc;
> + struct writeback_control wbc = {
> + .sync_mode = WB_SYNC_NONE,
> + .nr_to_write = 1,
> + .range_start = 0,
> + .range_end = LLONG_MAX,
> + .nonblocking = 1,
> + .for_reclaim = 0
> + };
> +
> + if (!mapping->a_ops->writepage)
> + /* No write method for the address space */
> + return;
> +
> + if (!clear_page_dirty_for_io(page))
> + /* Someone else already triggered a write */
> + return;
> +
> + rc = mapping->a_ops->writepage(page, &wbc);
> + if (rc < 0)
> + /* I/O Error writing */
> + return;
> +
> + if (rc == AOP_WRITEPAGE_ACTIVATE)
> + unlock_page(page);
> +}
> +
> +/*
> + * Get references on buffers.
> + *
> + * We obtain references on the page that uses the buffer. v[i] will point to
> + * the corresponding page after get_buffers() is through.
> + *
> + * We are safe from the underlying page being removed simply by doing
> + * a get_page_unless_zero. The buffer head removal may race at will.
> + * try_to_free_buffers will later take appropriate locks to remove the
> + * buffers if they are still there.
> + */
> +static void *get_buffers(struct kmem_cache *s, int nr, void **v)
> +{
> + struct page *page;
> + struct buffer_head *bh;
> + int i, j;
> + int n = 0;
> +
> + for (i = 0; i < nr; i++) {
> + bh = v[i];
> + v[i] = NULL;
> +
> + page = bh->b_page;
> +
> + if (page && PagePrivate(page)) {
> + for (j = 0; j < n; j++)
> + if (page == v[j])
> + continue;
> + }
> +
> + if (get_page_unless_zero(page))
> + v[n++] = page;
This seems wrong to me. The page can have been reused at this
stage.
You technically can't re-check using page->private because that
can be anything and doesn't actually need to be a pointer. You
could re-check bh->b_page, provided that you ensure it is always
cleared before a page is detached, and the correct barriers are
in place.
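The re-check Nick suggests might be sketched as follows, assuming the detach path clears bh->b_page with a matching write barrier before the page can be freed (this is not part of the posted patch):

```c
		if (get_page_unless_zero(page)) {
			/*
			 * With a reference held, re-check that the buffer
			 * head still belongs to this page. Pairs with an
			 * smp_wmb() in the (hypothetical) detach path that
			 * clears bh->b_page before the page is released.
			 */
			smp_rmb();
			if (bh->b_page != page) {
				/* Page was reused; drop our reference */
				put_page(page);
				continue;
			}
			v[n++] = page;
		}
```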
> + }
> + return NULL;
> +}
> +
> +/*
> + * Despite its name: kick_buffers operates on a list of pointers to
> + * page structs that was set up by get_buffers().
> + */
> +static void kick_buffers(struct kmem_cache *s, int nr, void **v,
> + void *private)
> +{
> + struct page *page;
> + int i;
> +
> + for (i = 0; i < nr; i++) {
> + page = v[i];
> +
> + if (!page || PageWriteback(page))
> + continue;
> +
> + if (trylock_page(page)) {
> + if (PageDirty(page))
> + trigger_write(page);
> + else {
> + if (PagePrivate(page))
> + try_to_free_buffers(page);
> + unlock_page(page);
PagePrivate doesn't necessarily mean the page has buffers.
try_to_release_page() would be a better idea here.
> + }
> + }
> + put_page(page);
> + }
> +}
> +
> static void
> init_buffer_head(void *data)
> {
> @@ -3370,6 +3468,7 @@ void __init buffer_init(void)
> (SLAB_RECLAIM_ACCOUNT|SLAB_PANIC|
> SLAB_MEM_SPREAD),
> init_buffer_head);
> + kmem_cache_setup_defrag(bh_cachep, get_buffers, kick_buffers);
>
> /*
> * Limit the bh occupancy to 10% of ZONE_NORMAL
Buffer heads and buffer head refcounting really stink badly, although
I can see the need for a medium-term solution until fsblock or some
actually sane refcounting arrives.
^ permalink raw reply [flat|nested] 56+ messages in thread
* inodes: Support generic defragmentation
2010-01-29 20:49 Slab Fragmentation Reduction V15 Christoph Lameter
` (9 preceding siblings ...)
2010-01-29 20:49 ` buffer heads: Support slab defrag Christoph Lameter
@ 2010-01-29 20:49 ` Christoph Lameter
2010-01-30 2:43 ` Dave Chinner
2010-01-30 19:26 ` tytso
2010-01-29 20:49 ` Filesystem: Ext2 filesystem defrag Christoph Lameter
` (8 subsequent siblings)
19 siblings, 2 replies; 56+ messages in thread
From: Christoph Lameter @ 2010-01-29 20:49 UTC (permalink / raw)
To: Andi Kleen
Cc: Dave Chinner, Miklos Szeredi, Alexander Viro, Christoph Hellwig,
Christoph Lameter, Rik van Riel, Pekka Enberg, akpm, Nick Piggin,
Hugh Dickins, linux-kernel
[-- Attachment #1: defrag_fs_generic --]
[-- Type: text/plain, Size: 5583 bytes --]
This implements the ability to remove inodes in a particular slab
from inode caches. In order to remove an inode we may have to write out
the pages of an inode, the inode itself and remove the dentries referring
to the inode.
Provide generic functionality that can be used by filesystems that have
their own inode caches to also tie into the defragmentation functions
that are made available here.
FIXES NEEDED!
Note Miklos comments on the patch at http://lkml.indiana.edu/hypermail/linux/kernel/0810.1/2003.html
The way we obtain a reference to an inode entry may be unreliable since inode
refcounting works in different ways. Also a reference to the superblock is necessary
in order to be able to operate on the inodes.
Cc: Miklos Szeredi <miklos@szeredi.hu>
Cc: Alexander Viro <viro@ftp.linux.org.uk>
Cc: Christoph Hellwig <hch@infradead.org>
Reviewed-by: Rik van Riel <riel@redhat.com>
Signed-off-by: Christoph Lameter <clameter@sgi.com>
Signed-off-by: Christoph Lameter <cl@linux-foundation.org>
---
fs/inode.c | 123 +++++++++++++++++++++++++++++++++++++++++++++++++++++
include/linux/fs.h | 6 ++
2 files changed, 129 insertions(+)
Index: linux-2.6/fs/inode.c
===================================================================
--- linux-2.6.orig/fs/inode.c 2010-01-29 12:03:04.000000000 -0600
+++ linux-2.6/fs/inode.c 2010-01-29 12:03:25.000000000 -0600
@@ -1538,6 +1538,128 @@ static int __init set_ihash_entries(char
__setup("ihash_entries=", set_ihash_entries);
/*
+ * Obtain a refcount on a list of struct inodes pointed to by v. If the
+ * inode is in the process of being freed then zap the v[] entry so that
+ * we skip the freeing attempts later.
+ *
+ * This is a generic function for the ->get slab defrag callback.
+ */
+void *get_inodes(struct kmem_cache *s, int nr, void **v)
+{
+ int i;
+
+ spin_lock(&inode_lock);
+ for (i = 0; i < nr; i++) {
+ struct inode *inode = v[i];
+
+ if (inode->i_state & (I_FREEING|I_CLEAR|I_WILL_FREE))
+ v[i] = NULL;
+ else
+ __iget(inode);
+ }
+ spin_unlock(&inode_lock);
+ return NULL;
+}
+EXPORT_SYMBOL(get_inodes);
+
+/*
+ * Function for filesystems that embed struct inode into their own
+ * fs inode. The offset is the offset of the struct inode in the fs inode.
+ *
+ * The function adds to the pointers in v[] in order to make them point to
+ * struct inode. Then get_inodes() is used to get the refcount.
+ * The converted v[] pointers can then also be passed to the kick() callback
+ * without further processing.
+ */
+void *fs_get_inodes(struct kmem_cache *s, int nr, void **v,
+ unsigned long offset)
+{
+ int i;
+
+ for (i = 0; i < nr; i++)
+ v[i] += offset;
+
+ return get_inodes(s, nr, v);
+}
+EXPORT_SYMBOL(fs_get_inodes);
+
+/*
+ * Generic callback function slab defrag ->kick methods. Takes the
+ * array with inodes where we obtained refcounts using fs_get_inodes()
+ * or get_inodes() and tries to free them.
+ */
+void kick_inodes(struct kmem_cache *s, int nr, void **v, void *private)
+{
+ struct inode *inode;
+ int i;
+ int abort = 0;
+ LIST_HEAD(freeable);
+ int active;
+
+ for (i = 0; i < nr; i++) {
+ inode = v[i];
+ if (!inode)
+ continue;
+
+ if (inode_has_buffers(inode) || inode->i_data.nrpages) {
+ if (remove_inode_buffers(inode))
+ /*
+ * Should we really be doing this? Or
+ * limit the writeback here to only a few pages?
+ *
+ * Possibly an expensive operation but we
+ * cannot reclaim the inode if the pages
+ * are still present.
+ */
+ invalidate_mapping_pages(&inode->i_data,
+ 0, -1);
+ }
+
+ /* Invalidate children and dentry */
+ if (S_ISDIR(inode->i_mode)) {
+ struct dentry *d = d_find_alias(inode);
+
+ if (d) {
+ d_invalidate(d);
+ dput(d);
+ }
+ }
+
+ if (inode->i_state & I_DIRTY)
+ write_inode_now(inode, 1);
+
+ d_prune_aliases(inode);
+ }
+
+ mutex_lock(&iprune_mutex);
+ for (i = 0; i < nr; i++) {
+ inode = v[i];
+
+ if (!inode)
+ /* inode is already being freed */
+ continue;
+
+ active = inode->i_sb->s_flags & MS_ACTIVE;
+ iput(inode);
+ if (abort || !active)
+ continue;
+
+ spin_lock(&inode_lock);
+ abort = !can_unuse(inode);
+
+ if (!abort) {
+ list_move(&inode->i_list, &freeable);
+ inode->i_state |= I_FREEING;
+ inodes_stat.nr_unused--;
+ }
+ spin_unlock(&inode_lock);
+ }
+ dispose_list(&freeable);
+ mutex_unlock(&iprune_mutex);
+}
+EXPORT_SYMBOL(kick_inodes);
+
+/*
* Initialize the waitqueues and inode hash table.
*/
void __init inode_init_early(void)
@@ -1576,6 +1698,7 @@ void __init inode_init(void)
SLAB_MEM_SPREAD),
init_once);
register_shrinker(&icache_shrinker);
+ kmem_cache_setup_defrag(inode_cachep, get_inodes, kick_inodes);
/* Hash may have been set up in inode_init_early */
if (!hashdist)
Index: linux-2.6/include/linux/fs.h
===================================================================
--- linux-2.6.orig/include/linux/fs.h 2010-01-29 12:03:04.000000000 -0600
+++ linux-2.6/include/linux/fs.h 2010-01-29 12:03:25.000000000 -0600
@@ -2466,5 +2466,11 @@ int __init get_filesystem_list(char *buf
#define ACC_MODE(x) ("\004\002\006\006"[(x)&O_ACCMODE])
#define OPEN_FMODE(flag) ((__force fmode_t)((flag + 1) & O_ACCMODE))
+/* Helper functions for inode defragmentation support in filesystems */
+extern void kick_inodes(struct kmem_cache *, int, void **, void *);
+extern void *get_inodes(struct kmem_cache *, int nr, void **);
+extern void *fs_get_inodes(struct kmem_cache *, int nr, void **,
+ unsigned long offset);
+
#endif /* __KERNEL__ */
#endif /* _LINUX_FS_H */
--
^ permalink raw reply [flat|nested] 56+ messages in thread
* Re: inodes: Support generic defragmentation
2010-01-29 20:49 ` inodes: Support generic defragmentation Christoph Lameter
@ 2010-01-30 2:43 ` Dave Chinner
2010-02-01 17:50 ` Christoph Lameter
2010-01-30 19:26 ` tytso
1 sibling, 1 reply; 56+ messages in thread
From: Dave Chinner @ 2010-01-30 2:43 UTC (permalink / raw)
To: Christoph Lameter
Cc: Andi Kleen, Miklos Szeredi, Alexander Viro, Christoph Hellwig,
Christoph Lameter, Rik van Riel, Pekka Enberg, akpm, Nick Piggin,
Hugh Dickins, linux-kernel
On Fri, Jan 29, 2010 at 02:49:42PM -0600, Christoph Lameter wrote:
> This implements the ability to remove inodes in a particular slab
> from inode caches. In order to remove an inode we may have to write out
> the pages of an inode, the inode itself and remove the dentries referring
> to the node.
>
> Provide generic functionality that can be used by filesystems that have
> their own inode caches to also tie into the defragmentation functions
> that are made available here.
>
> FIXES NEEDED!
>
> Note Miklos comments on the patch at http://lkml.indiana.edu/hypermail/linux/kernel/0810.1/2003.html
>
> The way we obtain a reference to an inode entry may be unreliable since inode
> refcounting works in different ways. Also a reference to the superblock is necessary
> in order to be able to operate on the inodes.
>
> Cc: Miklos Szeredi <miklos@szeredi.hu>
> Cc: Alexander Viro <viro@ftp.linux.org.uk>
> Cc: Christoph Hellwig <hch@infradead.org>
> Reviewed-by: Rik van Riel <riel@redhat.com>
> Signed-off-by: Christoph Lameter <clameter@sgi.com>
> Signed-off-by: Christoph Lameter <cl@linux-foundation.org>
>
> ---
> fs/inode.c | 123 +++++++++++++++++++++++++++++++++++++++++++++++++++++
> include/linux/fs.h | 6 ++
> 2 files changed, 129 insertions(+)
>
> Index: linux-2.6/fs/inode.c
> ===================================================================
> --- linux-2.6.orig/fs/inode.c 2010-01-29 12:03:04.000000000 -0600
> +++ linux-2.6/fs/inode.c 2010-01-29 12:03:25.000000000 -0600
> @@ -1538,6 +1538,128 @@ static int __init set_ihash_entries(char
> __setup("ihash_entries=", set_ihash_entries);
>
> /*
> + * Obtain a refcount on a list of struct inodes pointed to by v. If the
> + * inode is in the process of being freed then zap the v[] entry so that
> + * we skip the freeing attempts later.
> + *
> + * This is a generic function for the ->get slab defrag callback.
> + */
> +void *get_inodes(struct kmem_cache *s, int nr, void **v)
> +{
> + int i;
> +
> + spin_lock(&inode_lock);
> + for (i = 0; i < nr; i++) {
> + struct inode *inode = v[i];
> +
> + if (inode->i_state & (I_FREEING|I_CLEAR|I_WILL_FREE))
> + v[i] = NULL;
> + else
> + __iget(inode);
> + }
> + spin_unlock(&inode_lock);
> + return NULL;
> +}
> +EXPORT_SYMBOL(get_inodes);
How do you expect defrag to behave when the filesystem doesn't free
the inode immediately during dispose_list()? That is, the above code
only finds inodes that are still active at the VFS level but they
may still live for a significant period of time after the
dispose_list() call. This is a real issue now that XFS has combined
the VFS and XFS inodes into the same slab...
Cheers,
Dave.
--
Dave Chinner
david@fromorbit.com
^ permalink raw reply [flat|nested] 56+ messages in thread
* Re: inodes: Support generic defragmentation
2010-01-30 2:43 ` Dave Chinner
@ 2010-02-01 17:50 ` Christoph Lameter
0 siblings, 0 replies; 56+ messages in thread
From: Christoph Lameter @ 2010-02-01 17:50 UTC (permalink / raw)
To: Dave Chinner
Cc: Andi Kleen, Miklos Szeredi, Alexander Viro, Christoph Hellwig,
Christoph Lameter, Rik van Riel, Pekka Enberg, akpm, Nick Piggin,
Hugh Dickins, linux-kernel
On Sat, 30 Jan 2010, Dave Chinner wrote:
> How do you expect defrag to behave when the filesystem doesn't free
> the inode immediately during dispose_list()? That is, the above code
> only finds inodes that are still active at the VFS level but they
> may still live for a significant period of time after the
> dispose_list() call. This is a real issue now that XFS has combined
> the VFS and XFS inodes into the same slab...
Then the freeing of the slab has to be delayed until the objects are
freed.
^ permalink raw reply [flat|nested] 56+ messages in thread
* Re: inodes: Support generic defragmentation
2010-01-29 20:49 ` inodes: Support generic defragmentation Christoph Lameter
2010-01-30 2:43 ` Dave Chinner
@ 2010-01-30 19:26 ` tytso
2010-01-31 8:34 ` Andi Kleen
1 sibling, 1 reply; 56+ messages in thread
From: tytso @ 2010-01-30 19:26 UTC (permalink / raw)
To: Christoph Lameter
Cc: Andi Kleen, Dave Chinner, Miklos Szeredi, Alexander Viro,
Christoph Hellwig, Christoph Lameter, Rik van Riel, Pekka Enberg,
akpm, Nick Piggin, Hugh Dickins, linux-kernel
On Fri, Jan 29, 2010 at 02:49:42PM -0600, Christoph Lameter wrote:
> This implements the ability to remove inodes in a particular slab
> from inode caches. In order to remove an inode we may have to write out
> the pages of an inode, the inode itself and remove the dentries referring
> to the node.
How often is this going to happen? Removing an inode is an incredibly
expensive operation. We have to eject all of the pages from the page
cache, and if those pages are getting a huge amount of use --- say,
those corresponding to some shared library like libc --- or which have
a huge number of pages that are actively getting used, the thrashing
that is going to result is going to be enormous.
There needs to be some kind of cost/benefit analysis done about
whether or not this is worth it. Does it make sense to potentially
force hundreds and hundreds of megabytes of pages to get thrashed in
and out just to recover a single 4k page? In some cases, maybe yes.
But in other cases, the results could be disastrous.
> +/*
> + * Generic callback function slab defrag ->kick methods. Takes the
> + * array with inodes where we obtained refcounts using fs_get_inodes()
> + * or get_inodes() and tries to free them.
> + */
> +void kick_inodes(struct kmem_cache *s, int nr, void **v, void *private)
> +{
> + struct inode *inode;
> + int i;
> + int abort = 0;
> + LIST_HEAD(freeable);
> + int active;
> +
> + for (i = 0; i < nr; i++) {
> + inode = v[i];
> + if (!inode)
> + continue;
In some cases, it's going to be impossible to empty a particular slab
cache page. For example, there may be one inode which has pages
locked into memory, or which we may decide (once we add some
intelligence into this function) is really not worth ejecting. In
that case, there's no point dumping the rest of the inodes on that
particular slab page.
> + if (inode_has_buffers(inode) || inode->i_data.nrpages) {
> + if (remove_inode_buffers(inode))
> + /*
> + * Should we really be doing this? Or
> + * limit the writeback here to only a few pages?
> + *
> + * Possibly an expensive operation but we
> + * cannot reclaim the inode if the pages
> + * are still present.
> + */
> + invalidate_mapping_pages(&inode->i_data,
> + 0, -1);
> + }
I do not think this function does what you think it does....
"invalidate_mapping_pages() will not block on I/O activity, and it
will refuse to invalidate pages which are dirty, locked, under
writeback, or mapped into page tables."
So you need to force the data to be written *first*, then get the
pages removed from the page table, and only then, call
invalidate_mapping_pages(). Otherwise, this is just going to
pointlessly drop pages from the page cache, trashing the page
cache's effectiveness, without actually making it possible to drop a
particular inode if it is being used at all by any process.
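The ordering Ted describes — write first, unmap, then invalidate — would look roughly like this (a sketch only; the posted patch calls just invalidate_mapping_pages()):

```c
	if (inode->i_data.nrpages) {
		/* 1. Write dirty pages and wait for the I/O to finish. */
		filemap_write_and_wait(&inode->i_data);
		/* 2. Remove the pages from any page tables. */
		unmap_mapping_range(&inode->i_data, 0, 0, 1);
		/* 3. Only now can clean, unmapped pages be invalidated. */
		invalidate_mapping_pages(&inode->i_data, 0, -1);
	}
```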
Presumably then the defrag code, since it was unable to free a
particular page, will go on pillaging and raping other inodes in the
inode cache, until it can actually create a hugepage. This is why you
really shouldn't start trying to trash an inode until you're
**really** sure it's possible to completely evict a 4k slab page of all
of its inodes.
- Ted
^ permalink raw reply [flat|nested] 56+ messages in thread
* Re: inodes: Support generic defragmentation
2010-01-30 19:26 ` tytso
@ 2010-01-31 8:34 ` Andi Kleen
2010-01-31 13:59 ` Dave Chinner
2010-01-31 21:02 ` tytso
0 siblings, 2 replies; 56+ messages in thread
From: Andi Kleen @ 2010-01-31 8:34 UTC (permalink / raw)
To: tytso, Christoph Lameter, Andi Kleen, Dave Chinner,
Miklos Szeredi, Alexander Viro, Christoph Hellwig,
Christoph Lameter, Rik van Riel, Pekka Enberg, akpm, Nick Piggin,
Hugh Dickins, linux-kernel
On Sat, Jan 30, 2010 at 02:26:23PM -0500, tytso@mit.edu wrote:
> On Fri, Jan 29, 2010 at 02:49:42PM -0600, Christoph Lameter wrote:
> > This implements the ability to remove inodes in a particular slab
> > from inode caches. In order to remove an inode we may have to write out
> > the pages of an inode, the inode itself and remove the dentries referring
> > to the node.
>
> How often is this going to happen? Removing an inode is an incredibly
The standard case is the classic updatedb. Lots of dentries/inodes cached
with no or little corresponding data cache.
> a huge number of pages that are actively getting used, the thrashing
> that is going to result is going to enormous.
I think the consensus so far is to try to avoid any inodes/dentries
which are dirty or used in any way.
I personally would prefer it to be more aggressive for memory offlining
for RAS purposes, but just handling the unused cases is a
good first step.
> "invalidate_mapping_pages() will not block on I/O activity, and it
> will refuse to invalidate pages which are dirty, locked, under
> writeback, or mapped into page tables."
I think that was the point.
-Andi
--
ak@linux.intel.com -- Speaking for myself only.
^ permalink raw reply [flat|nested] 56+ messages in thread
* Re: inodes: Support generic defragmentation
2010-01-31 8:34 ` Andi Kleen
@ 2010-01-31 13:59 ` Dave Chinner
2010-02-03 15:31 ` Christoph Lameter
2010-01-31 21:02 ` tytso
1 sibling, 1 reply; 56+ messages in thread
From: Dave Chinner @ 2010-01-31 13:59 UTC (permalink / raw)
To: Andi Kleen
Cc: tytso, Christoph Lameter, Miklos Szeredi, Alexander Viro,
Christoph Hellwig, Christoph Lameter, Rik van Riel, Pekka Enberg,
akpm, Nick Piggin, Hugh Dickins, linux-kernel
On Sun, Jan 31, 2010 at 09:34:09AM +0100, Andi Kleen wrote:
> On Sat, Jan 30, 2010 at 02:26:23PM -0500, tytso@mit.edu wrote:
> > On Fri, Jan 29, 2010 at 02:49:42PM -0600, Christoph Lameter wrote:
> > > This implements the ability to remove inodes in a particular slab
> > > from inode caches. In order to remove an inode we may have to write out
> > > the pages of an inode, the inode itself and remove the dentries referring
> > > to the node.
> >
> > How often is this going to happen? Removing an inode is an incredibly
>
> The standard case is the classic updatedb. Lots of dentries/inodes cached
> with no or little corresponding data cache.
I don't believe that updatedb has anything to do with causing
internal inode/dentry slab fragmentation. In all my testing I rarely
see use-once filesystem traversals cause internal slab
fragmentation. This appears to be a result of use-once filesystem
traversal resulting in slab pages full of objects that have the same
locality of access. Hence each new slab page that traversal
allocates will contain objects that will be adjacent in the LRU.
Hence LRU-based reclaim is very likely to free all the objects on
each page in the same pass and as such no fragmentation will occur.
All the cases of inode/dentry slab fragmentation I have seen are a
result of access patterns that result in slab pages containing
objects with different temporal localities. It's when the access
pattern is sufficiently distributed throughout the working set we
get the "need to free 95% of the objects in the entire cache to free
a single page" type of reclaim behaviour.
AFAICT, the defrag patches as they stand don't really address the
fundamental problem of differing temporal locality inside a slab
page. It makes the assumption that "partial page == defrag
candidate" but there isn't any further consideration of when any of
the remaining objects were last accessed. I think that this really
does need to be taken into account, especially considering that the
allocator tries to fill partial pages with new objects before
allocating new pages and so the page under reclaim might contain
very recently allocated objects.
Someone in a previous discussion on this patch set (Nick? Hugh,
maybe? I can't find the reference right now) mentioned something
like this about the design of the force-reclaim operations. IIRC the
suggestion was that it may be better to track LRU-ness by per-slab
page rather than per-object so that reclaim can target the slab
pages that - on aggregate - had the oldest objects in it. I think
this has merit - prevention of internal fragmentation seems like a
better approach to me than to try to cure it after it is already
present....
Cheers,
Dave.
--
Dave Chinner
david@fromorbit.com
^ permalink raw reply [flat|nested] 56+ messages in thread
* Re: inodes: Support generic defragmentation
2010-01-31 13:59 ` Dave Chinner
@ 2010-02-03 15:31 ` Christoph Lameter
2010-02-04 0:34 ` Dave Chinner
0 siblings, 1 reply; 56+ messages in thread
From: Christoph Lameter @ 2010-02-03 15:31 UTC (permalink / raw)
To: Dave Chinner
Cc: Andi Kleen, tytso, Miklos Szeredi, Alexander Viro,
Christoph Hellwig, Christoph Lameter, Rik van Riel, Pekka Enberg,
akpm, Nick Piggin, Hugh Dickins, linux-kernel
On Mon, 1 Feb 2010, Dave Chinner wrote:
> > The standard case is the classic updatedb. Lots of dentries/inodes cached
> > with no or little corresponding data cache.
>
> I don't believe that updatedb has anything to do with causing
> internal inode/dentry slab fragmentation. In all my testing I rarely
> see use-once filesystem traversals cause internal slab
> fragmentation. This appears to be a result of use-once filesystem
> traversal resulting in slab pages full of objects that have the same
> locality of access. Hence each new slab page that traversal
> allocates will contain objects that will be adjacent in the LRU.
> Hence LRU-based reclaim is very likely to free all the objects on
> each page in the same pass and as such no fragmentation will occur.
updatedb causes lots of partially allocated slab pages. While updatedb
runs other filesystem activities occur. And updatedb does not work in
straightforward linear fashion. dentries are cached and slowly expired etc
etc. Updatedb may not cause fragmentation at the level that you observed
with some of the filesystem loads on large systems.
> All the cases of inode/dentry slab fragmentation I have seen are a
> result of access patterns that result in slab pages containing
> objects with different temporal localities. It's when the access
> pattern is sufficiently distributed throughout the working set we
> get the "need to free 95% of the objects in the entire cache to free
> a single page" type of reclaim behaviour.
There are also other factors at play like the different NUMA node,
concurrent processes. A strictly optimized HPC workload may be able to
eliminate other factors, but that is not the case for typical workloads.
Access patterns are typically somewhat distributed.
> AFAICT, the defrag patches as they stand don't really address the
> fundamental problem of differing temporal locality inside a slab
> page. It makes the assumption that "partial page == defrag
> candidate" but there isn't any further consideration of when any of
> the remaining objects were last accessed. I think that this really
> does need to be taken into account, especially considering that the
> allocator tries to fill partial pages with new objects before
> allocating new pages and so the page under reclaim might contain
> very recently allocated objects.
Reclaim is only run if there is memory pressure. This means that lots of
reclaimable entities exist and therefore we can assume that many of these
have had a somewhat long lifetime. The allocator tries to fill partial
pages with new objects and then retires those pages to the full slab list.
Those are not subject to reclaim efforts covered here. A page under
reclaim is likely to contain many recently freed objects.
The remaining objects may have a long lifetime and a high usage pattern
but it is worth relocating them into other slabs if they prevent reclaim
of the page. Relocation occurs in this patchset by reclaim and then the
next use likely causes the reallocation in a partially allocated slab.
This means that objects with a high usage count will tend to be aggregated
in full slabs that are no longer subject to targeted reclaim.
We could improve the situation by allowing the moving of objects (which
would avoid the reclaim and realloc) but that is complex and so needs to
be deferred to a second stage (same approach we went through with page
migration).
> Someone in a previous discussion on this patch set (Nick? Hugh,
> maybe? I can't find the reference right now) mentioned something
> like this about the design of the force-reclaim operations. IIRC the
> suggestion was that it may be better to track LRU-ness by per-slab
> page rather than per-object so that reclaim can target the slab
> pages that - on aggregate - had the oldest objects in it. I think
> this has merit - prevention of internal fragmentation seems like a
> better approach to me than to try to cure it after it is already
> present....
LRUness exists in terms of the list of partial slab pages. Frequently
allocated slabs are in the front of the queue and less used slabs are in
the rear. Defrag/reclaim occurs from the rear.
^ permalink raw reply [flat|nested] 56+ messages in thread
* Re: inodes: Support generic defragmentation
2010-02-03 15:31 ` Christoph Lameter
@ 2010-02-04 0:34 ` Dave Chinner
2010-02-04 3:07 ` tytso
0 siblings, 1 reply; 56+ messages in thread
From: Dave Chinner @ 2010-02-04 0:34 UTC (permalink / raw)
To: Christoph Lameter
Cc: Andi Kleen, tytso, Miklos Szeredi, Alexander Viro,
Christoph Hellwig, Christoph Lameter, Rik van Riel, Pekka Enberg,
akpm, Nick Piggin, Hugh Dickins, linux-kernel
On Wed, Feb 03, 2010 at 09:31:49AM -0600, Christoph Lameter wrote:
> On Mon, 1 Feb 2010, Dave Chinner wrote:
>
> > > The standard case is the classic updatedb. Lots of dentries/inodes cached
> > > with no or little corresponding data cache.
> >
> > I don't believe that updatedb has anything to do with causing
> > internal inode/dentry slab fragmentation. In all my testing I rarely
> > see use-once filesystem traversals cause internal slab
> > fragmentation. This appears to be a result of use-once filesystem
> > traversal resulting in slab pages full of objects that have the same
> > locality of access. Hence each new slab page that traversal
> > allocates will contain objects that will be adjacent in the LRU.
> > Hence LRU-based reclaim is very likely to free all the objects on
> > each page in the same pass and as such no fragmentation will occur.
>
> updatedb causes lots of partially allocated slab pages. While updatedb
> runs other filesystem activities occur. And updatedb does not work in
> straightforward linear fashion. dentries are cached and slowly expired etc
> etc.
Sure, but my point was that updatedb hits lots of inodes only once,
and for those objects the order of caching and expiration is
exactly the same. Hence after reclaim of the updatedb dentries/inodes
the amount of fragmentation in the slab will be almost exactly the
same as it was before the updatedb run.
> > All the cases of inode/dentry slab fragmentation I have seen are a
> > result of access patterns that result in slab pages containing
> > objects with different temporal localities. It's when the access
> > pattern is sufficiently distributed throughout the working set we
> > get the "need to free 95% of the objects in the entire cache to free
> > a single page" type of reclaim behaviour.
>
> There are also other factors at play like the different NUMA node,
> concurrent processes.
Yes, those are just more factors in the access patterns being
"sufficiently distributed throughout the working set".
> > AFAICT, the defrag patches as they stand don't really address the
> > fundamental problem of differing temporal locality inside a slab
> > page. It makes the assumption that "partial page == defrag
> > candidate" but there isn't any further consideration of when any of
> the remaining objects were last accessed. I think that this really
> > does need to be taken into account, especially considering that the
> > allocator tries to fill partial pages with new objects before
> > allocating new pages and so the page under reclaim might contain
> > very recently allocated objects.
>
> Reclaim is only run if there is memory pressure. This means that lots of
> reclaimable entities exist and therefore we can assume that many of these
> have had a somewhat long lifetime. The allocator tries to fill partial
> pages with new objects and then retires those pages to the full slab list.
> Those are not subject to reclaim efforts covered here. A page under
> reclaim is likely to contain many recently freed objects.
Not necessarily. It might contain only one recently reclaimed object,
but have several other hot objects in the page....
> The remaining objects may have a long lifetime and a high usage pattern
> but it is worthwhile to relocate them into other slabs if they prevent reclaim
> of the page.
I completely disagree. If you have to trash all the cache hot
information related to the cached object in the process of
relocating it, then you've just screwed up application performance
and in a completely unpredictable manner. Admins will be tearing out
their hair trying to work out why their applications randomly slow
down....
> > Someone in a previous discussion on this patch set (Nick? Hugh,
> > maybe? I can't find the reference right now) mentioned something
> > like this about the design of the force-reclaim operations. IIRC the
> > suggestion was that it may be better to track LRU-ness by per-slab
> > page rather than per-object so that reclaim can target the slab
> > pages that - on aggregate - had the oldest objects in it. I think
> > this has merit - prevention of internal fragmentation seems like a
> > better approach to me than to try to cure it after it is already
> > present....
>
> LRUness exists in terms of the list of partial slab pages. Frequently
> allocated slabs are in the front of the queue and less used slabs are in
> the rear. Defrag/reclaim occurs from the rear.
You missed my point again. You're still talking about tracking pages
with no regard to the objects remaining in the pages. A page, full
or partial, is a candidate for object reclaim if none of the objects
on it are referenced and have not been referenced for some time.
You are currently relying on the existing LRU reclaim to move a slab
from full to partial to trigger defragmentation, but you ignore the
hotness of the rest of the objects on the page by trying to reclaim
the page that has been partial for the longest period of time.
What it comes down to is that the slab has two states for objects -
allocated and free - but what we really need here is 3 states -
allocated, unused and freed. We currently track unused objects
outside the slab in LRU lists and, IMO, that is the source of our
fragmentation problems because it has no knowledge of the spatial
layout of the slabs and the state of other objects in the page.
What I'm suggesting is that we ditch the external LRUs and track the
"unused" state inside the slab and then use that knowledge to decide
which pages to reclaim. e.g. slab_object_used() is called when the
first reference on an object is taken. slab_object_unused() is
called when the reference count goes to zero. The slab can then
track unused objects internally and when reclaim is needed can
select pages (full or partial) that only contain unused objects to
reclaim.
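The used/unused accounting suggested above might be sketched roughly as
follows; all names (slab_object_used(), struct slab_page) and the counter
layout are hypothetical illustrations, not an existing kernel API:

```c
#include <assert.h>

/* Hypothetical sketch of three-state object accounting: objects are
   allocated, and some of those are "used" (referenced). A page whose
   allocated objects are all unused becomes a reclaim candidate. */
struct slab_page {
	int allocated;		/* objects handed out by the allocator */
	int used;		/* objects holding at least one reference */
	int reclaimable;	/* 1 once every allocated object is unused */
};

void slab_object_used(struct slab_page *pg)
{
	/* first reference taken on an object: the page is Active */
	pg->used++;
	pg->reclaimable = 0;
}

void slab_object_unused(struct slab_page *pg)
{
	/* last reference dropped: if no used objects remain, the whole
	   page is Unused and can be queued for page-based reclaim */
	pg->used--;
	if (pg->used == 0 && pg->allocated > 0)
		pg->reclaimable = 1;
}
```

The shrinkers would then only ever see lists of objects from pages whose
reclaimable flag is set.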
From there the existing reclaim algorithms could be used to reclaim
the objects. i.e. the shrinkers become a slab reclaim callout that
are passed a linked list of objects to reclaim, very similar to the
way __shrink_dcache_sb() and prune_icache() first build a list of
objects to reclaim, then work off that list of objects.
If the goal is to reduce fragmentation, then this seems like a
much better approach to me - it is inherently fragmentation
resistant and much more closely aligned to existing object reclaim
algorithms.
If the goal is random slab page shootdown (e.g. for hwpoison), then
it's a much more complex problem because you can't shoot down
active, referenced objects without revoke(). Hence I think the
two problem spaces should be kept separate as it's not obvious
that they can both be solved with the same mechanism....
Cheers,
Dave.
--
Dave Chinner
david@fromorbit.com
^ permalink raw reply [flat|nested] 56+ messages in thread
* Re: inodes: Support generic defragmentation
2010-02-04 0:34 ` Dave Chinner
@ 2010-02-04 3:07 ` tytso
2010-02-04 3:39 ` Dave Chinner
0 siblings, 1 reply; 56+ messages in thread
From: tytso @ 2010-02-04 3:07 UTC (permalink / raw)
To: Dave Chinner
Cc: Christoph Lameter, Andi Kleen, Miklos Szeredi, Alexander Viro,
Christoph Hellwig, Christoph Lameter, Rik van Riel, Pekka Enberg,
akpm, Nick Piggin, Hugh Dickins, linux-kernel
On Thu, Feb 04, 2010 at 11:34:10AM +1100, Dave Chinner wrote:
>
> I completely disagree. If you have to trash all the cache hot
> information related to the cached object in the process of
> relocating it, then you've just screwed up application performance
> and in a completely unpredictable manner. Admins will be tearing out
> their hair trying to work out why their applications randomly slow
> down....
...
> You missed my point again. You're still talking about tracking pages
> with no regard to the objects remaining in the pages. A page, full
> or partial, is a candidate for object reclaim if none of the objects
> on it are referenced and have not been referenced for some time.
>
> You are currently relying on the existing LRU reclaim to move a slab
> from full to partial to trigger defragmentation, but you ignore the
> hotness of the rest of the objects on the page by trying to reclaim
> the page that has been partial for the longest period of time.
Well said.
This is exactly what I was complaining about as well, but apparently I
wasn't understood the first time either. :-(
This *has* to be fixed, or this set of patches is going to completely
trash the overall system performance, by trashing the page cache.
> What it comes down to is that the slab has two states for objects -
> allocated and free - but what we really need here is 3 states -
> allocated, unused and freed. We currently track unused objects
> outside the slab in LRU lists and, IMO, that is the source of our
> fragmentation problems because it has no knowledge of the spatial
> layout of the slabs and the state of other objects in the page.
>
> What I'm suggesting is that we ditch the external LRUs and track the
> "unused" state inside the slab and then use that knowledge to decide
> which pages to reclaim.
Or maybe we need a way to track the LRU of the slab page as a
whole? Any time we touch an object on the slab page, we touch the
last-used time of the slab as a whole.
It's actually more complicated than that, though. Even if no one has
touched a particular inode, if one of the inodes in the slab page is
pinned down because it is in use, there's no point in the
defragmenter trying to throw away valuable cached pages associated
with other inodes in the same slab page --- because of that single
pinned inode, YOU'RE NEVER GOING TO DEFRAG THAT PAGE.
And of course, if the inode is pinned down because it is opened and/or
mmaped, then its associated dcache entry can't be freed either, so
there's no point trying to trash all of its sibling dentries on the
same page as that dcache entry.
Randomly shooting down dcache and inode entries in the hopes of
coalescing free pages into hugepages is just not cool. If
you're that desperate, you might as well just do "echo 3 >
/proc/sys/vm/drop_caches". From my read of the algorithms, it's going
to be almost as destructive to system performance.
- Ted
^ permalink raw reply [flat|nested] 56+ messages in thread
* Re: inodes: Support generic defragmentation
2010-02-04 3:07 ` tytso
@ 2010-02-04 3:39 ` Dave Chinner
2010-02-04 9:33 ` Nick Piggin
2010-02-04 16:59 ` Christoph Lameter
0 siblings, 2 replies; 56+ messages in thread
From: Dave Chinner @ 2010-02-04 3:39 UTC (permalink / raw)
To: tytso, Christoph Lameter, Andi Kleen, Miklos Szeredi,
Alexander Viro, Christoph Hellwig, Christoph Lameter,
Rik van Riel, Pekka Enberg, akpm, Nick Piggin, Hugh Dickins,
linux-kernel
On Wed, Feb 03, 2010 at 10:07:36PM -0500, tytso@mit.edu wrote:
> On Thu, Feb 04, 2010 at 11:34:10AM +1100, Dave Chinner wrote:
> > What it comes down to is that the slab has two states for objects -
> > allocated and free - but what we really need here is 3 states -
> > allocated, unused and freed. We currently track unused objects
> > outside the slab in LRU lists and, IMO, that is the source of our
> > fragmentation problems because it has no knowledge of the spatial
> > layout of the slabs and the state of other objects in the page.
> >
> > What I'm suggesting is that we ditch the external LRUs and track the
> > "unused" state inside the slab and then use that knowledge to decide
> > which pages to reclaim.
>
> Or maybe we need a way to track the LRU of the slab page as a
> whole? Any time we touch an object on the slab page, we touch the
> last-used time of the slab as a whole.
Yes, that's pretty much what I have been trying to describe. ;)
(And, IIUC, what I think Nick has been trying to describe as well
when he's been saying we should "turn reclaim upside down".)
It seems to me to be pretty simple to track, too, if we define pages
for reclaim to only be those that are full of unused objects. i.e.
the pages have the two states:
- Active: some allocated and referenced object on the page
=> no need for LRU tracking of these
- Unused: all allocated objects on the page are not used
=> these pages are LRU tracked within the slab
A single referenced object is enough to change the state of the
page from Unused to Active, and when a page transitions from
Active to Unused it goes on the MRU end of the LRU queue.
Reclaim would then start with the oldest pages on the LRU....
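A toy illustration of the page-level queueing this describes: Active pages
are not tracked at all, a page is appended at the MRU end when it turns
Unused, and reclaim takes from the oldest end. The fixed-size array stands
in for a real list, and every name here is invented:

```c
#include <assert.h>
#include <stddef.h>

#define MAX_LRU 16

struct unused_lru {
	void *pages[MAX_LRU];	/* index 0 = oldest (LRU end) */
	int count;
};

void lru_page_unused(struct unused_lru *lru, void *page)
{
	/* Active -> Unused transition: enqueue at the MRU end */
	if (lru->count < MAX_LRU)
		lru->pages[lru->count++] = page;
}

void *lru_pick_reclaim(struct unused_lru *lru)
{
	/* reclaim starts with the page that has been Unused longest */
	if (lru->count == 0)
		return NULL;
	void *oldest = lru->pages[0];
	for (int i = 1; i < lru->count; i++)
		lru->pages[i - 1] = lru->pages[i];
	lru->count--;
	return oldest;
}
```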
> It's actually more complicated than that, though. Even if no one has
> touched a particular inode, if one of the inodes in the slab page is
> pinned down because it is in use,
A single active object like this would make the slab page Active, and
therefore not a candidate for reclaim. Also, we already reclaim
dentries before inodes because dentries pin inodes, so our
algorithms for reclaim already deal with these ordering issues for
us.
...
> And of course, if the inode is pinned down because it is opened and/or
> mmaped, then its associated dcache entry can't be freed either, so
> there's no point trying to trash all of its sibling dentries on the
> same page as that dcache entry.
Agreed - that's why I think preventing fragmentation caused by LRU
reclaim is best dealt with internally to slab where both object age
and locality can be taken into account.
Cheers,
Dave.
--
Dave Chinner
david@fromorbit.com
^ permalink raw reply [flat|nested] 56+ messages in thread
* Re: inodes: Support generic defragmentation
2010-02-04 3:39 ` Dave Chinner
@ 2010-02-04 9:33 ` Nick Piggin
2010-02-04 17:13 ` Christoph Lameter
2010-02-04 16:59 ` Christoph Lameter
1 sibling, 1 reply; 56+ messages in thread
From: Nick Piggin @ 2010-02-04 9:33 UTC (permalink / raw)
To: Dave Chinner
Cc: tytso, Christoph Lameter, Andi Kleen, Miklos Szeredi,
Alexander Viro, Christoph Hellwig, Christoph Lameter,
Rik van Riel, Pekka Enberg, akpm, Nick Piggin, Hugh Dickins,
linux-kernel
On Thu, Feb 04, 2010 at 02:39:11PM +1100, Dave Chinner wrote:
> On Wed, Feb 03, 2010 at 10:07:36PM -0500, tytso@mit.edu wrote:
> > On Thu, Feb 04, 2010 at 11:34:10AM +1100, Dave Chinner wrote:
> > > What it comes down to is that the slab has two states for objects -
> > > allocated and free - but what we really need here is 3 states -
> > > allocated, unused and freed. We currently track unused objects
> > > outside the slab in LRU lists and, IMO, that is the source of our
> > > fragmentation problems because it has no knowledge of the spatial
> > > layout of the slabs and the state of other objects in the page.
> > >
> > > What I'm suggesting is that we ditch the external LRUs and track the
> > > "unused" state inside the slab and then use that knowledge to decide
> > > which pages to reclaim.
> >
> > Or maybe we need a way to track the LRU of the slab page as a
> > whole? Any time we touch an object on the slab page, we touch the
> > last-used time of the slab as a whole.
>
> Yes, that's pretty much what I have been trying to describe. ;)
> (And, IIUC, what I think Nick has been trying to describe as well
> when he's been saying we should "turn reclaim upside down".)
Well what I described is to do the slab pinning from the reclaim path
(rather than from slab calling into the subsystem). All slab locking
is basically "innermost", so you can pretty much poke the slab layer as
much as you like from the subsystem.
After that, LRU on slabs should be fairly easy. Slab could provide a
private per-slab pointer for example that is managed by the caller.
Subsystem can then call into slab to find the objects.
> It seems to me to be pretty simple to track, too, if we define pages
> for reclaim to only be those that are full of unused objects. i.e.
> the pages have the two states:
>
> - Active: some allocated and referenced object on the page
> => no need for LRU tracking of these
> - Unused: all allocated objects on the page are not used
> => these pages are LRU tracked within the slab
>
> A single referenced object is enough to change the state of the
> page from Unused to Active, and when a page transitions from
> Active to Unused it goes on the MRU end of the LRU queue.
> Reclaim would then start with the oldest pages on the LRU....
>
> > It's actually more complicated than that, though. Even if no one has
> > touched a particular inode, if one of the inodes in the slab page is
> > pinned down because it is in use,
>
> A single active object like this would make the slab page Active, and
> therefore not a candidate for reclaim. Also, we already reclaim
> dentries before inodes because dentries pin inodes, so our
> algorithms for reclaim already deal with these ordering issues for
> us.
>
> ...
>
> > And of course, if the inode is pinned down because it is opened and/or
> > mmaped, then its associated dcache entry can't be freed either, so
> > there's no point trying to trash all of its sibling dentries on the
> > same page as that dcache entry.
>
> Agreed - that's why I think preventing fragmentation caused by LRU
> reclaim is best dealt with internally to slab where both object age
> and locality can be taken into account.
^ permalink raw reply [flat|nested] 56+ messages in thread
* Re: inodes: Support generic defragmentation
2010-02-04 9:33 ` Nick Piggin
@ 2010-02-04 17:13 ` Christoph Lameter
2010-02-08 7:37 ` Nick Piggin
0 siblings, 1 reply; 56+ messages in thread
From: Christoph Lameter @ 2010-02-04 17:13 UTC (permalink / raw)
To: Nick Piggin
Cc: Dave Chinner, tytso, Andi Kleen, Miklos Szeredi, Alexander Viro,
Christoph Hellwig, Christoph Lameter, Rik van Riel, Pekka Enberg,
akpm, Nick Piggin, Hugh Dickins, linux-kernel
On Thu, 4 Feb 2010, Nick Piggin wrote:
> Well what I described is to do the slab pinning from the reclaim path
> (rather than from slab calling into the subsystem). All slab locking
> is basically "innermost", so you can pretty much poke the slab layer as
> much as you like from the subsystem.
Reclaim/defrag is called from the reclaim path (of the VM). We could
enable a call from the fs reclaim code into the slab. But how would this
work?
> After that, LRU on slabs should be fairly easy. Slab could provide a
> private per-slab pointer for example that is managed by the caller.
> Subsystem can then call into slab to find the objects.
Sure, with some minor changes we could have a call that gives you the
list of neighboring objects in a slab, while locking it? Then you can look
at the objects and decide which ones can be tossed and then do another
call to release the objects and unlock the slab.
^ permalink raw reply [flat|nested] 56+ messages in thread
* Re: inodes: Support generic defragmentation
2010-02-04 17:13 ` Christoph Lameter
@ 2010-02-08 7:37 ` Nick Piggin
2010-02-08 17:40 ` Christoph Lameter
2010-02-08 22:13 ` Dave Chinner
0 siblings, 2 replies; 56+ messages in thread
From: Nick Piggin @ 2010-02-08 7:37 UTC (permalink / raw)
To: Christoph Lameter
Cc: Dave Chinner, tytso, Andi Kleen, Miklos Szeredi, Alexander Viro,
Christoph Hellwig, Christoph Lameter, Rik van Riel, Pekka Enberg,
akpm, Nick Piggin, Hugh Dickins, linux-kernel
On Thu, Feb 04, 2010 at 11:13:15AM -0600, Christoph Lameter wrote:
> On Thu, 4 Feb 2010, Nick Piggin wrote:
>
> > Well what I described is to do the slab pinning from the reclaim path
> > (rather than from slab calling into the subsystem). All slab locking
> > is basically "innermost", so you can pretty much poke the slab layer as
> > much as you like from the subsystem.
>
> Reclaim/defrag is called from the reclaim path (of the VM). We could
> enable a call from the fs reclaim code into the slab. But how would this
> work?
Well the exact details will depend, but I feel that things should
get easier because you pin the object (and therefore the slab) via
the normal and well tested reclaim paths.
So for example, for dcache, you will come in and take the normal
locks: dcache_lock, sb_lock, pin the sb, umount_lock. At which
point you have pinned dentries without changing any locking. So
then you can find the first entry on the LRU, and should be able
to then build a list of dentries on the same slab.
You still have the potential issue of now finding objects that would
not be visible by searching the LRU alone. However at least the
locking should be simplified.
> > After that, LRU on slabs should be fairly easy. Slab could provide a
> > private per-slab pointer for example that is managed by the caller.
> > Subsystem can then call into slab to find the objects.
>
> > Sure, with some minor changes we could have a call that gives you the
> list of neighboring objects in a slab, while locking it? Then you can look
> at the objects and decide which ones can be tossed and then do another
> call to release the objects and unlock the slab.
Yep. Well... you may not even need to ask slab layer to lock the
slab. Provided that the subsystem is locking out changes. It could
possibly be helpful to have a call to lock and unlock the slab,
although usage of such an API would have to be very careful.
Thanks,
Nick
^ permalink raw reply [flat|nested] 56+ messages in thread
* Re: inodes: Support generic defragmentation
2010-02-08 7:37 ` Nick Piggin
@ 2010-02-08 17:40 ` Christoph Lameter
2010-02-08 22:13 ` Dave Chinner
1 sibling, 0 replies; 56+ messages in thread
From: Christoph Lameter @ 2010-02-08 17:40 UTC (permalink / raw)
To: Nick Piggin
Cc: Dave Chinner, tytso, Andi Kleen, Miklos Szeredi, Alexander Viro,
Christoph Hellwig, Christoph Lameter, Rik van Riel, Pekka Enberg,
akpm, Nick Piggin, Hugh Dickins, linux-kernel
On Mon, 8 Feb 2010, Nick Piggin wrote:
> > > After that, LRU on slabs should be fairly easy. Slab could provide a
> > > private per-slab pointer for example that is managed by the caller.
> > > Subsystem can then call into slab to find the objects.
> >
> > Sure, with some minor changes we could have a call that gives you the
> > list of neighboring objects in a slab, while locking it? Then you can look
> > at the objects and decide which ones can be tossed and then do another
> > call to release the objects and unlock the slab.
>
> Yep. Well... you may not even need to ask slab layer to lock the
> slab. Provided that the subsystem is locking out changes. It could
> possibly be helpful to have a call to lock and unlock the slab,
> although usage of such an API would have to be very careful.
True, if you are holding a reference to an object in a slab page and
there is a guarantee that the object is not going away then the slab is already
effectively pinned.
So we just need a call that returns
1. The number of allocated objects in a slab page
2. The total possible number of objects
3. A list of pointers to the objects
?
Then reclaim could make a decision if you want these objects to be
reclaimed.
Such a function could actually be much less code than the current
patchset and would also be easy to do for SLAB/SLOB.
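A sketch of what such a call might look like. The function name, the info
struct, and the toy fixed-slot "page" are all assumptions for illustration,
not anything the slab allocators currently provide:

```c
#include <assert.h>
#include <stddef.h>

#define SLOTS 4		/* toy object capacity of one slab page */

struct slab_page_info {
	int allocated;	/* objects currently live on the page */
	int capacity;	/* total object slots on the page */
};

/* Fill info and a caller-supplied array with pointers to the live
   objects on the page; return how many pointers were filled. */
int slab_page_objects(void *slots[SLOTS], struct slab_page_info *info,
		      void **objs, int max)
{
	int filled = 0, live = 0;

	for (int i = 0; i < SLOTS; i++) {
		if (slots[i]) {
			live++;
			if (filled < max)
				objs[filled++] = slots[i];
		}
	}
	info->allocated = live;
	info->capacity = SLOTS;
	return filled;	/* reclaim can now decide whether these go */
}
```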
^ permalink raw reply [flat|nested] 56+ messages in thread
* Re: inodes: Support generic defragmentation
2010-02-08 7:37 ` Nick Piggin
2010-02-08 17:40 ` Christoph Lameter
@ 2010-02-08 22:13 ` Dave Chinner
1 sibling, 0 replies; 56+ messages in thread
From: Dave Chinner @ 2010-02-08 22:13 UTC (permalink / raw)
To: Nick Piggin
Cc: Christoph Lameter, tytso, Andi Kleen, Miklos Szeredi,
Alexander Viro, Christoph Hellwig, Christoph Lameter,
Rik van Riel, Pekka Enberg, akpm, Nick Piggin, Hugh Dickins,
linux-kernel
On Mon, Feb 08, 2010 at 06:37:53PM +1100, Nick Piggin wrote:
> On Thu, Feb 04, 2010 at 11:13:15AM -0600, Christoph Lameter wrote:
> > On Thu, 4 Feb 2010, Nick Piggin wrote:
> >
> > > Well what I described is to do the slab pinning from the reclaim path
> > > (rather than from slab calling into the subsystem). All slab locking
> > > is basically "innermost", so you can pretty much poke the slab layer as
> > > much as you like from the subsystem.
> >
> > Reclaim/defrag is called from the reclaim path (of the VM). We could
> > enable a call from the fs reclaim code into the slab. But how would this
> > work?
>
> Well the exact details will depend, but I feel that things should
> get easier because you pin the object (and therefore the slab) via
> the normal and well tested reclaim paths.
>
> So for example, for dcache, you will come in and take the normal
> locks: dcache_lock, sb_lock, pin the sb, umount_lock. At which
> point you have pinned dentries without changing any locking. So
> then you can find the first entry on the LRU, and should be able
> to then build a list of dentries on the same slab.
>
> You still have the potential issue of now finding objects that would
> not be visible by searching the LRU alone. However at least the
> locking should be simplified.
Very true, but that leads us to the same problem of fragmented
caches because we empty unused objects off slabs that are still
pinned by hot objects and don't free the page. I agree that we can't
totally avoid this problem, but I still think that using an object
based LRU for reclaim has a fundamental mismatch with page based
reclaim that makes this problem worse than it could be.
FWIW, if we change the above to keeping a page based LRU in the slab
cache and the slab picks a page to reclaim, then the problem goes
mostly away, I think. We don't need to pin the slab to select and
prepare a page to reclaim - the cache only needs to be locked before
it starts reclaim. I think this has a much better chance of
reclaiming entire pages in situations where LRU based reclaim will
leave fragmentation.
i.e. instead of:
shrink_slab
-> external shrinker
-> lock cache
-> find reclaimable object
-> call into slab w/ object
-> return longer list of objects
-> reclaim objects
we do:
shrink_slab
-> internal shrinker
-> find oldest page and make object list
-> external shrinker
-> lock cache
-> reclaim objects
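The second flow could be sketched like this, with all types and names
invented; the point is only that the slab layer (internal shrinker) picks
the oldest page and builds the object list, and the subsystem's callout
only frees what it is handed:

```c
#include <assert.h>
#include <stddef.h>

typedef int (*reclaim_fn)(void **objs, int nr);	/* returns nr freed */

struct toy_cache {
	void *oldest_page_objs[4];	/* objects on the oldest Unused page */
	int nr;
	reclaim_fn reclaim;		/* external callout from the subsystem */
};

int shrink_slab_internal(struct toy_cache *c)
{
	/* internal shrinker: slab selects the page, subsystem frees */
	if (c->nr == 0)
		return 0;
	int freed = c->reclaim(c->oldest_page_objs, c->nr);
	c->nr -= freed;
	return freed;
}

/* example subsystem callout that frees everything it is handed */
int reclaim_all(void **objs, int nr)
{
	for (int i = 0; i < nr; i++)
		objs[i] = NULL;
	return nr;
}
```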
Cheers,
Dave.
--
Dave Chinner
david@fromorbit.com
^ permalink raw reply [flat|nested] 56+ messages in thread
* Re: inodes: Support generic defragmentation
2010-02-04 3:39 ` Dave Chinner
2010-02-04 9:33 ` Nick Piggin
@ 2010-02-04 16:59 ` Christoph Lameter
2010-02-06 0:39 ` Dave Chinner
1 sibling, 1 reply; 56+ messages in thread
From: Christoph Lameter @ 2010-02-04 16:59 UTC (permalink / raw)
To: Dave Chinner
Cc: tytso, Andi Kleen, Miklos Szeredi, Alexander Viro,
Christoph Hellwig, Christoph Lameter, Rik van Riel, Pekka Enberg,
akpm, Nick Piggin, Hugh Dickins, linux-kernel
On Thu, 4 Feb 2010, Dave Chinner wrote:
> > Or maybe we need a way to track the LRU of the slab page as a
> > whole? Any time we touch an object on the slab page, we touch the
> > last-used time of the slab as a whole.
>
> Yes, that's pretty much what I have been trying to describe. ;)
> (And, IIUC, what I think Nick has been trying to describe as well
> when he's been saying we should "turn reclaim upside down".)
>
> It seems to me to be pretty simple to track, too, if we define pages
> for reclaim to only be those that are full of unused objects. i.e.
> the pages have the two states:
>
> - Active: some allocated and referenced object on the page
> => no need for LRU tracking of these
> - Unused: all allocated objects on the page are not used
> => these pages are LRU tracked within the slab
>
> A single referenced object is enough to change the state of the
> page from Unused to Active, and when a page transitions from
> Active to Unused it goes on the MRU end of the LRU queue.
> Reclaim would then start with the oldest pages on the LRU....
These are describing ways of reclaim that could be implemented by the fs
layer. The information about which item is "unused" or "referenced" is a notion
of the fs. The slab caches know only of two object states: Free or
allocated. LRU handling of slab pages is something entirely different
from the LRU of the inodes and dentries.
> > And of course, if the inode is pinned down because it is opened and/or
> > mmaped, then its associated dcache entry can't be freed either, so
> > there's no point trying to trash all of its sibling dentries on the
> > same page as that dcache entry.
>
> Agreed - that's why I think preventing fragmentation caused by LRU
> reclaim is best dealt with internally to slab where both object age
> and locality can be taken into account.
Object age is not known by the slab. Locality is only considered in terms
of hardware placement (Numa nodes) not in relationship to objects of other
caches (like inodes and dentries) or the same caches.
If we want this then we may end up with a special allocator for the
filesystem.
You and I discussed a couple of years ago adding a reference count to
the objects of the slab allocator. Those explorations resulted in a much
more complicated and different allocator that is geared to the needs of
the filesystem for reclaim.
^ permalink raw reply [flat|nested] 56+ messages in thread
* Re: inodes: Support generic defragmentation
2010-02-04 16:59 ` Christoph Lameter
@ 2010-02-06 0:39 ` Dave Chinner
0 siblings, 0 replies; 56+ messages in thread
From: Dave Chinner @ 2010-02-06 0:39 UTC (permalink / raw)
To: Christoph Lameter
Cc: tytso, Andi Kleen, Miklos Szeredi, Alexander Viro,
Christoph Hellwig, Christoph Lameter, Rik van Riel, Pekka Enberg,
akpm, Nick Piggin, Hugh Dickins, linux-kernel
On Thu, Feb 04, 2010 at 10:59:26AM -0600, Christoph Lameter wrote:
> On Thu, 4 Feb 2010, Dave Chinner wrote:
>
> > > Or maybe we need a way to track the LRU of the slab page as a
> > > whole? Any time we touch an object on the slab page, we touch the
> > > last-used time of the slab as a whole.
> >
> > Yes, that's pretty much what I have been trying to describe. ;)
> > (And, IIUC, what I think Nick has been trying to describe as well
> > when he's been saying we should "turn reclaim upside down".)
> >
> > It seems to me to be pretty simple to track, too, if we define pages
> > for reclaim to only be those that are full of unused objects. i.e.
> > the pages have the two states:
> >
> > - Active: some allocated and referenced object on the page
> > => no need for LRU tracking of these
> > - Unused: all allocated objects on the page are not used
> > => these pages are LRU tracked within the slab
> >
> > A single referenced object is enough to change the state of the
> > page from Unused to Active, and when a page transitions from
> > Active to Unused it goes on the MRU end of the LRU queue.
> > Reclaim would then start with the oldest pages on the LRU....
>
> These are describing ways of reclaim that could be implemented by the fs
> layer. The information about which item is "unused" or "referenced" is a notion
> of the fs. The slab caches know only of two object states: Free or
> allocated. LRU handling of slab pages is something entirely different
> from the LRU of the inodes and dentries.
Ah, perhaps you missed my previous email in the thread about adding
a third object state to the slab - i.e. an unused state? And an
interface (slab_object_used()/slab_object_unused()) to allow the
external users to tell the slab about state changes of objects
on the first/last reference to the object. That would allow the
tracking as I stated above....
> > > And of course, if the inode is pinned down because it is opened and/or
> > > mmaped, then its associated dcache entry can't be freed either, so
> > > there's no point trying to trash all of its sibling dentries on the
> > > same page as that dcache entry.
> >
> > Agreed - that's why I think preventing fragmentation caused by LRU
> > reclaim is best dealt with internally to slab where both object age
> > and locality can be taken into account.
>
> Object age is not known by the slab.
See above.
> Locality is only considered in terms
> of hardware placement (Numa nodes) not in relationship to objects of other
> caches (like inodes and dentries) or the same caches.
And that is the deficiency we've been talking about correcting! i.e.
that object <-> page locality needs to be taken into account during
reclaim. Moving used/unused knowledge into the slab where page/object
locality is known is one way of doing that....
> If we want this then we may end up with a special allocator for the
> filesystem.
I don't see why a small extension to the slab code can't fix this...
> You and I discussed a couple of years ago adding a reference count to
> the objects of the slab allocator. Those explorations resulted in a much
> more complicated and different allocator that is geared to the needs of
> the filesystem for reclaim.
And those discussions and explorations lead to the current defrag
code. After a couple of years, I don't think that the design we came
up with back then is the best way to approach the problem - it still
has many, many flaws. We need to explore different approaches
because none of the evolutionary approaches (i.e. tack something
on the side) appear to be sufficient.
Cheers,
Dave.
--
Dave Chinner
david@fromorbit.com
^ permalink raw reply [flat|nested] 56+ messages in thread
* Re: inodes: Support generic defragmentation
2010-01-31 8:34 ` Andi Kleen
2010-01-31 13:59 ` Dave Chinner
@ 2010-01-31 21:02 ` tytso
2010-02-01 10:17 ` Andi Kleen
1 sibling, 1 reply; 56+ messages in thread
From: tytso @ 2010-01-31 21:02 UTC (permalink / raw)
To: Andi Kleen
Cc: Christoph Lameter, Dave Chinner, Miklos Szeredi, Alexander Viro,
Christoph Hellwig, Christoph Lameter, Rik van Riel, Pekka Enberg,
akpm, Nick Piggin, Hugh Dickins, linux-kernel
On Sun, Jan 31, 2010 at 09:34:09AM +0100, Andi Kleen wrote:
>
> The standard case is the classic updatedb. Lots of dentries/inodes cached
> with no or little corresponding data cache.
>
> > a huge number of pages that are actively getting used, the thrashing
> > that is going to result is going to enormous.
>
> I think the consensus so far is to try to avoid any inodes/dentries
> which are dirty or used in any way.
OK, but in that case, the kick_inodes should check to see if the inode
is in use in any way (i.e., has dentries open that will tie it down,
is open, has pages that are dirty or are mapped into some page table)
before attempting to invalidate any of its pages. The patch as
currently constituted doesn't do that. It will attempt to drop all
pages owned by that inode before checking for any of these conditions.
If I wanted that, I'd just do "echo 3 > /proc/sys/vm/drop_caches".
Worse yet, *after* it does this, it tries to write out the pages of
the inode. #1, this is pointless, since if the inode had any dirty pages,
they wouldn't have been invalidated, since it calls write_inode_now()
*after* calling invalidate_mapping_pages(), so the previously dirty
pages will still be mapped, and prevent the inode from being
flushed. #2, it interferes with delayed allocation and becomes
another writeback path, which means some dirty pages might get flushed
too early and it does this writeout without any of the congestion
handling code in the bdi writeback paths.
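The ordering problem described above can be modeled in a few lines of illustrative user-space C (not kernel code; the struct and page counts are made up): invalidation skips dirty pages, so invalidating before writeback leaves pages behind, while writing first and invalidating afterwards empties the mapping.

```c
#include <assert.h>

/* Toy model: a mapping holds some clean and some dirty pages. */
struct mapping { int clean, dirty; };

/* Like invalidate_mapping_pages(): only clean pages can be dropped. */
static void invalidate_pages(struct mapping *m) { m->clean = 0; }

/* Like write_inode_now(): dirty pages become clean but stay cached. */
static void write_pages(struct mapping *m)
{
	m->clean += m->dirty;
	m->dirty = 0;
}

/* The patch's order: invalidate, then write. Dirty pages survive. */
static int pages_left_invalidate_then_write(struct mapping m)
{
	invalidate_pages(&m);
	write_pages(&m);
	return m.clean + m.dirty;
}

/* The order Ted argues for: write first, then invalidate. */
static int pages_left_write_then_invalidate(struct mapping m)
{
	write_pages(&m);
	invalidate_pages(&m);
	return m.clean + m.dirty;
}
```

With 4 clean and 3 dirty pages, the first ordering leaves the 3 formerly-dirty pages cached; the second leaves nothing.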
If the consensus is "avoid any inodes/dentries which are dirty or
used in any way", THIS PATCH DOESN'T DO THAT.
I'd go further, and say that it should avoid trying to flush any inode
if any of its sibling inodes on the slab cache are dirty or in use in
any way. Otherwise, you end up dropping pages from the page cache and
still not being able to do any defragmentation.
> I personally would prefer it to be more aggressive for memory offlining
> for RAS purposes, though, but just handling the unused cases is a
> good first step.
If you want something more aggressive, why not just "echo 3 >
/proc/sys/vm/drop_caches"? We have that already. If the answer is,
because it will trash the performance of the system, I'm concerned
this patch series will do this --- consistently.
If the concern is that the inode cache is filled with crap after an
updatedb run, then we should fix *that* problem; we need a way for
programs like updatedb to indicate that they are scanning lots of
inodes, and if the inode wasn't in cache before it was opened, it
should be placed on the short list to be dropped after it's closed.
Making that a new open(2) flag makes a lot of sense. Let's solve the
real problem here, if that's the concern.
But most of the time, I *want* the page cache filled, since it means
less time wasted accessing spinning rust platters. The last thing I
want is some helpful defragmentation kernel thread constantly
wandering through inode caches, and randomly calling
"invalidate_mapping_pages" on inodes since it thinks this will help
defrag huge pages. If I'm not running an Oracle database on my
laptop, but instead am concerned about battery lifetime, this is the
last thing I would want.
- Ted
^ permalink raw reply [flat|nested] 56+ messages in thread
* Re: inodes: Support generic defragmentation
2010-01-31 21:02 ` tytso
@ 2010-02-01 10:17 ` Andi Kleen
2010-02-01 13:47 ` tytso
0 siblings, 1 reply; 56+ messages in thread
From: Andi Kleen @ 2010-02-01 10:17 UTC (permalink / raw)
To: tytso, Andi Kleen, Christoph Lameter, Dave Chinner,
Miklos Szeredi, Alexander Viro, Christoph Hellwig,
Christoph Lameter, Rik van Riel, Pekka Enberg, akpm, Nick Piggin,
Hugh Dickins, linux-kernel
On Sun, Jan 31, 2010 at 04:02:07PM -0500, tytso@mit.edu wrote:
> OK, but in that case, the kick_inodes should check to see if the inode
> is in use in any way (i.e., has dentries open that will tie it down,
> is open, has pages that are dirty or are mapped into some page table)
> before attempting to invalidate any of its pages. The patch as
> currently constituted doesn't do that. It will attempt to drop all
> pages owned by that inode before checking for any of these conditions.
> If I wanted that, I'd just do "echo 3 > /proc/sys/vm/drop_caches".
Yes the patch is more aggressive and probably needs to be fixed.
On the other hand I would like to keep the option to be more aggressive
for soft page offlining where it's useful and nobody cares about
the cost.
> Worse yet, *after* it does this, it tries to write out the pages of
> the inode. #1, this is pointless, since if the inode had any dirty pages,
> they wouldn't have been invalidated, since it calls write_inode_now()
Yes .... fought with all that for hwpoison too.
> I'd go further, and say that it should avoid trying to flush any inode
> if any of its sibling inodes on the slab cache are dirty or in use in
> any way. Otherwise, you end up dropping pages from the page cache and
> still not being able to do any defragmentation.
It depends -- for normal operation when running low on memory I agree
with you.
But for hwpoison soft offline purposes it's better to be more aggressive
-- even if that is inefficient -- the number one priority is still
to be correct, of course.
>
> If the concern is that the inode cache is filled with crap after an
> updatedb run, then we should fix *that* problem; we need a way for
> programs like updatedb to indicate that they are scanning lots of
> inodes, and if the inode wasn't in cache before it was opened, it
> should be placed on the short list to be dropped after it's closed.
This has been tried many times and nobody came up with a good approach
to detect it automatically that doesn't have bad regressions in corner
cases.
Or the "let's add an updatedb hint" approach has the problem that
it won't cover a lot of other programs (as Linus always points out,
these new interfaces rarely actually get used).
> But most of the time, I *want* the page cache filled, since it means
> less time wasted accessing spinning rust platters. The last thing I
> want is a some helpful defragmentation kernel thread constantly
> wandering through inode caches, and randomly calling
The problem this patch series tries to address right now is that
when you run out of memory the kernel tends to blow away your dcache,
because dcache reclaim is just too stupid to actually free
memory without going through most of the LRU list.
So yes, it's all about improving caching. But yes, some
details also need to be improved.
-Andi
--
ak@linux.intel.com -- Speaking for myself only.
^ permalink raw reply [flat|nested] 56+ messages in thread
* Re: inodes: Support generic defragmentation
2010-02-01 10:17 ` Andi Kleen
@ 2010-02-01 13:47 ` tytso
2010-02-01 13:54 ` Andi Kleen
0 siblings, 1 reply; 56+ messages in thread
From: tytso @ 2010-02-01 13:47 UTC (permalink / raw)
To: Andi Kleen
Cc: Christoph Lameter, Dave Chinner, Miklos Szeredi, Alexander Viro,
Christoph Hellwig, Christoph Lameter, Rik van Riel, Pekka Enberg,
akpm, Nick Piggin, Hugh Dickins, linux-kernel
On Mon, Feb 01, 2010 at 11:17:02AM +0100, Andi Kleen wrote:
>
> On the other hand I would like to keep the option to be more aggressive
> for soft page offlining where it's useful and nobody cares about
> the cost.
I'm not sure I understand what the goals are for "soft page
offlining". Can you say a bit more about that?
> Or the "let's add an updatedb hint" approach has the problem that
> it won't cover a lot of other programs (as Linus always points out
> these new interfaces rarely actually get used)
Sure, but the number of programs that scan all of the files in a file
system and would need this sort of hint is actually pretty small.
Updatedb and backup programs are pretty much it.
- Ted
^ permalink raw reply [flat|nested] 56+ messages in thread
* Re: inodes: Support generic defragmentation
2010-02-01 13:47 ` tytso
@ 2010-02-01 13:54 ` Andi Kleen
0 siblings, 0 replies; 56+ messages in thread
From: Andi Kleen @ 2010-02-01 13:54 UTC (permalink / raw)
To: tytso, Andi Kleen, Christoph Lameter, Dave Chinner,
Miklos Szeredi, Alexander Viro, Christoph Hellwig,
Christoph Lameter, Rik van Riel, Pekka Enberg, akpm, Nick Piggin,
Hugh Dickins, linux-kernel
On Mon, Feb 01, 2010 at 08:47:39AM -0500, tytso@mit.edu wrote:
> On Mon, Feb 01, 2010 at 11:17:02AM +0100, Andi Kleen wrote:
> >
> > On the other hand I would like to keep the option to be more aggressive
> > for soft page offlining where it's useful and nobody cares about
> > the cost.
>
> I'm not sure I understand what the goals are for "soft page
> offlining". Can you say a bit more about that?
Predictive offlining of memory pages based on corrected error counts.
This is a useful feature to get more out of lower quality (and even
high quality) DIMMs.
This is already implemented in mcelog+.33ish memory-failure.c, but right
now it's quite dumb when trying to free a dcache/inode page (it basically
always has to blow away everything)
Also this is just one use case. Another would be runtime 2MB
page support by doing targeted freeing (especially
useful with the upcoming transparent huge pages). Probably others
too. I mostly quoted hwpoison because I happen to work on that.
>
> > Or the "let's add an updatedb hint" approach has the problem that
> > it won't cover a lot of other programs (as Linus always points out
> > these new interfaces rarely actually get used)
>
> Sure, but the number of programs that scan all of the files in a file
> system and would need this sort of hint are actually pretty small.
Not sure that's true.
Also consider a file server: how would you get that hint from the
clients?
-Andi
--
ak@linux.intel.com -- Speaking for myself only.
^ permalink raw reply [flat|nested] 56+ messages in thread
* Filesystem: Ext2 filesystem defrag
2010-01-29 20:49 Slab Fragmentation Reduction V15 Christoph Lameter
` (10 preceding siblings ...)
2010-01-29 20:49 ` inodes: Support generic defragmentation Christoph Lameter
@ 2010-01-29 20:49 ` Christoph Lameter
2010-01-29 20:49 ` Filesystem: Ext3 " Christoph Lameter
` (7 subsequent siblings)
19 siblings, 0 replies; 56+ messages in thread
From: Christoph Lameter @ 2010-01-29 20:49 UTC (permalink / raw)
To: Andi Kleen
Cc: Dave Chinner, Christoph Lameter, Rik van Riel, Pekka Enberg, akpm,
Miklos Szeredi, Nick Piggin, Hugh Dickins, linux-kernel
[-- Attachment #1: ext2-defrag --]
[-- Type: text/plain, Size: 1098 bytes --]
Support defragmentation for ext2 filesystem inodes
Reviewed-by: Rik van Riel <riel@redhat.com>
Signed-off-by: Christoph Lameter <clameter@sgi.com>
Signed-off-by: Christoph Lameter <cl@linux-foundation.org>
---
fs/ext2/super.c | 9 +++++++++
1 file changed, 9 insertions(+)
Index: slab-2.6/fs/ext2/super.c
===================================================================
--- slab-2.6.orig/fs/ext2/super.c 2010-01-22 15:09:43.000000000 -0600
+++ slab-2.6/fs/ext2/super.c 2010-01-22 16:20:46.000000000 -0600
@@ -174,6 +174,12 @@ static void init_once(void *foo)
inode_init_once(&ei->vfs_inode);
}
+static void *ext2_get_inodes(struct kmem_cache *s, int nr, void **v)
+{
+ return fs_get_inodes(s, nr, v,
+ offsetof(struct ext2_inode_info, vfs_inode));
+}
+
static int init_inodecache(void)
{
ext2_inode_cachep = kmem_cache_create("ext2_inode_cache",
@@ -183,6 +189,9 @@ static int init_inodecache(void)
init_once);
if (ext2_inode_cachep == NULL)
return -ENOMEM;
+
+ kmem_cache_setup_defrag(ext2_inode_cachep,
+ ext2_get_inodes, kick_inodes);
return 0;
}
--
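Each filesystem's get_inodes hook in this series is the same thin wrapper: it hands fs_get_inodes() the offset of the embedded struct inode inside the filesystem-specific container, so a pointer to the generic inode can be mapped back to its container. A minimal user-space sketch of that pattern (the struct names here are stand-ins, not the kernel's):

```c
#include <assert.h>
#include <stddef.h>

/* Stand-ins for ext2_inode_info and its embedded vfs_inode. */
struct vfs_inode { int ino; };
struct fs_inode_info {
	int fs_private;			/* filesystem-specific state */
	struct vfs_inode vfs_inode;	/* embedded, not a pointer */
};

/* Recover the containing object from a pointer to the embedded
 * member -- the arithmetic the offsetof() argument feeds. */
static struct fs_inode_info *to_fs_info(struct vfs_inode *inode)
{
	return (struct fs_inode_info *)((char *)inode -
			offsetof(struct fs_inode_info, vfs_inode));
}
```

This is the same trick as the kernel's container_of(); slab only ever sees pointers to the embedded member, so the callback must do the subtraction.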
^ permalink raw reply [flat|nested] 56+ messages in thread
* Filesystem: Ext3 filesystem defrag
2010-01-29 20:49 Slab Fragmentation Reduction V15 Christoph Lameter
` (11 preceding siblings ...)
2010-01-29 20:49 ` Filesystem: Ext2 filesystem defrag Christoph Lameter
@ 2010-01-29 20:49 ` Christoph Lameter
2010-01-29 20:49 ` Filesystem: Ext4 " Christoph Lameter
` (6 subsequent siblings)
19 siblings, 0 replies; 56+ messages in thread
From: Christoph Lameter @ 2010-01-29 20:49 UTC (permalink / raw)
To: Andi Kleen
Cc: Dave Chinner, Christoph Lameter, Rik van Riel, Pekka Enberg, akpm,
Miklos Szeredi, Nick Piggin, Hugh Dickins, linux-kernel
[-- Attachment #1: ext3-defrag --]
[-- Type: text/plain, Size: 1101 bytes --]
Support defragmentation for ext3 filesystem inodes
Reviewed-by: Rik van Riel <riel@redhat.com>
Signed-off-by: Christoph Lameter <clameter@sgi.com>
Signed-off-by: Christoph Lameter <cl@linux-foundation.org>
---
fs/ext3/super.c | 8 ++++++++
1 file changed, 8 insertions(+)
Index: linux-next/fs/ext3/super.c
===================================================================
--- linux-next.orig/fs/ext3/super.c 2008-08-11 07:42:10.348607875 -0700
+++ linux-next/fs/ext3/super.c 2008-08-11 07:47:05.042348829 -0700
@@ -484,6 +484,12 @@ static void init_once(void *foo)
inode_init_once(&ei->vfs_inode);
}
+static void *ext3_get_inodes(struct kmem_cache *s, int nr, void **v)
+{
+ return fs_get_inodes(s, nr, v,
+ offsetof(struct ext3_inode_info, vfs_inode));
+}
+
static int init_inodecache(void)
{
ext3_inode_cachep = kmem_cache_create("ext3_inode_cache",
@@ -493,6 +499,8 @@ static int init_inodecache(void)
init_once);
if (ext3_inode_cachep == NULL)
return -ENOMEM;
+ kmem_cache_setup_defrag(ext3_inode_cachep,
+ ext3_get_inodes, kick_inodes);
return 0;
}
--
^ permalink raw reply [flat|nested] 56+ messages in thread
* Filesystem: Ext4 filesystem defrag
2010-01-29 20:49 Slab Fragmentation Reduction V15 Christoph Lameter
` (12 preceding siblings ...)
2010-01-29 20:49 ` Filesystem: Ext3 " Christoph Lameter
@ 2010-01-29 20:49 ` Christoph Lameter
2010-01-29 20:49 ` Filesystem: XFS slab defragmentation Christoph Lameter
` (5 subsequent siblings)
19 siblings, 0 replies; 56+ messages in thread
From: Christoph Lameter @ 2010-01-29 20:49 UTC (permalink / raw)
To: Andi Kleen
Cc: Dave Chinner, Christoph Lameter, Rik van Riel, Pekka Enberg, akpm,
Miklos Szeredi, Nick Piggin, Hugh Dickins, linux-kernel
[-- Attachment #1: ext4-defrag --]
[-- Type: text/plain, Size: 1095 bytes --]
Support defragmentation for ext4 filesystem inodes
Reviewed-by: Rik van Riel <riel@redhat.com>
Signed-off-by: Christoph Lameter <clameter@sgi.com>
Signed-off-by: Christoph Lameter <cl@linux-foundation.org>
---
fs/ext4/super.c | 8 ++++++++
1 file changed, 8 insertions(+)
Index: slab-2.6/fs/ext4/super.c
===================================================================
--- slab-2.6.orig/fs/ext4/super.c 2010-01-22 15:09:43.000000000 -0600
+++ slab-2.6/fs/ext4/super.c 2010-01-22 16:21:07.000000000 -0600
@@ -741,6 +741,12 @@ static void init_once(void *foo)
inode_init_once(&ei->vfs_inode);
}
+static void *ext4_get_inodes(struct kmem_cache *s, int nr, void **v)
+{
+ return fs_get_inodes(s, nr, v,
+ offsetof(struct ext4_inode_info, vfs_inode));
+}
+
static int init_inodecache(void)
{
ext4_inode_cachep = kmem_cache_create("ext4_inode_cache",
@@ -750,6 +756,8 @@ static int init_inodecache(void)
init_once);
if (ext4_inode_cachep == NULL)
return -ENOMEM;
+ kmem_cache_setup_defrag(ext4_inode_cachep,
+ ext4_get_inodes, kick_inodes);
return 0;
}
--
^ permalink raw reply [flat|nested] 56+ messages in thread
* Filesystem: XFS slab defragmentation
2010-01-29 20:49 Slab Fragmentation Reduction V15 Christoph Lameter
` (13 preceding siblings ...)
2010-01-29 20:49 ` Filesystem: Ext4 " Christoph Lameter
@ 2010-01-29 20:49 ` Christoph Lameter
2010-01-29 20:49 ` Filesystems: /proc filesystem support for slab defrag Christoph Lameter
` (4 subsequent siblings)
19 siblings, 0 replies; 56+ messages in thread
From: Christoph Lameter @ 2010-01-29 20:49 UTC (permalink / raw)
To: Andi Kleen
Cc: Dave Chinner, Christoph Lameter, Rik van Riel, Pekka Enberg, akpm,
Miklos Szeredi, Nick Piggin, Hugh Dickins, linux-kernel
[-- Attachment #1: xfs_defrag --]
[-- Type: text/plain, Size: 818 bytes --]
Support inode defragmentation for xfs
Reviewed-by: Rik van Riel <riel@redhat.com>
Signed-off-by: Christoph Lameter <clameter@sgi.com>
Signed-off-by: Christoph Lameter <cl@linux-foundation.org>
---
fs/xfs/linux-2.6/xfs_super.c | 2 ++
1 file changed, 2 insertions(+)
Index: slab-2.6/fs/xfs/linux-2.6/xfs_super.c
===================================================================
--- slab-2.6.orig/fs/xfs/linux-2.6/xfs_super.c 2010-01-22 15:09:43.000000000 -0600
+++ slab-2.6/fs/xfs/linux-2.6/xfs_super.c 2010-01-22 16:23:51.000000000 -0600
@@ -1608,6 +1608,8 @@ xfs_init_zones(void)
if (!xfs_ioend_zone)
goto out;
+ kmem_cache_setup_defrag(xfs_vnode_zone, get_inodes, kick_inodes);
+
xfs_ioend_pool = mempool_create_slab_pool(4 * MAX_BUF_PER_PAGE,
xfs_ioend_zone);
if (!xfs_ioend_pool)
--
^ permalink raw reply [flat|nested] 56+ messages in thread
* Filesystems: /proc filesystem support for slab defrag
2010-01-29 20:49 Slab Fragmentation Reduction V15 Christoph Lameter
` (14 preceding siblings ...)
2010-01-29 20:49 ` Filesystem: XFS slab defragmentation Christoph Lameter
@ 2010-01-29 20:49 ` Christoph Lameter
2010-01-29 20:49 ` dentries: dentry defragmentation Christoph Lameter
` (3 subsequent siblings)
19 siblings, 0 replies; 56+ messages in thread
From: Christoph Lameter @ 2010-01-29 20:49 UTC (permalink / raw)
To: Andi Kleen
Cc: Dave Chinner, Alexey Dobriyan, Christoph Lameter, Rik van Riel,
Pekka Enberg, akpm, Miklos Szeredi, Nick Piggin, Hugh Dickins,
linux-kernel
[-- Attachment #1: defrag_proc --]
[-- Type: text/plain, Size: 1186 bytes --]
Support procfs inode defragmentation
Cc: Alexey Dobriyan <adobriyan@sw.ru>
Reviewed-by: Rik van Riel <riel@redhat.com>
Signed-off-by: Christoph Lameter <clameter@sgi.com>
Signed-off-by: Christoph Lameter <cl@linux-foundation.org>
---
fs/proc/inode.c | 8 ++++++++
1 file changed, 8 insertions(+)
Index: linux-2.6/fs/proc/inode.c
===================================================================
--- linux-2.6.orig/fs/proc/inode.c 2009-12-18 13:13:24.000000000 -0600
+++ linux-2.6/fs/proc/inode.c 2010-01-29 10:33:22.000000000 -0600
@@ -77,6 +77,12 @@ static void init_once(void *foo)
inode_init_once(&ei->vfs_inode);
}
+static void *proc_get_inodes(struct kmem_cache *s, int nr, void **v)
+{
+ return fs_get_inodes(s, nr, v,
+ offsetof(struct proc_inode, vfs_inode));
+}
+
void __init proc_init_inodecache(void)
{
proc_inode_cachep = kmem_cache_create("proc_inode_cache",
@@ -84,6 +90,8 @@ void __init proc_init_inodecache(void)
0, (SLAB_RECLAIM_ACCOUNT|
SLAB_MEM_SPREAD|SLAB_PANIC),
init_once);
+ kmem_cache_setup_defrag(proc_inode_cachep,
+ proc_get_inodes, kick_inodes);
}
static const struct super_operations proc_sops = {
--
^ permalink raw reply [flat|nested] 56+ messages in thread
* dentries: dentry defragmentation
2010-01-29 20:49 Slab Fragmentation Reduction V15 Christoph Lameter
` (15 preceding siblings ...)
2010-01-29 20:49 ` Filesystems: /proc filesystem support for slab defrag Christoph Lameter
@ 2010-01-29 20:49 ` Christoph Lameter
2010-01-29 22:00 ` Al Viro
2010-01-29 20:49 ` slub defrag: Transition patch upstream -> -next Christoph Lameter
` (2 subsequent siblings)
19 siblings, 1 reply; 56+ messages in thread
From: Christoph Lameter @ 2010-01-29 20:49 UTC (permalink / raw)
To: Andi Kleen
Cc: Dave Chinner, Alexander Viro, Christoph Hellwig,
Christoph Lameter, Rik van Riel, Pekka Enberg, akpm,
Miklos Szeredi, Nick Piggin, Hugh Dickins, linux-kernel
[-- Attachment #1: defrag_dentry --]
[-- Type: text/plain, Size: 4297 bytes --]
The dentry pruning for unused entries works in a straightforward way. It
could be made more aggressive if one would actually move dentries instead
of just reclaiming them.
Cc: Alexander Viro <viro@ftp.linux.org.uk>
Cc: Christoph Hellwig <hch@infradead.org>
Reviewed-by: Rik van Riel <riel@redhat.com>
Signed-off-by: Christoph Lameter <clameter@sgi.com>
Signed-off-by: Christoph Lameter <cl@linux-foundation.org>
---
fs/dcache.c | 101 +++++++++++++++++++++++++++++++++++++++++++++++++++++++++++-
1 file changed, 100 insertions(+), 1 deletion(-)
Index: linux-2.6/fs/dcache.c
===================================================================
--- linux-2.6.orig/fs/dcache.c 2009-12-18 13:13:24.000000000 -0600
+++ linux-2.6/fs/dcache.c 2010-01-29 12:10:37.000000000 -0600
@@ -33,6 +33,7 @@
#include <linux/bootmem.h>
#include <linux/fs_struct.h>
#include <linux/hardirq.h>
+#include <linux/backing-dev.h>
#include "internal.h"
int sysctl_vfs_cache_pressure __read_mostly = 100;
@@ -173,7 +174,10 @@ static struct dentry *d_kill(struct dent
list_del(&dentry->d_u.d_child);
dentry_stat.nr_dentry--; /* For d_free, below */
- /*drops the locks, at that point nobody can reach this dentry */
+ /*
+ * drops the locks, at that point nobody (aside from defrag)
+ * can reach this dentry
+ */
dentry_iput(dentry);
if (IS_ROOT(dentry))
parent = NULL;
@@ -2263,6 +2267,100 @@ static void __init dcache_init_early(voi
INIT_HLIST_HEAD(&dentry_hashtable[loop]);
}
+/*
+ * The slab allocator is holding off frees. We can safely examine
+ * the object without the danger of it vanishing from under us.
+ */
+static void *get_dentries(struct kmem_cache *s, int nr, void **v)
+{
+ struct dentry *dentry;
+ int i;
+
+ spin_lock(&dcache_lock);
+ for (i = 0; i < nr; i++) {
+ dentry = v[i];
+
+ /*
+ * Three sorts of dentries cannot be reclaimed:
+ *
+ * 1. dentries that are in the process of being allocated
+ * or being freed. In that case the dentry is neither
+ * on the LRU nor hashed.
+ *
+ * 2. Fake hashed entries as used for anonymous dentries
+ * and pipe I/O. The fake hashed entries have d_flags
+ * set to indicate a hashed entry. However, the
+ * d_hash field indicates that the entry is not hashed.
+ *
+ * 3. dentries that have a backing store that is not
+ * writable. This is true for tmpfs and other in
+ * memory filesystems. Removing dentries from them
+ * would lose dentries for good.
+ */
+ if ((d_unhashed(dentry) && list_empty(&dentry->d_lru)) ||
+ (!d_unhashed(dentry) && hlist_unhashed(&dentry->d_hash)) ||
+ (dentry->d_inode &&
+ !mapping_cap_writeback_dirty(dentry->d_inode->i_mapping)))
+ /* Ignore this dentry */
+ v[i] = NULL;
+ else
+ /* dget_locked will remove the dentry from the LRU */
+ dget_locked(dentry);
+ }
+ spin_unlock(&dcache_lock);
+ return NULL;
+}
+
+/*
+ * Slab has dropped all the locks. Get rid of the refcount obtained
+ * earlier and also free the object.
+ */
+static void kick_dentries(struct kmem_cache *s,
+ int nr, void **v, void *private)
+{
+ struct dentry *dentry;
+ int i;
+
+ /*
+ * First invalidate the dentries without holding the dcache lock
+ */
+ for (i = 0; i < nr; i++) {
+ dentry = v[i];
+
+ if (dentry)
+ d_invalidate(dentry);
+ }
+
+ /*
+ * If we are the last one holding a reference then the dentries can
+ * be freed. We need the dcache_lock.
+ */
+ spin_lock(&dcache_lock);
+ for (i = 0; i < nr; i++) {
+ dentry = v[i];
+ if (!dentry)
+ continue;
+
+ spin_lock(&dentry->d_lock);
+ if (atomic_read(&dentry->d_count) > 1) {
+ spin_unlock(&dentry->d_lock);
+ spin_unlock(&dcache_lock);
+ dput(dentry);
+ spin_lock(&dcache_lock);
+ continue;
+ }
+
+ prune_one_dentry(dentry);
+ }
+ spin_unlock(&dcache_lock);
+
+ /*
+ * dentries are freed using RCU so we need to wait until RCU
+ * operations are complete.
+ */
+ synchronize_rcu();
+}
+
static void __init dcache_init(void)
{
int loop;
@@ -2276,6 +2374,7 @@ static void __init dcache_init(void)
SLAB_RECLAIM_ACCOUNT|SLAB_PANIC|SLAB_MEM_SPREAD);
register_shrinker(&dcache_shrinker);
+ kmem_cache_setup_defrag(dentry_cache, get_dentries, kick_dentries);
/* Hash may have been set up in dcache_init_early */
if (!hashdist)
--
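The get/kick pair registered above follows a two-phase protocol: get_dentries() pins reclaimable objects under dcache_lock and NULLs out the ones it must skip; kick_dentries() then runs with slab's locks dropped, invalidates, and releases the pins. A stripped-down user-space model of that protocol (the struct and refcount here are illustrative, not the kernel's API):

```c
#include <assert.h>
#include <stdlib.h>

struct obj {
	int refcount;	/* >0 while someone else holds it */
	int freed;
};

/* Phase 1 (cf. get_dentries): under the subsystem lock, pin every
 * object we can reclaim and NULL out the ones we must skip. */
static void get_objs(struct obj **v, int nr)
{
	for (int i = 0; i < nr; i++) {
		if (v[i]->refcount > 0)
			v[i] = NULL;		/* busy: ignore */
		else
			v[i]->refcount++;	/* our pin */
	}
}

/* Phase 2 (cf. kick_dentries): locks dropped; release our pin and
 * free objects nobody else picked up in the meantime. */
static void kick_objs(struct obj **v, int nr)
{
	for (int i = 0; i < nr; i++) {
		if (!v[i])
			continue;
		if (--v[i]->refcount == 0)
			v[i]->freed = 1;
	}
}
```

Splitting the pin from the reclaim is what lets slab drop its own locks between the two calls; the pin taken in phase one is what keeps the object from vanishing in between.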
^ permalink raw reply [flat|nested] 56+ messages in thread
* Re: dentries: dentry defragmentation
2010-01-29 20:49 ` dentries: dentry defragmentation Christoph Lameter
@ 2010-01-29 22:00 ` Al Viro
2010-02-01 7:08 ` Nick Piggin
0 siblings, 1 reply; 56+ messages in thread
From: Al Viro @ 2010-01-29 22:00 UTC (permalink / raw)
To: Christoph Lameter
Cc: Andi Kleen, Dave Chinner, Alexander Viro, Christoph Hellwig,
Christoph Lameter, Rik van Riel, Pekka Enberg, akpm,
Miklos Szeredi, Nick Piggin, Hugh Dickins, linux-kernel
On Fri, Jan 29, 2010 at 02:49:48PM -0600, Christoph Lameter wrote:
> + if ((d_unhashed(dentry) && list_empty(&dentry->d_lru)) ||
> + (!d_unhashed(dentry) && hlist_unhashed(&dentry->d_hash)) ||
> + (dentry->d_inode &&
> + !mapping_cap_writeback_dirty(dentry->d_inode->i_mapping)))
> + /* Ignore this dentry */
> + v[i] = NULL;
> + else
> + /* dget_locked will remove the dentry from the LRU */
> + dget_locked(dentry);
> + }
> + spin_unlock(&dcache_lock);
> + return NULL;
> +}
No. As a matter of fact - fuck, no. For one thing, it's going to race
with umount. For another, kicking busy dentry out of hash is worse than
useless - you are just asking to get more and more copies of that sucker
in dcache. This is fundamentally bogus, especially since there is a 100%
safe time for killing dentry - when dput() drives the refcount to 0 and
you *are* doing dput() on the references you've acquired. If anything, I'd
suggest setting a flag that would trigger immediate freeing on the final
dput().
And that does not cover the umount races. You *can't* go around grabbing
dentries without making sure that superblock won't be shut down under
you. And no, I don't know how to deal with that cleanly - simply bumping
superblock ->s_count under sb_lock is enough to make sure it's not freed
under you, but what you want is more than that. An active reference would
be enough, except that you'd get sudden "oh, sorry, now there's no way
to make sure that superblock is shut down at umount(2), no matter what kind
of setup you have". So you really need to get ->s_umount held shared,
which is not particularly locking-order-friendly, to put it mildly.
^ permalink raw reply [flat|nested] 56+ messages in thread
* Re: dentries: dentry defragmentation
2010-01-29 22:00 ` Al Viro
@ 2010-02-01 7:08 ` Nick Piggin
2010-02-01 10:10 ` Andi Kleen
0 siblings, 1 reply; 56+ messages in thread
From: Nick Piggin @ 2010-02-01 7:08 UTC (permalink / raw)
To: Al Viro
Cc: Christoph Lameter, Andi Kleen, Dave Chinner, Alexander Viro,
Christoph Hellwig, Christoph Lameter, Rik van Riel, Pekka Enberg,
akpm, Miklos Szeredi, Nick Piggin, Hugh Dickins, linux-kernel
On Fri, Jan 29, 2010 at 10:00:44PM +0000, Al Viro wrote:
> On Fri, Jan 29, 2010 at 02:49:48PM -0600, Christoph Lameter wrote:
> > + if ((d_unhashed(dentry) && list_empty(&dentry->d_lru)) ||
> > + (!d_unhashed(dentry) && hlist_unhashed(&dentry->d_hash)) ||
> > + (dentry->d_inode &&
> > + !mapping_cap_writeback_dirty(dentry->d_inode->i_mapping)))
> > + /* Ignore this dentry */
> > + v[i] = NULL;
> > + else
> > + /* dget_locked will remove the dentry from the LRU */
> > + dget_locked(dentry);
> > + }
> > + spin_unlock(&dcache_lock);
> > + return NULL;
> > +}
>
> No. As a matter of fact - fuck, no. For one thing, it's going to race
> with umount. For another, kicking busy dentry out of hash is worse than
> useless - you are just asking to get more and more copies of that sucker
> in dcache. This is fundamentally bogus, especially since there is a 100%
> safe time for killing dentry - when dput() drives the refcount to 0 and
> you *are* doing dput() on the references you've acquired. If anything, I'd
> suggest setting a flag that would trigger immediate freeing on the final
> dput().
>
> And that does not cover the umount races. You *can't* go around grabbing
> dentries without making sure that superblock won't be shut down under
> you. And no, I don't know how to deal with that cleanly - simply bumping
> superblock ->s_count under sb_lock is enough to make sure it's not freed
> under you, but what you want is more than that. An active reference would
> be enough, except that you'd get sudden "oh, sorry, now there's no way
> to make sure that superblock is shut down at umount(2), no matter what kind
> of setup you have". So you really need to get ->s_umount held shared,
> which is not particularly locking-order-friendly, to put it mildly.
I always preferred to do defrag in the opposite way. Ie. query the
slab allocator from existing shrinkers rather than opposite way
around. This lets you reuse more of the locking and refcounting etc.
So you have a pin on the object somehow via the normal shrinker path,
and therefore you get a pin on the underlying slab. I would just like
to see even performance of a real simple approach that just asks
whether we are in this slab defrag mode, and if so, whether the slab
is very sparse. If yes, then reclaim aggressively.
If that doesn't perform well enough and you have to go further and
discover objects on the same slab, then it does get a bit more
tricky because:
- you need the pin on the first object in order to discover more
- discovered objects may not be expected in the existing shrinker
code that just picks objects off LRUs
However your code already has to handle the 2nd case anyway, and for
the 1st case it is probably not too hard to do with dcache/icache. And
in either case you seem to avoid the worst of the sleeping and lock
ordering and slab inversion problems of your ->get approach.
But I'm really interested to see numbers, and especially numbers of
the simpler approaches before adding this complexity.
^ permalink raw reply [flat|nested] 56+ messages in thread
* Re: dentries: dentry defragmentation
2010-02-01 7:08 ` Nick Piggin
@ 2010-02-01 10:10 ` Andi Kleen
2010-02-01 10:16 ` Nick Piggin
0 siblings, 1 reply; 56+ messages in thread
From: Andi Kleen @ 2010-02-01 10:10 UTC (permalink / raw)
To: Nick Piggin
Cc: Al Viro, Christoph Lameter, Andi Kleen, Dave Chinner,
Alexander Viro, Christoph Hellwig, Christoph Lameter,
Rik van Riel, Pekka Enberg, akpm, Miklos Szeredi, Nick Piggin,
Hugh Dickins, linux-kernel
On Mon, Feb 01, 2010 at 06:08:35PM +1100, Nick Piggin wrote:
> I always preferred to do defrag in the opposite way. Ie. query the
> slab allocator from existing shrinkers rather than opposite way
> around. This lets you reuse more of the locking and refcounting etc.
I looked at this for hwpoison soft offline.
But it works really badly because the LRU list ordering
has nothing to do with the actual ordering inside the slab pages.
Christoph's basic approach is more efficient.
> So you have a pin on the object somehow via the normal shrinker path,
> and therefore you get a pin on the underlying slab. I would just like
> to see even performance of a real simple approach that just asks
> whether we are in this slab defrag mode, and if so, whether the slab
> is very sparse. If yes, then reclaim aggressively.
The typical result is that you need to get through most of the LRU
list (and prune them all) just to free the page.
>
> If that doesn't perform well enough and you have to go further and
It doesn't.
-Andi
--
ak@linux.intel.com -- Speaking for myself only.
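Andi's objection can be made concrete with a toy simulation (illustrative user-space C, not kernel code, with made-up page and object counts): when objects are scattered uniformly across slab pages, freeing in LRU order must walk a large fraction of the list before any single page becomes completely empty.

```c
#include <assert.h>

#define PAGES 64
#define OBJS_PER_PAGE 16
#define NOBJS (PAGES * OBJS_PER_PAGE)

/* Small deterministic PRNG so the shuffle is reproducible. */
static unsigned int lcg_next(unsigned int *state)
{
	*state = *state * 1664525u + 1013904223u;
	return *state >> 16;
}

/* Free objects in a shuffled (LRU-like) order; return how many
 * frees happened before the first slab page became fully empty. */
static int frees_until_first_empty_page(unsigned int seed)
{
	int page_of[NOBJS], live[PAGES], order[NOBJS];

	for (int i = 0; i < NOBJS; i++) {
		page_of[i] = i / OBJS_PER_PAGE;	/* physical layout */
		order[i] = i;
	}
	for (int p = 0; p < PAGES; p++)
		live[p] = OBJS_PER_PAGE;

	/* LRU order has nothing to do with page layout: model it
	 * as a uniform shuffle of the objects. */
	for (int i = NOBJS - 1; i > 0; i--) {
		int j = (int)(lcg_next(&seed) % (unsigned int)(i + 1));
		int t = order[i]; order[i] = order[j]; order[j] = t;
	}

	for (int n = 0; n < NOBJS; n++)
		if (--live[page_of[order[n]]] == 0)
			return n + 1;	/* first whole page freed here */
	return NOBJS;
}
```

With 64 pages of 16 objects each, far more than a quarter of the LRU list must typically be pruned before the first page can be returned, which is the inefficiency Andi describes.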
^ permalink raw reply [flat|nested] 56+ messages in thread
* Re: dentries: dentry defragmentation
2010-02-01 10:10 ` Andi Kleen
@ 2010-02-01 10:16 ` Nick Piggin
2010-02-01 10:22 ` Andi Kleen
0 siblings, 1 reply; 56+ messages in thread
From: Nick Piggin @ 2010-02-01 10:16 UTC (permalink / raw)
To: Andi Kleen
Cc: Al Viro, Christoph Lameter, Dave Chinner, Alexander Viro,
Christoph Hellwig, Christoph Lameter, Rik van Riel, Pekka Enberg,
akpm, Miklos Szeredi, Nick Piggin, Hugh Dickins, linux-kernel
On Mon, Feb 01, 2010 at 11:10:13AM +0100, Andi Kleen wrote:
> On Mon, Feb 01, 2010 at 06:08:35PM +1100, Nick Piggin wrote:
> > I always preferred to do defrag in the opposite way. Ie. query the
> > slab allocator from existing shrinkers rather than opposite way
> > around. This lets you reuse more of the locking and refcounting etc.
>
> I looked at this for hwpoison soft offline.
>
> But it works really badly because the LRU list ordering
> has nothing to do with the actual ordering inside the slab pages.
No, you don't *have* to follow LRU order. The most important thing,
if you followed what I wrote, is to get a pin on the objects and
the slabs via the regular shrinker path first, then query slab
rather than calling into all these subsystems from an atomic and
non-slab-reentrant path.
Following LRU order would just be the first and simplest cut at
this.
> Christoph's basic approach is more efficient.
I want to see numbers because it is also the far more complex
approach.
> > So you have a pin on the object somehow via the normal shrinker path,
> > and therefore you get a pin on the underlying slab. I would just like
> > to see even performance of a real simple approach that just asks
> > whether we are in this slab defrag mode, and if so, whether the slab
> > is very sparse. If yes, then reclaim aggressively.
>
> The typical result is that you need to get through most of the LRU
> list (and prune them all) just to free the page.
Really? If you have a large proportion of slabs which are quite
internally fragmented, then I would have thought it would give a
significant improvement (aggressive reclaim, that is).
> > If that doesn't perform well enough and you have to go further and
>
> It doesn't.
Can we see your numbers? And the patches you tried?
Thanks,
Nick
^ permalink raw reply [flat|nested] 56+ messages in thread
* Re: dentries: dentry defragmentation
2010-02-01 10:16 ` Nick Piggin
@ 2010-02-01 10:22 ` Andi Kleen
2010-02-01 10:35 ` Nick Piggin
0 siblings, 1 reply; 56+ messages in thread
From: Andi Kleen @ 2010-02-01 10:22 UTC (permalink / raw)
To: Nick Piggin
Cc: Andi Kleen, Al Viro, Christoph Lameter, Dave Chinner,
Alexander Viro, Christoph Hellwig, Christoph Lameter,
Rik van Riel, Pekka Enberg, akpm, Miklos Szeredi, Nick Piggin,
Hugh Dickins, linux-kernel
On Mon, Feb 01, 2010 at 09:16:45PM +1100, Nick Piggin wrote:
> On Mon, Feb 01, 2010 at 11:10:13AM +0100, Andi Kleen wrote:
> > On Mon, Feb 01, 2010 at 06:08:35PM +1100, Nick Piggin wrote:
> > > I always preferred to do defrag in the opposite way. Ie. query the
> > > slab allocator from existing shrinkers rather than opposite way
> > > around. This lets you reuse more of the locking and refcounting etc.
> >
> > I looked at this for hwpoison soft offline.
> >
> > But it works really badly because the LRU list ordering
> > has nothing to do with the actual ordering inside the slab pages.
>
> No, you don't *have* to follow LRU order. The most important thing
What list would you follow then?
There's LRU, there's hash (which is just as random) and there's slab
itself. The only one that is guaranteed to match the physical
layout in memory is slab. That is what this patchkit
attempts.
> is if you followed what I wrote is to get a pin on the objects and
Which objects? You first need to collect all that belong to a page.
How else would you do that?
> > > whether we are in this slab defrag mode, and if so, whether the slab
> > > is very sparse. If yes, then reclaim aggressively.
> >
> > The typical result is that you need to get through most of the LRU
> > list (and prune them all) just to free the page.
>
> Really? If you have a large proportion of slabs which are quite
> internally fragmented, then I would have thought it would give a
> significant improvement (aggressive reclaim, that is)
You wrote the same as me?
>
>
> > > If that doesn't perform well enough and you have to go further and
> >
> > It doesn't.
>
> Can we see your numbers? And the patches you tried?
What I tried (in some dirty patches you probably don't want to see)
was to just implement slab shrinking for a single page for soft hwpoison.
But it didn't work too well because it couldn't free the objects
still actually in the dcache.
Then I called the shrinker and tried to pass in the page as a hint
and drop only objects on that page, but I realized that it's terribly
inefficient to do it this way.
Now soft hwpoison doesn't care about a little inefficiency, but I still
didn't want to be terribly inefficient.
That is why I asked Christoph to repost his old patchkit that can
do the shrink from the slab side (which is the right order here).
BTW, the other potential user for this would be defragmentation
for large page allocation.
-Andi
--
ak@linux.intel.com -- Speaking for myself only.
^ permalink raw reply [flat|nested] 56+ messages in thread
* Re: dentries: dentry defragmentation
2010-02-01 10:22 ` Andi Kleen
@ 2010-02-01 10:35 ` Nick Piggin
2010-02-01 10:45 ` Andi Kleen
0 siblings, 1 reply; 56+ messages in thread
From: Nick Piggin @ 2010-02-01 10:35 UTC (permalink / raw)
To: Andi Kleen
Cc: Al Viro, Christoph Lameter, Dave Chinner, Alexander Viro,
Christoph Hellwig, Christoph Lameter, Rik van Riel, Pekka Enberg,
akpm, Miklos Szeredi, Nick Piggin, Hugh Dickins, linux-kernel
On Mon, Feb 01, 2010 at 11:22:53AM +0100, Andi Kleen wrote:
> On Mon, Feb 01, 2010 at 09:16:45PM +1100, Nick Piggin wrote:
> > On Mon, Feb 01, 2010 at 11:10:13AM +0100, Andi Kleen wrote:
> > > On Mon, Feb 01, 2010 at 06:08:35PM +1100, Nick Piggin wrote:
> > > > I always preferred to do defrag in the opposite way. Ie. query the
> > > > slab allocator from existing shrinkers rather than opposite way
> > > > around. This lets you reuse more of the locking and refcounting etc.
> > >
> > > I looked at this for hwpoison soft offline.
> > >
> > > But it works really badly because the LRU list ordering
> > > has nothing to do with the actual ordering inside the slab pages.
> >
> > No, you don't *have* to follow LRU order. The most important thing
>
> What list would you follow then?
You can follow the slab, as I said in the first mail.
> There's LRU, there's hash (which is just as random) and there's slab
> itself. The only one that is guaranteed to match the physical
> layout in memory is slab. That is what this patchkit
> attempts.
>
> > is if you followed what I wrote is to get a pin on the objects and
>
> Which objects? You first need to collect all that belong to a page.
> How else would you do that?
Objects that you're interested in reclaiming, I guess. I don't
understand the question.
> > > > whether we are in this slab defrag mode, and if so, whether the slab
> > > > is very sparse. If yes, then reclaim aggressively.
> > >
> > > The typical result is that you need to get through most of the LRU
> > > list (and prune them all) just to free the page.
> >
> > Really? If you have a large proportion of slabs which are quite
> > internally fragmented, then I would have thought it would give a
> > significant improvement (aggressive reclaim, that is)
>
>
> You wrote the same as me?
Aggressive reclaim: as in, ignoring the referenced bit on the LRU,
*possibly* even trying to actively invalidate the dentry.
> > > > If that doesn't perform well enough and you have to go further and
> > >
> > > It doesn't.
> >
> > Can we see your numbers? And the patches you tried?
>
> What I tried (in some dirty patches you probably don't want to see)
> was to just implement slab shrinking for a single page for soft hwpoison.
> But it didn't work too well because it couldn't free the objects
> still actually in the dcache.
>
> Then I called the shrinker and tried to pass in the page as a hint
> and drop only objects on that page, but I realized that it's terribly
> inefficient to do it this way.
>
> Now soft hwpoison doesn't care about a little inefficiency, but I still
> didn't want to be terribly inefficient.
>
> That is why I asked Christoph to repost his old patchkit that can
> do the shrink from the slab side (which is the right order here)
Right, but as you can see it is complex to do it this way. And I
think for reclaim-driven targeted reclaim it needn't be so
inefficient, because you aren't restricted to just one page but
can go after any page which is heavily fragmented (and by definition there
should be a lot of them in the system).
Hwpoison I don't think adds much weight, frankly. Just panic and
reboot if you get unrecoverable error. We have everything to handle
that so I can't see how it's worth adding much complexity to the
kernel for.
^ permalink raw reply [flat|nested] 56+ messages in thread
* Re: dentries: dentry defragmentation
2010-02-01 10:35 ` Nick Piggin
@ 2010-02-01 10:45 ` Andi Kleen
2010-02-01 10:56 ` Nick Piggin
0 siblings, 1 reply; 56+ messages in thread
From: Andi Kleen @ 2010-02-01 10:45 UTC (permalink / raw)
To: Nick Piggin
Cc: Andi Kleen, Al Viro, Christoph Lameter, Dave Chinner,
Alexander Viro, Christoph Hellwig, Christoph Lameter,
Rik van Riel, Pekka Enberg, akpm, Miklos Szeredi, Nick Piggin,
Hugh Dickins, linux-kernel
On Mon, Feb 01, 2010 at 09:35:26PM +1100, Nick Piggin wrote:
> > > > > I always preferred to do defrag in the opposite way. Ie. query the
> > > > > slab allocator from existing shrinkers rather than opposite way
> > > > > around. This lets you reuse more of the locking and refcounting etc.
> > > >
> > > > I looked at this for hwpoison soft offline.
> > > >
> > > > But it works really badly because the LRU list ordering
> > > > has nothing to do with the actual ordering inside the slab pages.
> > >
> > > No, you don't *have* to follow LRU order. The most important thing
> >
> > What list would you follow then?
>
> You can follow the slab, as I said in the first mail.
That's pretty much what Christoph's patchkit is about (with, yes, some
details improved).
>
> > There's LRU, there's hash (which is just as random) and there's slab
> > itself. The only one that is guaranteed to match the physical
> > layout in memory is slab. That is what this patchkit
> > attempts.
> >
> > > is if you followed what I wrote is to get a pin on the objects and
> >
> > Which objects? You first need to collect all that belong to a page.
> > How else would you do that?
>
> Objects that you're interested in reclaiming, I guess. I don't
> understand the question.
Objects that are in the same page.
There are really two different cases here:
- Run out of memory: in this case I just want to find all the objects
of any page, ideally of not that recently used pages.
- I am very fragmented and want a specific page freed to get a 2MB
region back or for hwpoison: same, but do it for a specific page.
> Right, but as you can see it is complex to do it this way. And I
> think for reclaim driven targeted reclaim, then it needn't be so
> inefficient because you aren't restricted to just one page, but
> in any page which is heavily fragmented (and by definition there
> should be a lot of them in the system).
Assuming you can identify them quickly.
>
> Hwpoison I don't think adds much weight, frankly. Just panic and
> reboot if you get unrecoverable error. We have everything to handle
This is for soft hwpoison: offlining pages that might go bad
in the future.
But soft hwpoison isn't the only user. The other big one would
be for large pages or other large allocations.
-Andi
--
ak@linux.intel.com -- Speaking for myself only.
^ permalink raw reply [flat|nested] 56+ messages in thread
* Re: dentries: dentry defragmentation
2010-02-01 10:45 ` Andi Kleen
@ 2010-02-01 10:56 ` Nick Piggin
2010-02-01 13:25 ` Andi Kleen
0 siblings, 1 reply; 56+ messages in thread
From: Nick Piggin @ 2010-02-01 10:56 UTC (permalink / raw)
To: Andi Kleen
Cc: Al Viro, Christoph Lameter, Dave Chinner, Alexander Viro,
Christoph Hellwig, Christoph Lameter, Rik van Riel, Pekka Enberg,
akpm, Miklos Szeredi, Nick Piggin, Hugh Dickins, linux-kernel
On Mon, Feb 01, 2010 at 11:45:44AM +0100, Andi Kleen wrote:
> On Mon, Feb 01, 2010 at 09:35:26PM +1100, Nick Piggin wrote:
> > > > > > I always preferred to do defrag in the opposite way. Ie. query the
> > > > > > slab allocator from existing shrinkers rather than opposite way
> > > > > > around. This lets you reuse more of the locking and refcounting etc.
> > > > >
> > > > > I looked at this for hwpoison soft offline.
> > > > >
> > > > > But it works really badly because the LRU list ordering
> > > > > has nothing to do with the actual ordering inside the slab pages.
> > > >
> > > > No, you don't *have* to follow LRU order. The most important thing
> > >
> > > What list would you follow then?
> >
> > You can follow the slab, as I said in the first mail.
>
> That's pretty much what Christoph's patchkit is about (with yes some details
> improved)
I know what the patch is about. Can you re-read my first mail?
> > > There's LRU, there's hash (which is just as random) and there's slab
> > > itself. The only one that is guaranteed to match the physical
> > > layout in memory is slab. That is what this patchkit
> > > attempts.
> > >
> > > > is if you followed what I wrote is to get a pin on the objects and
> > >
> > > Which objects? You first need to collect all that belong to a page.
> > > How else would you do that?
> >
> > Objects that you're interested in reclaiming, I guess. I don't
> > understand the question.
>
> Objects that are in the same page
OK, well you can pin an object, and from there you can find other
objects in the same page.
This is totally different to how Christoph's patch has to pin the
slab, then (in a restrictive context) pin the objects, then go to
a more relaxed context to reclaim the objects. This is where much
of the complexity comes from.
> There are really two different cases here:
> - Run out of memory: in this case I just want to find all the objects
> of any page, ideally of not that recently used pages.
> - I am very fragmented and want a specific page freed to get a 2MB
> region back or for hwpoison: same, but do it for a specific page.
>
>
> > Right, but as you can see it is complex to do it this way. And I
> > think for reclaim driven targeted reclaim, then it needn't be so
> > inefficient because you aren't restricted to just one page, but
> > in any page which is heavily fragmented (and by definition there
> > should be a lot of them in the system).
>
> Assuming you can identify them quickly.
Well, because there are a large number of them, you are likely
to encounter one very quickly just off the LRU list.
> > Hwpoison I don't think adds much weight, frankly. Just panic and
> > reboot if you get unrecoverable error. We have everything to handle
>
> This is for soft hwpoison: offlining pages that might go bad
> in the future.
I still don't think it adds much weight. Especially if you can just
try an inefficient scan.
> But soft hwpoison isn't the only user. The other big one would
> be for large pages or other large page allocations.
^ permalink raw reply [flat|nested] 56+ messages in thread
* Re: dentries: dentry defragmentation
2010-02-01 10:56 ` Nick Piggin
@ 2010-02-01 13:25 ` Andi Kleen
2010-02-01 13:36 ` Nick Piggin
0 siblings, 1 reply; 56+ messages in thread
From: Andi Kleen @ 2010-02-01 13:25 UTC (permalink / raw)
To: Nick Piggin
Cc: Andi Kleen, Al Viro, Christoph Lameter, Dave Chinner,
Alexander Viro, Christoph Hellwig, Christoph Lameter,
Rik van Riel, Pekka Enberg, akpm, Miklos Szeredi, Nick Piggin,
Hugh Dickins, linux-kernel
>
> > > Right, but as you can see it is complex to do it this way. And I
> > > think for reclaim driven targeted reclaim, then it needn't be so
> > > inefficient because you aren't restricted to just one page, but
> > > in any page which is heavily fragmented (and by definition there
> > > should be a lot of them in the system).
> >
> > Assuming you can identify them quickly.
>
> Well because there are a large number of them, then you are likely
> to encounter one very quickly just off the LRU list.
There were some cases in the past where this didn't hold.
But yes, some up-to-date numbers on this would be good.
Also it doesn't address the second case, quoted again here.
> > There are really two different cases here:
> > - Run out of memory: in this case I just want to find all the objects
> > of any page, ideally of not that recently used pages.
> > - I am very fragmented and want a specific page freed to get a 2MB
> > region back or for hwpoison: same, but do it for a specific page.
> >
>
>
> I still don't think it adds much weight. Especially if you can just
> try an inefficient scan.
Also see second point below.
>
>
> > But soft hwpoison isn't the only user. The other big one would
> > be for large pages or other large page allocations.
-Andi
--
ak@linux.intel.com -- Speaking for myself only.
^ permalink raw reply [flat|nested] 56+ messages in thread
* Re: dentries: dentry defragmentation
2010-02-01 13:25 ` Andi Kleen
@ 2010-02-01 13:36 ` Nick Piggin
0 siblings, 0 replies; 56+ messages in thread
From: Nick Piggin @ 2010-02-01 13:36 UTC (permalink / raw)
To: Andi Kleen
Cc: Al Viro, Christoph Lameter, Dave Chinner, Alexander Viro,
Christoph Hellwig, Christoph Lameter, Rik van Riel, Pekka Enberg,
akpm, Miklos Szeredi, Nick Piggin, Hugh Dickins, linux-kernel
On Mon, Feb 01, 2010 at 02:25:27PM +0100, Andi Kleen wrote:
> >
> > > > Right, but as you can see it is complex to do it this way. And I
> > > > think for reclaim driven targeted reclaim, then it needn't be so
> > > > inefficient because you aren't restricted to just one page, but
> > > > in any page which is heavily fragmented (and by definition there
> > > > should be a lot of them in the system).
> > >
> > > Assuming you can identify them quickly.
> >
> > Well because there are a large number of them, then you are likely
> > to encounter one very quickly just off the LRU list.
>
> There were some cases in the past where this wasn't the case.
> But yes some uptodate numbers on this would be good.
>
> Also it doesn't address the second case here quoted again.
>
> > > There are really two different cases here:
> > > - Run out of memory: in this case I just want to find all the objects
> > > of any page, ideally of not that recently used pages.
> > > - I am very fragmented and want a specific page freed to get a 2MB
> > > region back or for hwpoison: same, but do it for a specific page.
> > >
> >
> >
> > I still don't think it adds much weight. Especially if you can just
> > try an inefficient scan.
>
> Also see second point below.
> >
> >
> > > But soft hwpoison isn't the only user. The other big one would
> > > be for large pages or other large page allocations.
Well yes it's possible that it could help there.
But it is always possible to do the same reclaim work via the LRU; in
the worst case it just requires reclaiming most objects. So it
probably doesn't fundamentally enable something we can't do already.
More a matter of performance, so again, numbers are needed.
^ permalink raw reply [flat|nested] 56+ messages in thread
* slub defrag: Transition patch upstream -> -next
2010-01-29 20:49 Slab Fragmentation Reduction V15 Christoph Lameter
` (16 preceding siblings ...)
2010-01-29 20:49 ` dentries: dentry defragmentation Christoph Lameter
@ 2010-01-29 20:49 ` Christoph Lameter
2010-01-30 8:54 ` Slab Fragmentation Reduction V15 Pekka Enberg
2010-01-30 10:48 ` Andi Kleen
19 siblings, 0 replies; 56+ messages in thread
From: Christoph Lameter @ 2010-01-29 20:49 UTC (permalink / raw)
To: Andi Kleen
Cc: Dave Chinner, Rik van Riel, Pekka Enberg, akpm, Miklos Szeredi,
Nick Piggin, Hugh Dickins, linux-kernel
[-- Attachment #1: fixup-next --]
[-- Type: text/plain, Size: 2222 bytes --]
The slub statistics have been simplified through the use of per-cpu
operations in -next. In order for this patchset to compile when applied
to the -next tree, the following changes need to be made.
Signed-off-by: Christoph Lameter <cl@linux-foundation.org>
---
mm/slub.c | 14 +++++---------
1 file changed, 5 insertions(+), 9 deletions(-)
Index: slab-2.6/mm/slub.c
===================================================================
--- slab-2.6.orig/mm/slub.c 2010-01-29 12:26:52.000000000 -0600
+++ slab-2.6/mm/slub.c 2010-01-29 12:28:28.000000000 -0600
@@ -2906,7 +2906,6 @@ static int kmem_cache_vacate(struct page
void *private;
unsigned long flags;
unsigned long objects;
- struct kmem_cache_cpu *c;
local_irq_save(flags);
slab_lock(page);
@@ -2955,13 +2954,12 @@ out:
* Check the result and unfreeze the slab
*/
leftover = page->inuse;
- c = get_cpu_slab(s, smp_processor_id());
if (leftover) {
/* Unsuccessful reclaim. Avoid future reclaim attempts. */
- stat(c, SHRINK_OBJECT_RECLAIM_FAILED);
+ stat(s, SHRINK_OBJECT_RECLAIM_FAILED);
__ClearPageSlubKickable(page);
} else
- stat(c, SHRINK_SLAB_RECLAIMED);
+ stat(s, SHRINK_SLAB_RECLAIMED);
unfreeze_slab(s, page, leftover > 0);
local_irq_restore(flags);
return leftover;
@@ -3012,14 +3010,12 @@ static unsigned long __kmem_cache_shrink
LIST_HEAD(zaplist);
int freed = 0;
struct kmem_cache_node *n = get_node(s, node);
- struct kmem_cache_cpu *c;
if (n->nr_partial <= limit)
return 0;
spin_lock_irqsave(&n->list_lock, flags);
- c = get_cpu_slab(s, smp_processor_id());
- stat(c, SHRINK_CALLS);
+ stat(s, SHRINK_CALLS);
list_for_each_entry_safe(page, page2, &n->partial, lru) {
if (!slab_trylock(page))
/* Busy slab. Get out of the way */
@@ -3039,14 +3035,14 @@ static unsigned long __kmem_cache_shrink
list_move(&page->lru, &zaplist);
if (s->kick) {
- stat(c, SHRINK_ATTEMPT_DEFRAG);
+ stat(s, SHRINK_ATTEMPT_DEFRAG);
n->nr_partial--;
__SetPageSlubFrozen(page);
}
slab_unlock(page);
} else {
/* Empty slab page */
- stat(c, SHRINK_EMPTY_SLAB);
+ stat(s, SHRINK_EMPTY_SLAB);
list_del(&page->lru);
n->nr_partial--;
slab_unlock(page);
--
^ permalink raw reply [flat|nested] 56+ messages in thread
* Re: Slab Fragmentation Reduction V15
2010-01-29 20:49 Slab Fragmentation Reduction V15 Christoph Lameter
` (17 preceding siblings ...)
2010-01-29 20:49 ` slub defrag: Transition patch upstream -> -next Christoph Lameter
@ 2010-01-30 8:54 ` Pekka Enberg
2010-01-30 10:48 ` Andi Kleen
19 siblings, 0 replies; 56+ messages in thread
From: Pekka Enberg @ 2010-01-30 8:54 UTC (permalink / raw)
To: Christoph Lameter
Cc: Andi Kleen, Dave Chinner, Rik van Riel, akpm, Miklos Szeredi,
Nick Piggin, Hugh Dickins, linux-kernel
On Fri, Jan 29, 2010 at 10:49 PM, Christoph Lameter
<cl@linux-foundation.org> wrote:
> This is one of these year long projects to address fundamental issues in the
> Linux VM. The problem is that sparse use of objects in slab caches can cause
> large amounts of memory to become unusable. The first ideas to address this
> were developed in 2005 by various people. Some of the issues with SLAB that
> we discovered while prototyping these ideas also contributed to the locking
> design in SLUB, which is highly decentralized and allows stabilizing the object
> state slab-wise by taking a per-slab lock.
>
> This patchset was first proposed in the beginning of 2007. It was almost merged
> in 2008 when last minute objections arose in the way this interacts with
> filesystem objects (inode/dentry).
Yeah, I think the SLUB bits were fine but it wasn't clear whether
or not the FS bits would be merged. No point in merging functionality
in SLUB unless it's going to be used.
^ permalink raw reply [flat|nested] 56+ messages in thread
* Re: Slab Fragmentation Reduction V15
2010-01-29 20:49 Slab Fragmentation Reduction V15 Christoph Lameter
` (18 preceding siblings ...)
2010-01-30 8:54 ` Slab Fragmentation Reduction V15 Pekka Enberg
@ 2010-01-30 10:48 ` Andi Kleen
2010-01-30 14:53 ` Rik van Riel
2010-02-01 17:52 ` Christoph Lameter
19 siblings, 2 replies; 56+ messages in thread
From: Andi Kleen @ 2010-01-30 10:48 UTC (permalink / raw)
To: Christoph Lameter
Cc: Andi Kleen, Dave Chinner, Rik van Riel, Pekka Enberg, akpm,
Miklos Szeredi, Nick Piggin, Hugh Dickins, linux-kernel, viro
On Fri, Jan 29, 2010 at 02:49:31PM -0600, Christoph Lameter wrote:
> This patchset was first proposed in the beginning of 2007. It was almost merged
> in 2008 when last-minute objections arose about the way this interacts with
> filesystem objects (inode/dentry).
>
> Andi has asked that we reconsider this issue. So I have updated the patchset
Thanks for reposting.
My motivation here is to improve hwpoison soft offlining, but I think
having this would be a general improvement.
> to apply against current upstream (and also -next with a special patch
> at the end). The issues with icache/dentry locking remain. In order
> for this to be merged we would have to come up with a revised dentry/inode
> locking code that can
>
> 1. Establish a reference to a dentry/inode so that it is pinned.
> Hopefully in a way that is not too expensive (i.e. no superblock
> lock)
>
> 2. A means to free dentry/inode objects from the VM reclaim context.
Al, do you have any suggestions on a good way to do that?
I guess the problem could be simplified by ignoring dentries in "unusual"
states?
> The other objection against this patchset was that it does not support
> reclaim through SLAB. It is possible to add this type of support to SLAB too
I think not supporting SLAB/SLOB is fine.
-Andi
^ permalink raw reply [flat|nested] 56+ messages in thread
* Re: Slab Fragmentation Reduction V15
2010-01-30 10:48 ` Andi Kleen
@ 2010-01-30 14:53 ` Rik van Riel
2010-02-01 17:53 ` Christoph Lameter
2010-02-01 17:52 ` Christoph Lameter
1 sibling, 1 reply; 56+ messages in thread
From: Rik van Riel @ 2010-01-30 14:53 UTC (permalink / raw)
To: Andi Kleen
Cc: Christoph Lameter, Dave Chinner, Pekka Enberg, akpm,
Miklos Szeredi, Nick Piggin, Hugh Dickins, linux-kernel, viro
On 01/30/2010 05:48 AM, Andi Kleen wrote:
> On Fri, Jan 29, 2010 at 02:49:31PM -0600, Christoph Lameter wrote:
>> 1. Establish a reference to a dentry/inode so that it is pinned.
>> Hopefully in a way that is not too expensive (i.e. no superblock
>> lock)
>>
>> 2. A means to free dentry/inode objects from the VM reclaim context.
>
>
> Al, do you have any suggestions on a good way to do that?
You cannot free inode objects for files that are open, mmapped, etc.
> I guess the problem could be simplified by ignoring dentries in "unusual"
> states?
You mean dentries that are in use? :)
--
All rights reversed.
^ permalink raw reply [flat|nested] 56+ messages in thread
* Re: Slab Fragmentation Reduction V15
2010-01-30 14:53 ` Rik van Riel
@ 2010-02-01 17:53 ` Christoph Lameter
0 siblings, 0 replies; 56+ messages in thread
From: Christoph Lameter @ 2010-02-01 17:53 UTC (permalink / raw)
To: Rik van Riel
Cc: Andi Kleen, Dave Chinner, Pekka Enberg, akpm, Miklos Szeredi,
Nick Piggin, Hugh Dickins, linux-kernel, viro
On Sat, 30 Jan 2010, Rik van Riel wrote:
> On 01/30/2010 05:48 AM, Andi Kleen wrote:
> > On Fri, Jan 29, 2010 at 02:49:31PM -0600, Christoph Lameter wrote:
>
> > > 1. Establish a reference to a dentry/inode so that it is pinned.
> > > Hopefully in a way that is not too expensive (i.e. no
> > > superblock
> > > lock)
> > >
> > > 2. A means to free dentry/inode objects from the VM reclaim context.
> >
> >
> > Al, do you have any suggestions on a good way to do that?
>
> You cannot free inode objects for files that are open, mmapped, etc.
Of course. Those objects need to prevent reclaim attempts.
> > I guess the problem could be simplified by ignoring dentries in "unusual"
> > states?
>
> You mean dentries that are in use? :)
The existing patch already tried to discern that and avoid the reclaim of
these.
^ permalink raw reply [flat|nested] 56+ messages in thread
* Re: Slab Fragmentation Reduction V15
2010-01-30 10:48 ` Andi Kleen
2010-01-30 14:53 ` Rik van Riel
@ 2010-02-01 17:52 ` Christoph Lameter
1 sibling, 0 replies; 56+ messages in thread
From: Christoph Lameter @ 2010-02-01 17:52 UTC (permalink / raw)
To: Andi Kleen
Cc: Dave Chinner, Rik van Riel, Pekka Enberg, akpm, Miklos Szeredi,
Nick Piggin, Hugh Dickins, linux-kernel, viro
On Sat, 30 Jan 2010, Andi Kleen wrote:
> I guess the problem could be simplified by ignoring dentries in "unusual"
> states?
Sure.
^ permalink raw reply [flat|nested] 56+ messages in thread