* [patch 00/19] Slab Fragmentation Reduction V13
@ 2008-05-10 2:21 Christoph Lameter
2008-05-10 2:21 ` [patch 01/19] slub: Add defrag_ratio field and sysfs support Christoph Lameter
` (19 more replies)
0 siblings, 20 replies; 64+ messages in thread
From: Christoph Lameter @ 2008-05-10 2:21 UTC (permalink / raw)
To: Pekka Enberg
Cc: akpm, linux-kernel, linux-fsdevel, Mel Gorman, andi, Rik van Riel,
mpm, Dave Chinner
V12->v13:
- Rebase onto Linux 2.6.27-rc1 (deal with page flags conversion, ctor parameters etc)
- Fix uninitialized variable issue
Slab fragmentation is mainly an issue if Linux is used as a fileserver
and large numbers of dentries, inodes and buffer heads accumulate. In some
load situations the slabs become very sparsely populated so that a lot of
memory is wasted by slabs that contain only one or a few objects. In
extreme cases the performance of a machine becomes sluggish since
we are continually running reclaim without much success.
Slab defragmentation adds the capability to recover the memory that
is wasted.
Memory reclaim for the following slab caches is possible:
1. dentry cache
2. inode cache (with a generic interface to allow easy setup of more
filesystems than the currently supported ext2/3/4 reiserfs, XFS
and proc)
3. buffer_heads
One typical mechanism that triggers slab defragmentation on my systems
is the daily run of
updatedb
Updatedb scans all files on the system, which causes heavy inode and dentry
use. After updatedb completes we go back to the regular use
patterns (typical on my machine: kernel compiles), which need the memory
for different purposes. The inodes and dentries used by updatedb are
gradually aged by the dentry/inode reclaim algorithm, which frees
dentries and inodes randomly throughout the slabs that were
allocated. As a result the slabs become sparsely populated. If they
become empty they can be freed, but a lot of them remain sparsely
populated. That is where slab defrag comes in: it removes the objects from
the slabs with just a few entries, reclaiming more memory for other uses.
In the simplest case (as provided here) this is done by simply reclaiming
the objects.
However, if the logic in the kick() function is made more
sophisticated then we will be able to move objects out of the slabs.
If a slab cache is fragmented, such allocations can be done without
involving the page allocator because a large number of free slots is
already available in partially populated slabs. Moving an object into
such a slab reduces the fragmentation of the slab it is moved to.
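For instance, a minimal kick() callback for a hypothetical cache (using the
callback signature introduced later in this series) that only implements the
simple reclaim case could look roughly like this; a real callback would go
through the object's own teardown path, which also drops the reference taken
by the companion get() callback:

static void my_kick(struct kmem_cache *s, int nr, void **v, void *private)
{
	int i;

	for (i = 0; i < nr; i++)
		if (v[i])		/* entries voided by get() are NULL */
			kmem_cache_free(s, v[i]);
}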
V11->V12:
- Pekka and I fixed various minor issues pointed out by Andrew.
- Split ext2/3/4 defrag support patches.
- Add more documentation
- Revise the way that slab defrag is triggered from reclaim. No longer
use a timeout but track the amount of slab reclaim done by the shrinkers.
Add a field in /proc/sys/vm/slab_defrag_limit to control the threshold.
- Display current slab_defrag_counters in /proc/zoneinfo (for a zone) and
/proc/sys/vm/slab_defrag_count (for global reclaim).
- Add new config value slab_defrag_limit to /proc/sys/vm/slab_defrag_limit
- Add a patch that obsoletes SLAB and explains why SLOB does not support
defrag (Either of those could be theoretically equipped to support
slab defrag in some way but it seems that Andrew/Linus want to reduce
the number of slab allocators).
V10->V11
- Simplify determination of when to reclaim: Just scan over all partial slabs
and check if they are sparsely populated.
- Add support for performance counters
- Rediff on top of current slab-mm.
- Reduce frequency of scanning. A look at the stats showed that we
were calling into reclaim very frequently when the system was under
memory pressure which slowed things down. Various measures to
avoid scanning the partial list too frequently were added and the
earlier (expensive) method of determining the defrag ratio of the slab
cache as a whole was dropped. I think this addresses the issues that
Mel saw with V10.
V9->V10
- Rediff against upstream
V8->V9
- Rediff against 2.6.24-rc6-mm1
V7->V8
- Rediff against 2.6.24-rc3-mm2
V6->V7
- Rediff against 2.6.24-rc2-mm1
- Remove lumpy reclaim support. No point anymore given that the antifrag
handling in 2.6.24-rc2 puts reclaimable slabs into different sections.
Targeted reclaim never triggers. This has to wait until we make
slabs movable or we need to perform a special version of lumpy reclaim
in SLUB while we scan the partial lists for slabs to kick out.
Removal simplifies handling significantly since we
get to slabs in a more controlled way via the partial lists.
The patchset now provides pure reduction of fragmentation levels.
- SLAB/SLOB: Provide inlines that do nothing
- Fix various smaller issues that were brought up during review of V6.
V5->V6
- Rediff against 2.6.24-rc2 + mm slub patches.
- Add reviewed by lines.
- Take out the experimental code to make slab pages movable. That
has to wait until this has been considered by Mel.
V4->V5:
- Support lumpy reclaim for slabs
- Support reclaim via slab_shrink()
- Add constructors to ensure a consistent object state at all times.
V3->V4:
- Optimize scan for slabs that need defragmentation
- Add /sys/slab/*/defrag_ratio to allow setting defrag limits
per slab.
- Add support for buffer heads.
- Describe how the cleanup after the daily updatedb can be
improved by slab defragmentation.
V2->V3
- Support directory reclaim
- Add infrastructure to trigger defragmentation after slab shrinking if we
have slabs with a high degree of fragmentation.
V1->V2
- Clean up control flow using a state variable. Simplify API. Back to 2
functions that now take arrays of objects.
- Inode defrag support for a set of filesystems
- Fix up dentry defrag support to work on negative dentries by adding
a new dentry flag that indicates that a dentry is not in the process
of being freed or allocated.
--
* [patch 01/19] slub: Add defrag_ratio field and sysfs support.
2008-05-10 2:21 [patch 00/19] Slab Fragmentation Reduction V13 Christoph Lameter
@ 2008-05-10 2:21 ` Christoph Lameter
2008-05-10 2:21 ` [patch 02/19] slub: Replace ctor field with ops field in /sys/slab/* Christoph Lameter
` (18 subsequent siblings)
19 siblings, 0 replies; 64+ messages in thread
From: Christoph Lameter @ 2008-05-10 2:21 UTC (permalink / raw)
To: Pekka Enberg
Cc: akpm, Christoph Lameter, Christoph Lameter, linux-kernel,
linux-fsdevel, Mel Gorman, andi, Rik van Riel, mpm, Dave Chinner
[-- Attachment #1: 0001-SLUB-Add-defrag_ratio-field-and-sysfs-support.patch --]
[-- Type: text/plain, Size: 2560 bytes --]
The defrag_ratio is used to set the threshold at which defragmentation
should be attempted on a slab page.
The allocation ratio is the percentage of the available object slots that
are allocated.
Add a defrag_ratio field and set it to 30% by default. A limit of 30% specifies
that slab defragmentation is only attempted when fewer than 3 out of 10
available object slots are in use.
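As a sketch of what that check amounts to (a hypothetical helper, not part of
this patch, mirroring the comparison done by the defrag core later in the
series):

/* Sketch: true if a slab page is sparse enough to attempt reclaim. */
static inline int slab_below_defrag_ratio(struct kmem_cache *s,
						struct page *page)
{
	return page->inuse * 100 < s->defrag_ratio * page->objects;
}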
Reviewed-by: Rik van Riel <riel@redhat.com>
Signed-off-by: Christoph Lameter <clameter@sgi.com>
Signed-off-by: Pekka Enberg <penberg@cs.helsinki.fi>
Signed-off-by: Christoph Lameter <cl@linux-foundation.org>
---
include/linux/slub_def.h | 7 +++++++
mm/slub.c | 23 +++++++++++++++++++++++
2 files changed, 30 insertions(+)
Index: linux-2.6/include/linux/slub_def.h
===================================================================
--- linux-2.6.orig/include/linux/slub_def.h 2008-07-31 12:20:16.000000000 -0500
+++ linux-2.6/include/linux/slub_def.h 2008-07-31 12:20:17.000000000 -0500
@@ -88,6 +88,13 @@
void (*ctor)(void *);
int inuse; /* Offset to metadata */
int align; /* Alignment */
+ int defrag_ratio; /*
+ * Ratio used to check the percentage of
+ * objects allocated in a slab page.
+ * If less than this ratio is allocated
+ * then reclaim attempts are made.
+ */
+
const char *name; /* Name (only for display!) */
struct list_head list; /* List of slab caches */
#ifdef CONFIG_SLUB_DEBUG
Index: linux-2.6/mm/slub.c
===================================================================
--- linux-2.6.orig/mm/slub.c 2008-07-31 12:20:16.000000000 -0500
+++ linux-2.6/mm/slub.c 2008-07-31 12:20:17.000000000 -0500
@@ -2299,6 +2299,7 @@
goto error;
s->refcount = 1;
+ s->defrag_ratio = 30;
#ifdef CONFIG_NUMA
s->remote_node_defrag_ratio = 100;
#endif
@@ -4031,6 +4032,27 @@
}
SLAB_ATTR_RO(free_calls);
+static ssize_t defrag_ratio_show(struct kmem_cache *s, char *buf)
+{
+ return sprintf(buf, "%d\n", s->defrag_ratio);
+}
+
+static ssize_t defrag_ratio_store(struct kmem_cache *s,
+ const char *buf, size_t length)
+{
+ unsigned long ratio;
+ int err;
+
+ err = strict_strtoul(buf, 10, &ratio);
+ if (err)
+ return err;
+
+ if (ratio < 100)
+ s->defrag_ratio = ratio;
+ return length;
+}
+SLAB_ATTR(defrag_ratio);
+
#ifdef CONFIG_NUMA
static ssize_t remote_node_defrag_ratio_show(struct kmem_cache *s, char *buf)
{
@@ -4138,6 +4160,7 @@
&shrink_attr.attr,
&alloc_calls_attr.attr,
&free_calls_attr.attr,
+ &defrag_ratio_attr.attr,
#ifdef CONFIG_ZONE_DMA
&cache_dma_attr.attr,
#endif
--
* [patch 02/19] slub: Replace ctor field with ops field in /sys/slab/*
2008-05-10 2:21 [patch 00/19] Slab Fragmentation Reduction V13 Christoph Lameter
2008-05-10 2:21 ` [patch 01/19] slub: Add defrag_ratio field and sysfs support Christoph Lameter
@ 2008-05-10 2:21 ` Christoph Lameter
2008-05-10 2:21 ` [patch 03/19] slub: Add get() and kick() methods Christoph Lameter
` (17 subsequent siblings)
19 siblings, 0 replies; 64+ messages in thread
From: Christoph Lameter @ 2008-05-10 2:21 UTC (permalink / raw)
To: Pekka Enberg
Cc: akpm, Christoph Lameter, Christoph Lameter, linux-kernel,
linux-fsdevel, Mel Gorman, andi, Rik van Riel, mpm, Dave Chinner
[-- Attachment #1: 0002-SLUB-Replace-ctor-field-with-ops-field-in-sys-slab.patch --]
[-- Type: text/plain, Size: 1463 bytes --]
Create an ops file, /sys/slab/*/ops, containing all the operations defined
on a slab cache. This will be used to display the additional operations that
will be defined soon.
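For a cache that has a constructor, reading the new file would then produce
output along these lines (cache and symbol names are only illustrative):

# cat /sys/slab/my_cache/ops
ctor : my_cache_ctor+0x0/0x20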
Reviewed-by: Rik van Riel <riel@redhat.com>
Signed-off-by: Christoph Lameter <clameter@sgi.com>
Signed-off-by: Pekka Enberg <penberg@cs.helsinki.fi>
Signed-off-by: Christoph Lameter <cl@linux-foundation.org>
---
mm/slub.c | 16 +++++++++-------
1 file changed, 9 insertions(+), 7 deletions(-)
Index: linux-2.6/mm/slub.c
===================================================================
--- linux-2.6.orig/mm/slub.c 2008-07-31 12:19:28.000000000 -0500
+++ linux-2.6/mm/slub.c 2008-07-31 12:19:51.000000000 -0500
@@ -3803,16 +3803,18 @@
}
SLAB_ATTR(order);
-static ssize_t ctor_show(struct kmem_cache *s, char *buf)
+static ssize_t ops_show(struct kmem_cache *s, char *buf)
{
- if (s->ctor) {
- int n = sprint_symbol(buf, (unsigned long)s->ctor);
+ int x = 0;
- return n + sprintf(buf + n, "\n");
+ if (s->ctor) {
+ x += sprintf(buf + x, "ctor : ");
+ x += sprint_symbol(buf + x, (unsigned long)s->ctor);
+ x += sprintf(buf + x, "\n");
}
- return 0;
+ return x;
}
-SLAB_ATTR_RO(ctor);
+SLAB_ATTR_RO(ops);
static ssize_t aliases_show(struct kmem_cache *s, char *buf)
{
@@ -4145,7 +4147,7 @@
&slabs_attr.attr,
&partial_attr.attr,
&cpu_slabs_attr.attr,
- &ctor_attr.attr,
+ &ops_attr.attr,
&aliases_attr.attr,
&align_attr.attr,
&sanity_checks_attr.attr,
--
^ permalink raw reply [flat|nested] 64+ messages in thread
* [patch 03/19] slub: Add get() and kick() methods
2008-05-10 2:21 [patch 00/19] Slab Fragmentation Reduction V13 Christoph Lameter
2008-05-10 2:21 ` [patch 01/19] slub: Add defrag_ratio field and sysfs support Christoph Lameter
2008-05-10 2:21 ` [patch 02/19] slub: Replace ctor field with ops field in /sys/slab/* Christoph Lameter
@ 2008-05-10 2:21 ` Christoph Lameter
2008-05-10 2:21 ` [patch 04/19] slub: Sort slab cache list and establish maximum objects for defrag slabs Christoph Lameter
` (16 subsequent siblings)
19 siblings, 0 replies; 64+ messages in thread
From: Christoph Lameter @ 2008-05-10 2:21 UTC (permalink / raw)
To: Pekka Enberg
Cc: akpm, Christoph Lameter, Christoph Lameter, linux-kernel,
linux-fsdevel, Mel Gorman, andi, Rik van Riel, mpm, Dave Chinner
[-- Attachment #1: 0003-SLUB-Add-get-and-kick-methods.patch --]
[-- Type: text/plain, Size: 5458 bytes --]
Add the two methods needed for defragmentation and add the display of the
methods via the sysfs ops file.
Add documentation explaining the use of these methods and add the prototypes
to slab.h. Add functions to set up the defrag methods for a slab cache.
Add empty functions for SLAB/SLOB. The API is generic, so it
could theoretically be implemented for either allocator.
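As a usage sketch (all names are hypothetical and merely illustrate the API
added below; a ctor is required because defragmentable caches must keep their
objects in a defined state):

#include <linux/init.h>
#include <linux/slab.h>

struct my_obj {
	atomic_t refcount;
	/* ... payload ... */
};

static void my_obj_ctor(void *obj)
{
	struct my_obj *o = obj;

	atomic_set(&o->refcount, 0);
}

/* get()/kick() implementations as described in this series */
static void *my_get(struct kmem_cache *s, int nr, void **v);
static void my_kick(struct kmem_cache *s, int nr, void **v, void *private);

static struct kmem_cache *my_cachep;

static int __init my_cache_init(void)
{
	my_cachep = kmem_cache_create("my_cache", sizeof(struct my_obj),
				0, SLAB_RECLAIM_ACCOUNT, my_obj_ctor);
	if (!my_cachep)
		return -ENOMEM;
	/* Register the defrag callbacks. Requires the ctor above. */
	kmem_cache_setup_defrag(my_cachep, my_get, my_kick);
	return 0;
}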
Reviewed-by: Rik van Riel <riel@redhat.com>
Signed-off-by: Christoph Lameter <clameter@sgi.com>
Signed-off-by: Pekka Enberg <penberg@cs.helsinki.fi>
Signed-off-by: Christoph Lameter <cl@linux-foundation.org>
---
include/linux/slab.h | 50 +++++++++++++++++++++++++++++++++++++++++++++++
include/linux/slub_def.h | 3 ++
mm/slub.c | 29 ++++++++++++++++++++++++++-
3 files changed, 81 insertions(+), 1 deletion(-)
Index: linux-2.6/include/linux/slub_def.h
===================================================================
--- linux-2.6.orig/include/linux/slub_def.h 2008-07-31 12:19:28.000000000 -0500
+++ linux-2.6/include/linux/slub_def.h 2008-07-31 12:19:39.000000000 -0500
@@ -86,6 +86,9 @@
gfp_t allocflags; /* gfp flags to use on each alloc */
int refcount; /* Refcount for slab cache destroy */
void (*ctor)(void *);
+ kmem_defrag_get_func *get;
+ kmem_defrag_kick_func *kick;
+
int inuse; /* Offset to metadata */
int align; /* Alignment */
int defrag_ratio; /*
Index: linux-2.6/mm/slub.c
===================================================================
--- linux-2.6.orig/mm/slub.c 2008-07-31 12:19:28.000000000 -0500
+++ linux-2.6/mm/slub.c 2008-07-31 12:19:48.000000000 -0500
@@ -2736,6 +2736,19 @@
}
EXPORT_SYMBOL(kfree);
+void kmem_cache_setup_defrag(struct kmem_cache *s,
+ kmem_defrag_get_func get, kmem_defrag_kick_func kick)
+{
+ /*
+ * Defragmentable slabs must have a ctor otherwise objects may be
+ * in an undetermined state after they are allocated.
+ */
+ BUG_ON(!s->ctor);
+ s->get = get;
+ s->kick = kick;
+}
+EXPORT_SYMBOL(kmem_cache_setup_defrag);
+
/*
* kmem_cache_shrink removes empty slabs from the partial lists and sorts
* the remaining slabs by the number of items in use. The slabs with the
@@ -3029,7 +3042,7 @@
if (slub_nomerge || (s->flags & SLUB_NEVER_MERGE))
return 1;
- if (s->ctor)
+ if (s->ctor || s->kick || s->get)
return 1;
/*
@@ -3812,6 +3825,20 @@
x += sprint_symbol(buf + x, (unsigned long)s->ctor);
x += sprintf(buf + x, "\n");
}
+
+ if (s->get) {
+ x += sprintf(buf + x, "get : ");
+ x += sprint_symbol(buf + x,
+ (unsigned long)s->get);
+ x += sprintf(buf + x, "\n");
+ }
+
+ if (s->kick) {
+ x += sprintf(buf + x, "kick : ");
+ x += sprint_symbol(buf + x,
+ (unsigned long)s->kick);
+ x += sprintf(buf + x, "\n");
+ }
return x;
}
SLAB_ATTR_RO(ops);
Index: linux-2.6/include/linux/slab.h
===================================================================
--- linux-2.6.orig/include/linux/slab.h 2008-07-31 12:19:25.000000000 -0500
+++ linux-2.6/include/linux/slab.h 2008-07-31 12:19:45.000000000 -0500
@@ -102,6 +102,56 @@
size_t ksize(const void *);
/*
+ * Function prototypes passed to kmem_cache_defrag() to enable defragmentation
+ * and targeted reclaim in slab caches.
+ */
+
+/*
+ * kmem_cache_defrag_get_func() is called with locks held so that the slab
+ * objects cannot be freed. We are in an atomic context and no slab
+ * operations may be performed. The purpose of kmem_cache_defrag_get_func()
+ * is to obtain a stable refcount on the objects, so that they cannot be
+ * removed until kmem_cache_kick_func() has handled them.
+ *
+ * Parameters passed are the number of objects to process and an array of
+ * pointers to objects for which we need references.
+ *
+ * Returns a pointer that is passed to the kick function. If any objects
+ * cannot be moved then the pointer may indicate a failure and
+ * then kick can simply remove the references that were already obtained.
+ *
+ * The object pointer array passed is also passed to kmem_cache_defrag_kick().
+ * The function may remove objects from the array by setting pointers to
+ * NULL. This is useful if we can determine that an object is already about
+ * to be removed. In that case it is often impossible to obtain the necessary
+ * refcount.
+ */
+typedef void *kmem_defrag_get_func(struct kmem_cache *, int, void **);
+
+/*
+ * kmem_cache_defrag_kick_func is called with no locks held and interrupts
+ * enabled. Sleeping is possible. Any operation may be performed in kick().
+ * kmem_cache_defrag should free all the objects in the pointer array.
+ *
+ * Parameters passed are the number of objects in the array, the array of
+ * pointers to the objects and the pointer returned by kmem_cache_defrag_get().
+ *
+ * Success is checked by examining the number of remaining objects in the slab.
+ */
+typedef void kmem_defrag_kick_func(struct kmem_cache *, int, void **, void *);
+
+/*
+ * kmem_cache_setup_defrag() is used to setup callbacks for a slab cache.
+ */
+#ifdef CONFIG_SLUB
+void kmem_cache_setup_defrag(struct kmem_cache *, kmem_defrag_get_func,
+ kmem_defrag_kick_func);
+#else
+static inline void kmem_cache_setup_defrag(struct kmem_cache *s,
+ kmem_defrag_get_func get, kmem_defrag_kick_func kick) {}
+#endif
+
+/*
* Allocator specific definitions. These are mainly used to establish optimized
* ways to convert kmalloc() calls to kmem_cache_alloc() invocations by
* selecting the appropriate general cache at compile time.
--
^ permalink raw reply [flat|nested] 64+ messages in thread
* [patch 04/19] slub: Sort slab cache list and establish maximum objects for defrag slabs
2008-05-10 2:21 [patch 00/19] Slab Fragmentation Reduction V13 Christoph Lameter
` (2 preceding siblings ...)
2008-05-10 2:21 ` [patch 03/19] slub: Add get() and kick() methods Christoph Lameter
@ 2008-05-10 2:21 ` Christoph Lameter
2008-05-10 2:21 ` [patch 05/19] slub: Slab defrag core Christoph Lameter
` (15 subsequent siblings)
19 siblings, 0 replies; 64+ messages in thread
From: Christoph Lameter @ 2008-05-10 2:21 UTC (permalink / raw)
To: Pekka Enberg
Cc: akpm, Christoph Lameter, Christoph Lameter, linux-kernel,
linux-fsdevel, Mel Gorman, andi, Rik van Riel, mpm, Dave Chinner
[-- Attachment #1: 0004-SLUB-Sort-slab-cache-list-and-establish-maximum-obj.patch --]
[-- Type: text/plain, Size: 2624 bytes --]
When defragmenting slabs it is advantageous to have all
defragmentable slab caches together at the beginning of the list so that there
is no need to scan the complete list. Put defragmentable caches first when
adding a slab cache and others last.
Determine the maximum number of objects in defragmentable slabs. This allows
us to size the arrays that will later hold references to these objects.
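As a concrete illustration of the sizing done by alloc_scratch() below
(assuming a 64 bit machine): for a cache whose largest slab holds 64 objects,
the scratch space is 64 * sizeof(void *) = 512 bytes for the pointer vector
plus BITS_TO_LONGS(64) * sizeof(unsigned long) = 8 bytes for the object
bitmap, 520 bytes in total.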
Reviewed-by: Rik van Riel <riel@redhat.com>
Signed-off-by: Christoph Lameter <clameter@sgi.com>
Signed-off-by: Pekka Enberg <penberg@cs.helsinki.fi>
Signed-off-by: Christoph Lameter <cl@linux-foundation.org>
---
mm/slub.c | 26 ++++++++++++++++++++++++--
1 file changed, 24 insertions(+), 2 deletions(-)
Index: linux-2.6/mm/slub.c
===================================================================
--- linux-2.6.orig/mm/slub.c 2008-07-31 12:19:28.000000000 -0500
+++ linux-2.6/mm/slub.c 2008-07-31 12:19:45.000000000 -0500
@@ -173,6 +173,9 @@
static DECLARE_RWSEM(slub_lock);
static LIST_HEAD(slab_caches);
+/* Maximum objects in defragmentable slabs */
+static unsigned int max_defrag_slab_objects;
+
/*
* Tracking user of a slab.
*/
@@ -2506,7 +2509,7 @@
flags, NULL))
goto panic;
- list_add(&s->list, &slab_caches);
+ list_add_tail(&s->list, &slab_caches);
up_write(&slub_lock);
if (sysfs_slab_add(s))
goto panic;
@@ -2736,9 +2739,23 @@
}
EXPORT_SYMBOL(kfree);
+/*
+ * Allocate a slab scratch space that is sufficient to keep at least
+ * max_defrag_slab_objects pointers to individual objects and also a bitmap
+ * for max_defrag_slab_objects.
+ */
+static inline void *alloc_scratch(void)
+{
+ return kmalloc(max_defrag_slab_objects * sizeof(void *) +
+ BITS_TO_LONGS(max_defrag_slab_objects) * sizeof(unsigned long),
+ GFP_KERNEL);
+}
+
void kmem_cache_setup_defrag(struct kmem_cache *s,
kmem_defrag_get_func get, kmem_defrag_kick_func kick)
{
+ int max_objects = oo_objects(s->max);
+
/*
* Defragmentable slabs must have a ctor otherwise objects may be
* in an undetermined state after they are allocated.
@@ -2746,6 +2763,11 @@
BUG_ON(!s->ctor);
s->get = get;
s->kick = kick;
+ down_write(&slub_lock);
+ list_move(&s->list, &slab_caches);
+ if (max_objects > max_defrag_slab_objects)
+ max_defrag_slab_objects = max_objects;
+ up_write(&slub_lock);
}
EXPORT_SYMBOL(kmem_cache_setup_defrag);
@@ -3131,7 +3153,7 @@
if (s) {
if (kmem_cache_open(s, GFP_KERNEL, name,
size, align, flags, ctor)) {
- list_add(&s->list, &slab_caches);
+ list_add_tail(&s->list, &slab_caches);
up_write(&slub_lock);
if (sysfs_slab_add(s))
goto err;
--
^ permalink raw reply [flat|nested] 64+ messages in thread
* [patch 05/19] slub: Slab defrag core
2008-05-10 2:21 [patch 00/19] Slab Fragmentation Reduction V13 Christoph Lameter
` (3 preceding siblings ...)
2008-05-10 2:21 ` [patch 04/19] slub: Sort slab cache list and establish maximum objects for defrag slabs Christoph Lameter
@ 2008-05-10 2:21 ` Christoph Lameter
2008-05-10 2:21 ` [patch 06/19] slub: Add KICKABLE to avoid repeated kick() attempts Christoph Lameter
` (14 subsequent siblings)
19 siblings, 0 replies; 64+ messages in thread
From: Christoph Lameter @ 2008-05-10 2:21 UTC (permalink / raw)
To: Pekka Enberg
Cc: akpm, Christoph Lameter, Christoph Lameter, linux-kernel,
linux-fsdevel, Mel Gorman, andi, Rik van Riel, mpm, Dave Chinner
[-- Attachment #1: 0005-SLUB-Slab-defrag-core.patch --]
[-- Type: text/plain, Size: 12916 bytes --]
Slab defragmentation may occur:
1. Unconditionally when kmem_cache_shrink() is called on a slab cache by the
kernel.
2. Through the use of the slabinfo command.
3. Per node defrag conditionally when kmem_cache_defrag(<node>) is called
(can be called from reclaim code with a later patch).
Defragmentation is only performed if the fragmentation of the slab
is lower than the specified percentage. Fragmentation ratios are measured
by calculating the percentage of objects in use compared to the total
number of objects that the slab page can accommodate.
The scanning of slab caches is optimized because the
defragmentable slabs come first on the list. Thus we can terminate scans
on the first slab encountered that does not support defragmentation.
kmem_cache_defrag() takes a node parameter. This can either be -1 if
defragmentation should be performed on all nodes, or a node number.
A couple of functions must be set up via a call to kmem_cache_setup_defrag()
in order for a slab cache to support defragmentation. These are
kmem_defrag_get_func (void *get(struct kmem_cache *s, int nr, void **objects))
Must obtain a reference to the listed objects. SLUB guarantees that
the objects are still allocated. However, other threads may be blocked
in slab_free() attempting to free objects in the slab. These may succeed
as soon as get() returns to the slab allocator. The function must
be able to detect such situations and void the attempts to free such
objects (by for example voiding the corresponding entry in the objects
array).
No slab operations may be performed in get(). Interrupts
are disabled. What can be done is very limited. The slab lock
for the page that contains the object is taken. Any attempt to perform
a slab operation may lead to a deadlock.
kmem_defrag_get_func returns a private pointer that is passed to
kmem_defrag_kick_func(). Should we be unable to obtain all references
then that pointer may indicate to the kick() function that it should
not attempt any object removal or move but simply remove the
reference counts.
kmem_defrag_kick_func (void kick(struct kmem_cache *, int nr, void **objects,
void *get_result))
After SLUB has established references to the objects in a
slab it will then drop all locks and use kick() to move objects out
of the slab. The existence of the object is guaranteed by virtue of
the earlier obtained references via kmem_defrag_get_func(). The
callback may perform any slab operation since no locks are held at
the time of call.
The callback should remove the object from the slab in some way. This
may be accomplished by reclaiming the object and then running
kmem_cache_free() or reallocating it and then running
kmem_cache_free(). Reallocation is advantageous because the partial
list was just sorted to put the slabs with the most objects first.
Reallocation is then likely to fill up a slab in addition to freeing
up one slab. A filled up slab can also be removed from the partial
list. So there could be a double effect.
kmem_defrag_kick_func() does not return a result. SLUB will check
the number of remaining objects in the slab. If all objects were
removed then the operation was successful.
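To illustrate the reallocation strategy, a kick() callback could, under the
constraints above, look roughly like the following sketch (the my_obj type
and the my_obj_relocate()/my_obj_put() helpers are made up):

static void my_kick(struct kmem_cache *s, int nr, void **v, void *private)
{
	int i;

	for (i = 0; i < nr; i++) {
		struct my_obj *old = v[i];
		struct my_obj *new;

		if (!old)
			continue;	/* entry voided by get() */

		new = kmem_cache_alloc(s, GFP_KERNEL);
		if (new)
			my_obj_relocate(old, new);	/* move state and users */

		my_obj_put(old);	/* drop ref from get(); may free old */
	}
}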
Reviewed-by: Rik van Riel <riel@redhat.com>
Signed-off-by: Christoph Lameter <clameter@sgi.com>
Signed-off-by: Pekka Enberg <penberg@cs.helsinki.fi>
Signed-off-by: Christoph Lameter <cl@linux-foundation.org>
---
include/linux/slab.h | 3
mm/slub.c | 265 ++++++++++++++++++++++++++++++++++++++++-----------
2 files changed, 215 insertions(+), 53 deletions(-)
Index: linux-2.6/mm/slub.c
===================================================================
--- linux-2.6.orig/mm/slub.c 2008-07-31 12:19:28.000000000 -0500
+++ linux-2.6/mm/slub.c 2008-07-31 12:19:42.000000000 -0500
@@ -127,10 +127,10 @@
/*
* Maximum number of desirable partial slabs.
- * The existence of more partial slabs makes kmem_cache_shrink
- * sort the partial list by the number of objects in the.
+ * More slabs cause kmem_cache_shrink to sort the slabs by objects
+ * and triggers slab defragmentation.
*/
-#define MAX_PARTIAL 10
+#define MAX_PARTIAL 20
#define DEBUG_DEFAULT_FLAGS (SLAB_DEBUG_FREE | SLAB_RED_ZONE | \
SLAB_POISON | SLAB_STORE_USER)
@@ -2772,76 +2772,235 @@
EXPORT_SYMBOL(kmem_cache_setup_defrag);
/*
- * kmem_cache_shrink removes empty slabs from the partial lists and sorts
- * the remaining slabs by the number of items in use. The slabs with the
- * most items in use come first. New allocations will then fill those up
- * and thus they can be removed from the partial lists.
+ * Vacate all objects in the given slab.
*
- * The slabs with the least items are placed last. This results in them
- * being allocated from last increasing the chance that the last objects
- * are freed in them.
+ * The scratch area passed to the list function is sufficient to hold
+ * struct list_head times objects per slab. We use it to hold void ** times
+ * objects per slab plus a bitmap for each object.
*/
-int kmem_cache_shrink(struct kmem_cache *s)
+static int kmem_cache_vacate(struct page *page, void *scratch)
{
- int node;
- int i;
- struct kmem_cache_node *n;
- struct page *page;
- struct page *t;
- int objects = oo_objects(s->max);
- struct list_head *slabs_by_inuse =
- kmalloc(sizeof(struct list_head) * objects, GFP_KERNEL);
+ void **vector = scratch;
+ void *p;
+ void *addr = page_address(page);
+ struct kmem_cache *s;
+ unsigned long *map;
+ int leftover;
+ int count;
+ void *private;
unsigned long flags;
+ unsigned long objects;
- if (!slabs_by_inuse)
- return -ENOMEM;
+ local_irq_save(flags);
+ slab_lock(page);
- flush_all(s);
- for_each_node_state(node, N_NORMAL_MEMORY) {
- n = get_node(s, node);
+ BUG_ON(!PageSlab(page)); /* Must be a slab page */
+ BUG_ON(!SlabFrozen(page)); /* Slab must have been frozen earlier */
+
+ s = page->slab;
+ objects = page->objects;
+ map = scratch + objects * sizeof(void **);
+ if (!page->inuse || !s->kick)
+ goto out;
+
+ /* Determine used objects */
+ bitmap_fill(map, objects);
+ for_each_free_object(p, s, page->freelist)
+ __clear_bit(slab_index(p, s, addr), map);
+
+ /* Build vector of pointers to objects */
+ count = 0;
+ memset(vector, 0, objects * sizeof(void **));
+ for_each_object(p, s, addr, objects)
+ if (test_bit(slab_index(p, s, addr), map))
+ vector[count++] = p;
+
+ private = s->get(s, count, vector);
+
+ /*
+ * Got references. Now we can drop the slab lock. The slab
+ * is frozen so it cannot vanish from under us nor will
+ * allocations be performed on the slab. However, unlocking the
+ * slab will allow concurrent slab_frees to proceed.
+ */
+ slab_unlock(page);
+ local_irq_restore(flags);
+
+ /*
+ * Perform the KICK callbacks to remove the objects.
+ */
+ s->kick(s, count, vector, private);
+
+ local_irq_save(flags);
+ slab_lock(page);
+out:
+ /*
+ * Check the result and unfreeze the slab
+ */
+ leftover = page->inuse;
+ unfreeze_slab(s, page, leftover > 0);
+ local_irq_restore(flags);
+ return leftover;
+}
+
+/*
+ * Remove objects from a list of slab pages that have been gathered.
+ * Must be called with slabs that have been isolated before.
+ *
+ * kmem_cache_reclaim() is never called from an atomic context. It
+ * allocates memory for temporary storage. We are holding the
+ * slub_lock semaphore which prevents another call into
+ * the defrag logic.
+ */
+int kmem_cache_reclaim(struct list_head *zaplist)
+{
+ int freed = 0;
+ void **scratch;
+ struct page *page;
+ struct page *page2;
+
+ if (list_empty(zaplist))
+ return 0;
+
+ scratch = alloc_scratch();
+ if (!scratch)
+ return 0;
+
+ list_for_each_entry_safe(page, page2, zaplist, lru) {
+ list_del(&page->lru);
+ if (kmem_cache_vacate(page, scratch) == 0)
+ freed++;
+ }
+ kfree(scratch);
+ return freed;
+}
+
+/*
+ * Shrink the slab cache on a particular node of the cache
+ * by releasing slabs with zero objects and trying to reclaim
+ * slabs with less than the configured percentage of objects allocated.
+ */
+static unsigned long __kmem_cache_shrink(struct kmem_cache *s, int node,
+ unsigned long limit)
+{
+ unsigned long flags;
+ struct page *page, *page2;
+ LIST_HEAD(zaplist);
+ int freed = 0;
+ struct kmem_cache_node *n = get_node(s, node);
- if (!n->nr_partial)
+ if (n->nr_partial <= limit)
+ return 0;
+
+ spin_lock_irqsave(&n->list_lock, flags);
+ list_for_each_entry_safe(page, page2, &n->partial, lru) {
+ if (!slab_trylock(page))
+ /* Busy slab. Get out of the way */
continue;
- for (i = 0; i < objects; i++)
- INIT_LIST_HEAD(slabs_by_inuse + i);
+ if (page->inuse) {
+ if (page->inuse * 100 >=
+ s->defrag_ratio * page->objects) {
+ slab_unlock(page);
+ /* Slab contains enough objects */
+ continue;
+ }
- spin_lock_irqsave(&n->list_lock, flags);
+ list_move(&page->lru, &zaplist);
+ if (s->kick) {
+ n->nr_partial--;
+ SetSlabFrozen(page);
+ }
+ slab_unlock(page);
+ } else {
+ /* Empty slab page */
+ list_del(&page->lru);
+ n->nr_partial--;
+ slab_unlock(page);
+ discard_slab(s, page);
+ freed++;
+ }
+ }
+ if (!s->kick)
/*
- * Build lists indexed by the items in use in each slab.
+ * No defrag methods. By simply putting the zaplist at the
+ * end of the partial list we can let them simmer longer
+ * and thus increase the chance of all objects being
+ * reclaimed.
*
- * Note that concurrent frees may occur while we hold the
- * list_lock. page->inuse here is the upper limit.
+ * We have effectively sorted the partial list and put
+ * the slabs with more objects first. As soon as they
+ * are allocated they are going to be removed from the
+ * partial list.
*/
- list_for_each_entry_safe(page, t, &n->partial, lru) {
- if (!page->inuse && slab_trylock(page)) {
- /*
- * Must hold slab lock here because slab_free
- * may have freed the last object and be
- * waiting to release the slab.
- */
- list_del(&page->lru);
- n->nr_partial--;
- slab_unlock(page);
- discard_slab(s, page);
- } else {
- list_move(&page->lru,
- slabs_by_inuse + page->inuse);
- }
- }
+ list_splice(&zaplist, n->partial.prev);
+
+
+ spin_unlock_irqrestore(&n->list_lock, flags);
+
+ if (s->kick)
+ freed += kmem_cache_reclaim(&zaplist);
+
+ return freed;
+}
+
+/*
+ * Defrag slabs conditional on the amount of fragmentation in a page.
+ */
+int kmem_cache_defrag(int node)
+{
+ struct kmem_cache *s;
+ unsigned long slabs = 0;
+
+ /*
+ * kmem_cache_defrag may be called from the reclaim path which may be
+ * called for any page allocator alloc. So there is the danger that we
+ * get called in a situation where slub already acquired the slub_lock
+ * for other purposes.
+ */
+ if (!down_read_trylock(&slub_lock))
+ return 0;
+
+ list_for_each_entry(s, &slab_caches, list) {
+ unsigned long reclaimed = 0;
/*
- * Rebuild the partial list with the slabs filled up most
- * first and the least used slabs at the end.
+ * Defragmentable caches come first. If the slab cache is not
+ * defragmentable then we can stop traversing the list.
*/
- for (i = objects - 1; i >= 0; i--)
- list_splice(slabs_by_inuse + i, n->partial.prev);
+ if (!s->kick)
+ break;
- spin_unlock_irqrestore(&n->list_lock, flags);
+ if (node == -1) {
+ int nid;
+
+ for_each_node_state(nid, N_NORMAL_MEMORY)
+ reclaimed += __kmem_cache_shrink(s, nid,
+ MAX_PARTIAL);
+ } else
+ reclaimed = __kmem_cache_shrink(s, node, MAX_PARTIAL);
+
+ slabs += reclaimed;
}
+ up_read(&slub_lock);
+ return slabs;
+}
+EXPORT_SYMBOL(kmem_cache_defrag);
+
+/*
+ * kmem_cache_shrink removes empty slabs from the partial lists.
+ * If the slab cache supports defragmentation then objects are
+ * reclaimed.
+ */
+int kmem_cache_shrink(struct kmem_cache *s)
+{
+ int node;
+
+ flush_all(s);
+ for_each_node_state(node, N_NORMAL_MEMORY)
+ __kmem_cache_shrink(s, node, 0);
- kfree(slabs_by_inuse);
return 0;
}
EXPORT_SYMBOL(kmem_cache_shrink);
Index: linux-2.6/include/linux/slab.h
===================================================================
--- linux-2.6.orig/include/linux/slab.h 2008-07-31 12:19:28.000000000 -0500
+++ linux-2.6/include/linux/slab.h 2008-07-31 12:19:28.000000000 -0500
@@ -142,13 +142,16 @@
/*
* kmem_cache_setup_defrag() is used to setup callbacks for a slab cache.
+ * kmem_cache_defrag() performs the actual defragmentation.
*/
#ifdef CONFIG_SLUB
void kmem_cache_setup_defrag(struct kmem_cache *, kmem_defrag_get_func,
kmem_defrag_kick_func);
+int kmem_cache_defrag(int node);
#else
static inline void kmem_cache_setup_defrag(struct kmem_cache *s,
kmem_defrag_get_func get, kmem_defrag_kick_func kick) {}
+static inline int kmem_cache_defrag(int node) { return 0; }
#endif
/*
--
^ permalink raw reply [flat|nested] 64+ messages in thread
* [patch 06/19] slub: Add KICKABLE to avoid repeated kick() attempts
2008-05-10 2:21 [patch 00/19] Slab Fragmentation Reduction V13 Christoph Lameter
` (4 preceding siblings ...)
2008-05-10 2:21 ` [patch 05/19] slub: Slab defrag core Christoph Lameter
@ 2008-05-10 2:21 ` Christoph Lameter
2008-05-10 2:21 ` [patch 07/19] slub: Extend slabinfo to support -D and -F options Christoph Lameter
` (13 subsequent siblings)
19 siblings, 0 replies; 64+ messages in thread
From: Christoph Lameter @ 2008-05-10 2:21 UTC (permalink / raw)
To: Pekka Enberg
Cc: akpm, Christoph Lameter, Christoph Lameter, linux-kernel,
linux-fsdevel, Mel Gorman, andi, Rik van Riel, mpm, Dave Chinner
[-- Attachment #1: 0006-SLUB-Add-KICKABLE-to-avoid-repeated-kick-attempts.patch --]
[-- Type: text/plain, Size: 3530 bytes --]
Add a flag KICKABLE to be set on slabs with a defragmentation method.
Clear the flag if a kick action is not successful in reducing the
number of objects in a slab. This will avoid future attempts to
kick objects out.
The KICKABLE flag is set again when all objects of the slab have been
allocated (this occurs when a slab is removed from the partial lists).
Reviewed-by: Rik van Riel <riel@redhat.com>
Signed-off-by: Christoph Lameter <clameter@sgi.com>
Signed-off-by: Pekka Enberg <penberg@cs.helsinki.fi>
Signed-off-by: Christoph Lameter <cl@linux-foundation.org>
---
mm/slub.c | 35 ++++++++++++++++++++++++++++++++---
1 file changed, 32 insertions(+), 3 deletions(-)
Index: linux-2.6/mm/slub.c
===================================================================
--- linux-2.6.orig/mm/slub.c 2008-07-31 12:19:28.000000000 -0500
+++ linux-2.6/mm/slub.c 2008-07-31 12:19:39.000000000 -0500
@@ -1130,6 +1130,9 @@
SLAB_STORE_USER | SLAB_TRACE))
__SetPageSlubDebug(page);
+ if (s->kick)
+ __SetPageSlubKickable(page);
+
start = page_address(page);
if (unlikely(s->flags & SLAB_POISON))
@@ -1170,6 +1173,7 @@
NR_SLAB_RECLAIMABLE : NR_SLAB_UNRECLAIMABLE,
-pages);
+ __ClearPageSlubKickable(page);
__ClearPageSlab(page);
reset_page_mapcount(page);
__free_pages(page, order);
@@ -1380,6 +1384,8 @@
if (SLABDEBUG && PageSlubDebug(page) &&
(s->flags & SLAB_STORE_USER))
add_full(n, page);
+ if (s->kick)
+ __SetPageSlubKickable(page);
}
slab_unlock(page);
} else {
@@ -2795,12 +2801,12 @@
slab_lock(page);
BUG_ON(!PageSlab(page)); /* Must be a slab page */
- BUG_ON(!SlabFrozen(page)); /* Slab must have been frozen earlier */
+ BUG_ON(!PageSlubFrozen(page)); /* Slab must have been frozen earlier */
s = page->slab;
objects = page->objects;
map = scratch + objects * sizeof(void **);
- if (!page->inuse || !s->kick)
+ if (!page->inuse || !s->kick || !PageSlubKickable(page))
goto out;
/* Determine used objects */
@@ -2838,6 +2844,9 @@
* Check the result and unfreeze the slab
*/
leftover = page->inuse;
+ if (leftover)
+ /* Unsuccessful reclaim. Avoid future reclaim attempts. */
+ __ClearPageSlubKickable(page);
unfreeze_slab(s, page, leftover > 0);
local_irq_restore(flags);
return leftover;
@@ -2899,17 +2908,21 @@
continue;
if (page->inuse) {
- if (page->inuse * 100 >=
+ if (!PageSlubKickable(page) || page->inuse * 100 >=
s->defrag_ratio * page->objects) {
slab_unlock(page);
- /* Slab contains enough objects */
+ /*
+ * Slab contains enough objects
+ * or we already tried reclaim before and
+ * it failed. Skip this one.
+ */
continue;
}
list_move(&page->lru, &zaplist);
if (s->kick) {
n->nr_partial--;
- SetSlabFrozen(page);
+ __SetPageSlubFrozen(page);
}
slab_unlock(page);
} else {
Index: linux-2.6/include/linux/page-flags.h
===================================================================
--- linux-2.6.orig/include/linux/page-flags.h 2008-07-31 12:19:25.000000000 -0500
+++ linux-2.6/include/linux/page-flags.h 2008-07-31 12:19:28.000000000 -0500
@@ -112,6 +112,7 @@
/* SLUB */
PG_slub_frozen = PG_active,
PG_slub_debug = PG_error,
+ PG_slub_kickable = PG_dirty,
};
#ifndef __GENERATING_BOUNDS_H
@@ -182,6 +183,7 @@
__PAGEFLAG(SlubFrozen, slub_frozen)
__PAGEFLAG(SlubDebug, slub_debug)
+__PAGEFLAG(SlubKickable, slub_kickable)
/*
* Only test-and-set exist for PG_writeback. The unconditional operators are
--
^ permalink raw reply [flat|nested] 64+ messages in thread
* [patch 07/19] slub: Extend slabinfo to support -D and -F options
2008-05-10 2:21 [patch 00/19] Slab Fragmentation Reduction V13 Christoph Lameter
` (5 preceding siblings ...)
2008-05-10 2:21 ` [patch 06/19] slub: Add KICKABLE to avoid repeated kick() attempts Christoph Lameter
@ 2008-05-10 2:21 ` Christoph Lameter
2008-05-10 2:21 ` [patch 08/19] slub/slabinfo: add defrag statistics Christoph Lameter
` (12 subsequent siblings)
19 siblings, 0 replies; 64+ messages in thread
From: Christoph Lameter @ 2008-05-10 2:21 UTC (permalink / raw)
To: Pekka Enberg
Cc: akpm, Christoph Lameter, Christoph Lameter, linux-kernel,
linux-fsdevel, Mel Gorman, andi, Rik van Riel, mpm, Dave Chinner
[-- Attachment #1: 0007-SLUB-Extend-slabinfo-to-support-D-and-F-options.patch --]
[-- Type: text/plain, Size: 5707 bytes --]
-F lists caches that support defragmentation.
-C lists caches that use a ctor.
Change field names for defrag_ratio and remote_node_defrag_ratio.
Add determination of the allocation ratio for a slab. The allocation ratio
is the percentage of available object slots that are in use.
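For example, restricting the listing to defragmentable caches could then look
roughly like this (all numbers are purely illustrative):

# slabinfo -F
Name                   Objects Objsize    Space Slabs/Part/Cpu  O/S O %Ra %Ef Flg
dentry                   20825     208     4.6M      138/12/4    19 0   2  89 FCa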
Reviewed-by: Rik van Riel <riel@redhat.com>
Signed-off-by: Christoph Lameter <clameter@sgi.com>
Signed-off-by: Pekka Enberg <penberg@cs.helsinki.fi>
Signed-off-by: Christoph Lameter <cl@linux-foundation.org>
---
Documentation/vm/slabinfo.c | 48 +++++++++++++++++++++++++++++++++++++++-----
1 file changed, 43 insertions(+), 5 deletions(-)
Index: linux-next/Documentation/vm/slabinfo.c
===================================================================
--- linux-next.orig/Documentation/vm/slabinfo.c 2008-07-09 09:06:12.000000000 -0500
+++ linux-next/Documentation/vm/slabinfo.c 2008-07-09 09:33:37.000000000 -0500
@@ -31,6 +31,8 @@
int hwcache_align, object_size, objs_per_slab;
int sanity_checks, slab_size, store_user, trace;
int order, poison, reclaim_account, red_zone;
+ int defrag, ctor;
+ int defrag_ratio, remote_node_defrag_ratio;
unsigned long partial, objects, slabs, objects_partial, objects_total;
unsigned long alloc_fastpath, alloc_slowpath;
unsigned long free_fastpath, free_slowpath;
@@ -64,6 +66,8 @@
int skip_zero = 1;
int show_numa = 0;
int show_track = 0;
+int show_defrag = 0;
+int show_ctor = 0;
int show_first_alias = 0;
int validate = 0;
int shrink = 0;
@@ -100,13 +104,15 @@
void usage(void)
{
printf("slabinfo 5/7/2007. (c) 2007 sgi.\n\n"
- "slabinfo [-ahnpvtsz] [-d debugopts] [slab-regexp]\n"
+ "slabinfo [-aCdDefFhnpvtsz] [-d debugopts] [slab-regexp]\n"
"-a|--aliases Show aliases\n"
"-A|--activity Most active slabs first\n"
"-d<options>|--debug=<options> Set/Clear Debug options\n"
+ "-C|--ctor Show slabs with ctors\n"
"-D|--display-active Switch line format to activity\n"
"-e|--empty Show empty slabs\n"
"-f|--first-alias Show first alias\n"
+ "-F|--defrag Show defragmentable caches\n"
"-h|--help Show usage information\n"
"-i|--inverted Inverted list\n"
"-l|--slabs Show slabs\n"
@@ -296,7 +302,7 @@
printf("Name Objects Alloc Free %%Fast Fallb O\n");
else
printf("Name Objects Objsize Space "
- "Slabs/Part/Cpu O/S O %%Fr %%Ef Flg\n");
+ "Slabs/Part/Cpu O/S O %%Ra %%Ef Flg\n");
}
/*
@@ -345,7 +351,7 @@
return;
if (!line) {
- printf("\n%-21s:", mode ? "NUMA nodes" : "Slab");
+ printf("\n%-21s: Rto ", mode ? "NUMA nodes" : "Slab");
for(node = 0; node <= highest_node; node++)
printf(" %4d", node);
printf("\n----------------------");
@@ -354,6 +360,7 @@
printf("\n");
}
printf("%-21s ", mode ? "All slabs" : s->name);
+ printf("%3d ", s->remote_node_defrag_ratio);
for(node = 0; node <= highest_node; node++) {
char b[20];
@@ -492,6 +499,8 @@
printf("** Slabs are destroyed via RCU\n");
if (s->reclaim_account)
printf("** Reclaim accounting active\n");
+ if (s->defrag)
+ printf("** Defragmentation at %d%%\n", s->defrag_ratio);
printf("\nSizes (bytes) Slabs Debug Memory\n");
printf("------------------------------------------------------------------------\n");
@@ -539,6 +548,12 @@
if (show_empty && s->slabs)
return;
+ if (show_defrag && !s->defrag)
+ return;
+
+ if (show_ctor && !s->ctor)
+ return;
+
store_size(size_str, slab_size(s));
snprintf(dist_str, 40, "%lu/%lu/%d", s->slabs - s->cpu_slabs,
s->partial, s->cpu_slabs);
@@ -550,6 +565,10 @@
*p++ = '*';
if (s->cache_dma)
*p++ = 'd';
+ if (s->defrag)
+ *p++ = 'F';
+ if (s->ctor)
+ *p++ = 'C';
if (s->hwcache_align)
*p++ = 'A';
if (s->poison)
@@ -584,7 +603,8 @@
printf("%-21s %8ld %7d %8s %14s %4d %1d %3ld %3ld %s\n",
s->name, s->objects, s->object_size, size_str, dist_str,
s->objs_per_slab, s->order,
- s->slabs ? (s->partial * 100) / s->slabs : 100,
+ s->slabs ? (s->partial * 100) /
+ (s->slabs * s->objs_per_slab) : 100,
s->slabs ? (s->objects * s->object_size * 100) /
(s->slabs * (page_size << s->order)) : 100,
flags);
@@ -1190,7 +1210,17 @@
slab->deactivate_to_tail = get_obj("deactivate_to_tail");
slab->deactivate_remote_frees = get_obj("deactivate_remote_frees");
slab->order_fallback = get_obj("order_fallback");
+ slab->defrag_ratio = get_obj("defrag_ratio");
+ slab->remote_node_defrag_ratio =
+ get_obj("remote_node_defrag_ratio");
chdir("..");
+ if (read_slab_obj(slab, "ops")) {
+ if (strstr(buffer, "ctor :"))
+ slab->ctor = 1;
+ if (strstr(buffer, "kick :"))
+ slab->defrag = 1;
+ }
+
if (slab->name[0] == ':')
alias_targets++;
slab++;
@@ -1241,10 +1271,12 @@
struct option opts[] = {
{ "aliases", 0, NULL, 'a' },
{ "activity", 0, NULL, 'A' },
+ { "ctor", 0, NULL, 'C' },
{ "debug", 2, NULL, 'd' },
{ "display-activity", 0, NULL, 'D' },
{ "empty", 0, NULL, 'e' },
{ "first-alias", 0, NULL, 'f' },
+ { "defrag", 0, NULL, 'F' },
{ "help", 0, NULL, 'h' },
{ "inverted", 0, NULL, 'i'},
{ "numa", 0, NULL, 'n' },
@@ -1267,7 +1299,7 @@
page_size = getpagesize();
- while ((c = getopt_long(argc, argv, "aAd::Defhil1noprstvzTS",
+ while ((c = getopt_long(argc, argv, "aACd::DefFhil1noprstvzTS",
opts, NULL)) != -1)
switch (c) {
case '1':
@@ -1323,6 +1355,12 @@
case 'z':
skip_zero = 0;
break;
+ case 'C':
+ show_ctor = 1;
+ break;
+ case 'F':
+ show_defrag = 1;
+ break;
case 'T':
show_totals = 1;
break;
--
^ permalink raw reply [flat|nested] 64+ messages in thread
* [patch 08/19] slub/slabinfo: add defrag statistics
2008-05-10 2:21 [patch 00/19] Slab Fragmentation Reduction V13 Christoph Lameter
` (6 preceding siblings ...)
2008-05-10 2:21 ` [patch 07/19] slub: Extend slabinfo to support -D and -F options Christoph Lameter
@ 2008-05-10 2:21 ` Christoph Lameter
2008-05-10 2:21 ` [patch 09/19] slub: Trigger defragmentation from memory reclaim Christoph Lameter
` (11 subsequent siblings)
19 siblings, 0 replies; 64+ messages in thread
From: Christoph Lameter @ 2008-05-10 2:21 UTC (permalink / raw)
To: Pekka Enberg
Cc: akpm, Christoph Lameter, Christoph Lameter, linux-kernel,
linux-fsdevel, Mel Gorman, andi, Rik van Riel, mpm, Dave Chinner
[-- Attachment #1: 0008-slub-add-defrag-statistics.patch --]
[-- Type: text/plain, Size: 8843 bytes --]
Add statistics counters for slab defragmentation.
Signed-off-by: Christoph Lameter <clameter@sgi.com>
Signed-off-by: Pekka Enberg <penberg@cs.helsinki.fi>
Signed-off-by: Christoph Lameter <cl@linux-foundation.org>
---
Documentation/vm/slabinfo.c | 45 ++++++++++++++++++++++++++++++++++++--------
include/linux/slub_def.h | 6 +++++
mm/slub.c | 29 ++++++++++++++++++++++++++--
3 files changed, 70 insertions(+), 10 deletions(-)
Index: linux-2.6/Documentation/vm/slabinfo.c
===================================================================
--- linux-2.6.orig/Documentation/vm/slabinfo.c 2008-07-31 12:18:58.000000000 -0500
+++ linux-2.6/Documentation/vm/slabinfo.c 2008-07-31 12:18:58.000000000 -0500
@@ -41,6 +41,9 @@
unsigned long cpuslab_flush, deactivate_full, deactivate_empty;
unsigned long deactivate_to_head, deactivate_to_tail;
unsigned long deactivate_remote_frees, order_fallback;
+ unsigned long shrink_calls, shrink_attempt_defrag, shrink_empty_slab;
+ unsigned long shrink_slab_skipped, shrink_slab_reclaimed;
+ unsigned long shrink_object_reclaim_failed;
int numa[MAX_NODES];
int numa_partial[MAX_NODES];
} slabinfo[MAX_SLABS];
@@ -79,6 +82,7 @@
int set_debug = 0;
int show_ops = 0;
int show_activity = 0;
+int show_defragcount = 0;
/* Debug options */
int sanity = 0;
@@ -113,6 +117,7 @@
"-e|--empty Show empty slabs\n"
"-f|--first-alias Show first alias\n"
"-F|--defrag Show defragmentable caches\n"
+ "-G:--display-defrag Display defrag counters\n"
"-h|--help Show usage information\n"
"-i|--inverted Inverted list\n"
"-l|--slabs Show slabs\n"
@@ -300,6 +305,8 @@
{
if (show_activity)
printf("Name Objects Alloc Free %%Fast Fallb O\n");
+ else if (show_defragcount)
+ printf("Name Objects DefragRQ Slabs Success Empty Skipped Failed\n");
else
printf("Name Objects Objsize Space "
"Slabs/Part/Cpu O/S O %%Ra %%Ef Flg\n");
@@ -466,22 +473,28 @@
printf("Total %8lu %8lu\n\n", total_alloc, total_free);
- if (s->cpuslab_flush)
- printf("Flushes %8lu\n", s->cpuslab_flush);
-
- if (s->alloc_refill)
- printf("Refill %8lu\n", s->alloc_refill);
+ if (s->cpuslab_flush || s->alloc_refill)
+ printf("CPU Slab : Flushes=%lu Refills=%lu\n",
+ s->cpuslab_flush, s->alloc_refill);
total = s->deactivate_full + s->deactivate_empty +
s->deactivate_to_head + s->deactivate_to_tail;
if (total)
- printf("Deactivate Full=%lu(%lu%%) Empty=%lu(%lu%%) "
+ printf("Deactivate: Full=%lu(%lu%%) Empty=%lu(%lu%%) "
"ToHead=%lu(%lu%%) ToTail=%lu(%lu%%)\n",
s->deactivate_full, (s->deactivate_full * 100) / total,
s->deactivate_empty, (s->deactivate_empty * 100) / total,
s->deactivate_to_head, (s->deactivate_to_head * 100) / total,
s->deactivate_to_tail, (s->deactivate_to_tail * 100) / total);
+
+ if (s->shrink_calls)
+ printf("Shrink : Calls=%lu Attempts=%lu Empty=%lu Successful=%lu\n",
+ s->shrink_calls, s->shrink_attempt_defrag,
+ s->shrink_empty_slab, s->shrink_slab_reclaimed);
+ if (s->shrink_slab_skipped || s->shrink_object_reclaim_failed)
+ printf("Defrag : Slabs skipped=%lu Object reclaim failed=%lu\n",
+ s->shrink_slab_skipped, s->shrink_object_reclaim_failed);
}
void report(struct slabinfo *s)
@@ -598,7 +611,12 @@
total_alloc ? (s->alloc_fastpath * 100 / total_alloc) : 0,
total_free ? (s->free_fastpath * 100 / total_free) : 0,
s->order_fallback, s->order);
- }
+ } else
+ if (show_defragcount)
+ printf("%-21s %8ld %7d %7d %7d %7d %7d %7d\n",
+ s->name, s->objects, s->shrink_calls, s->shrink_attempt_defrag,
+ s->shrink_slab_reclaimed, s->shrink_empty_slab,
+ s->shrink_slab_skipped, s->shrink_object_reclaim_failed);
else
printf("%-21s %8ld %7d %8s %14s %4d %1d %3ld %3ld %s\n",
s->name, s->objects, s->object_size, size_str, dist_str,
@@ -1210,6 +1228,13 @@
slab->deactivate_to_tail = get_obj("deactivate_to_tail");
slab->deactivate_remote_frees = get_obj("deactivate_remote_frees");
slab->order_fallback = get_obj("order_fallback");
+ slab->shrink_calls = get_obj("shrink_calls");
+ slab->shrink_attempt_defrag = get_obj("shrink_attempt_defrag");
+ slab->shrink_empty_slab = get_obj("shrink_empty_slab");
+ slab->shrink_slab_skipped = get_obj("shrink_slab_skipped");
+ slab->shrink_slab_reclaimed = get_obj("shrink_slab_reclaimed");
+ slab->shrink_object_reclaim_failed =
+ get_obj("shrink_object_reclaim_failed");
slab->defrag_ratio = get_obj("defrag_ratio");
slab->remote_node_defrag_ratio =
get_obj("remote_node_defrag_ratio");
@@ -1274,6 +1299,7 @@
{ "ctor", 0, NULL, 'C' },
{ "debug", 2, NULL, 'd' },
{ "display-activity", 0, NULL, 'D' },
+ { "display-defrag", 0, NULL, 'G' },
{ "empty", 0, NULL, 'e' },
{ "first-alias", 0, NULL, 'f' },
{ "defrag", 0, NULL, 'F' },
@@ -1299,7 +1325,7 @@
page_size = getpagesize();
- while ((c = getopt_long(argc, argv, "aACd::DefFhil1noprstvzTS",
+ while ((c = getopt_long(argc, argv, "aACd::DefFGhil1noprstvzTS",
opts, NULL)) != -1)
switch (c) {
case '1':
@@ -1325,6 +1351,9 @@
case 'f':
show_first_alias = 1;
break;
+ case 'G':
+ show_defragcount = 1;
+ break;
case 'h':
usage();
return 0;
Index: linux-2.6/include/linux/slub_def.h
===================================================================
--- linux-2.6.orig/include/linux/slub_def.h 2008-07-31 12:18:58.000000000 -0500
+++ linux-2.6/include/linux/slub_def.h 2008-07-31 12:18:58.000000000 -0500
@@ -30,6 +30,12 @@
DEACTIVATE_TO_TAIL, /* Cpu slab was moved to the tail of partials */
DEACTIVATE_REMOTE_FREES,/* Slab contained remotely freed objects */
ORDER_FALLBACK, /* Number of times fallback was necessary */
+ SHRINK_CALLS, /* Number of invocations of kmem_cache_shrink */
+ SHRINK_ATTEMPT_DEFRAG, /* Slabs that were attempted to be reclaimed */
+ SHRINK_EMPTY_SLAB, /* Shrink encountered and freed empty slab */
+ SHRINK_SLAB_SKIPPED, /* Slab reclaim skipped a slab (busy etc) */
+ SHRINK_SLAB_RECLAIMED, /* Successfully reclaimed slabs */
+ SHRINK_OBJECT_RECLAIM_FAILED, /* Callbacks signaled busy objects */
NR_SLUB_STAT_ITEMS };
struct kmem_cache_cpu {
Index: linux-2.6/mm/slub.c
===================================================================
--- linux-2.6.orig/mm/slub.c 2008-07-31 12:18:58.000000000 -0500
+++ linux-2.6/mm/slub.c 2008-07-31 12:18:58.000000000 -0500
@@ -2796,6 +2796,7 @@
void *private;
unsigned long flags;
unsigned long objects;
+ struct kmem_cache_cpu *c;
local_irq_save(flags);
slab_lock(page);
@@ -2844,9 +2845,13 @@
* Check the result and unfreeze the slab
*/
leftover = page->inuse;
- if (leftover)
+ c = get_cpu_slab(s, smp_processor_id());
+ if (leftover) {
/* Unsuccessful reclaim. Avoid future reclaim attempts. */
+ stat(c, SHRINK_OBJECT_RECLAIM_FAILED);
__ClearPageSlubKickable(page);
+ } else
+ stat(c, SHRINK_SLAB_RECLAIMED);
unfreeze_slab(s, page, leftover > 0);
local_irq_restore(flags);
return leftover;
@@ -2897,11 +2902,14 @@
LIST_HEAD(zaplist);
int freed = 0;
struct kmem_cache_node *n = get_node(s, node);
+ struct kmem_cache_cpu *c;
if (n->nr_partial <= limit)
return 0;
spin_lock_irqsave(&n->list_lock, flags);
+ c = get_cpu_slab(s, smp_processor_id());
+ stat(c, SHRINK_CALLS);
list_for_each_entry_safe(page, page2, &n->partial, lru) {
if (!slab_trylock(page))
/* Busy slab. Get out of the way */
@@ -2921,12 +2929,14 @@
list_move(&page->lru, &zaplist);
if (s->kick) {
+ stat(c, SHRINK_ATTEMPT_DEFRAG);
n->nr_partial--;
__SetPageSlubFrozen(page);
}
slab_unlock(page);
} else {
/* Empty slab page */
+ stat(c, SHRINK_EMPTY_SLAB);
list_del(&page->lru);
n->nr_partial--;
slab_unlock(page);
@@ -4355,6 +4365,12 @@
STAT_ATTR(DEACTIVATE_TO_TAIL, deactivate_to_tail);
STAT_ATTR(DEACTIVATE_REMOTE_FREES, deactivate_remote_frees);
STAT_ATTR(ORDER_FALLBACK, order_fallback);
+STAT_ATTR(SHRINK_CALLS, shrink_calls);
+STAT_ATTR(SHRINK_ATTEMPT_DEFRAG, shrink_attempt_defrag);
+STAT_ATTR(SHRINK_EMPTY_SLAB, shrink_empty_slab);
+STAT_ATTR(SHRINK_SLAB_SKIPPED, shrink_slab_skipped);
+STAT_ATTR(SHRINK_SLAB_RECLAIMED, shrink_slab_reclaimed);
+STAT_ATTR(SHRINK_OBJECT_RECLAIM_FAILED, shrink_object_reclaim_failed);
#endif
static struct attribute *slab_attrs[] = {
@@ -4409,6 +4425,12 @@
&deactivate_to_tail_attr.attr,
&deactivate_remote_frees_attr.attr,
&order_fallback_attr.attr,
+ &shrink_calls_attr.attr,
+ &shrink_attempt_defrag_attr.attr,
+ &shrink_empty_slab_attr.attr,
+ &shrink_slab_skipped_attr.attr,
+ &shrink_slab_reclaimed_attr.attr,
+ &shrink_object_reclaim_failed_attr.attr,
#endif
NULL
};
--
^ permalink raw reply [flat|nested] 64+ messages in thread
* [patch 09/19] slub: Trigger defragmentation from memory reclaim
2008-05-10 2:21 [patch 00/19] Slab Fragmentation Reduction V13 Christoph Lameter
` (7 preceding siblings ...)
2008-05-10 2:21 ` [patch 08/19] slub/slabinfo: add defrag statistics Christoph Lameter
@ 2008-05-10 2:21 ` Christoph Lameter
2008-05-10 2:21 ` [patch 10/19] buffer heads: Support slab defrag Christoph Lameter
` (10 subsequent siblings)
19 siblings, 0 replies; 64+ messages in thread
From: Christoph Lameter @ 2008-05-10 2:21 UTC (permalink / raw)
To: Pekka Enberg
Cc: akpm, Christoph Lameter, Christoph Lameter, linux-kernel,
linux-fsdevel, Mel Gorman, andi, Rik van Riel, mpm, Dave Chinner
[-- Attachment #1: 0009-SLUB-Trigger-defragmentation-from-memory-reclaim.patch --]
[-- Type: text/plain, Size: 10466 bytes --]
This patch triggers slab defragmentation from memory reclaim. The logical
point for this is after slab shrinking has been performed in vmscan.c. At that
point the fragmentation of a slab has increased because objects were freed via
the LRU lists maintained for various slab caches.
So we call kmem_cache_defrag() from there.
shrink_slab() is called in some contexts to do global shrinking
of slabs and in others to do shrinking for a particular zone. Pass the zone to
shrink_slab() so that it can call kmem_cache_defrag() and restrict
the defragmentation to the node that is under memory pressure.
The callback frequency into slab reclaim can be controlled by a new field
/proc/sys/vm/slab_defrag_limit.
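At run time the threshold can then be inspected and adjusted, e.g. (the value
shown is the default set by this patch, the new value is just an example):

# cat /proc/sys/vm/slab_defrag_limit
1000
# echo 5000 > /proc/sys/vm/slab_defrag_limit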
Reviewed-by: Rik van Riel <riel@redhat.com>
Signed-off-by: Christoph Lameter <clameter@sgi.com>
Signed-off-by: Pekka Enberg <penberg@cs.helsinki.fi>
Signed-off-by: Christoph Lameter <cl@linux-foundation.org>
---
Documentation/sysctl/vm.txt | 12 ++++++++
fs/drop_caches.c | 2 -
include/linux/mm.h | 3 --
include/linux/mmzone.h | 1
include/linux/swap.h | 3 ++
kernel/sysctl.c | 20 +++++++++++++
mm/vmscan.c | 65 +++++++++++++++++++++++++++++++++++++++-----
mm/vmstat.c | 2 +
8 files changed, 98 insertions(+), 10 deletions(-)
Index: linux-2.6/fs/drop_caches.c
===================================================================
--- linux-2.6.orig/fs/drop_caches.c 2008-07-31 12:18:56.000000000 -0500
+++ linux-2.6/fs/drop_caches.c 2008-07-31 12:18:58.000000000 -0500
@@ -58,7 +58,7 @@
int nr_objects;
do {
- nr_objects = shrink_slab(1000, GFP_KERNEL, 1000);
+ nr_objects = shrink_slab(1000, GFP_KERNEL, 1000, NULL);
} while (nr_objects > 10);
}
Index: linux-2.6/include/linux/mm.h
===================================================================
--- linux-2.6.orig/include/linux/mm.h 2008-07-31 12:18:56.000000000 -0500
+++ linux-2.6/include/linux/mm.h 2008-07-31 12:18:58.000000000 -0500
@@ -1283,8 +1283,7 @@
int drop_caches_sysctl_handler(struct ctl_table *, int, struct file *,
void __user *, size_t *, loff_t *);
unsigned long shrink_slab(unsigned long scanned, gfp_t gfp_mask,
- unsigned long lru_pages);
-
+ unsigned long lru_pages, struct zone *z);
#ifndef CONFIG_MMU
#define randomize_va_space 0
#else
Index: linux-2.6/mm/vmscan.c
===================================================================
--- linux-2.6.orig/mm/vmscan.c 2008-07-31 12:18:56.000000000 -0500
+++ linux-2.6/mm/vmscan.c 2008-07-31 12:18:58.000000000 -0500
@@ -150,6 +150,14 @@
EXPORT_SYMBOL(unregister_shrinker);
#define SHRINK_BATCH 128
+
+/*
+ * Trigger a call into slab defrag if the sum of the returns from
+ * shrinkers cross this value.
+ */
+int slab_defrag_limit = 1000;
+int slab_defrag_counter;
+
/*
* Call the shrink functions to age shrinkable caches
*
@@ -167,10 +175,18 @@
* are eligible for the caller's allocation attempt. It is used for balancing
* slab reclaim versus page reclaim.
*
+ * zone is the zone for which we are shrinking the slabs. If the intent
+ * is to do a global shrink then zone may be NULL. Specification of a
+ * zone is currently only used to limit slab defragmentation to a NUMA node.
+ * The performance of shrink_slab would be better (in particular under NUMA)
+ * if it could be targeted as a whole to the zone that is under memory
+ * pressure but the VFS infrastructure does not allow that at the present
+ * time.
+ *
* Returns the number of slab objects which we shrunk.
*/
unsigned long shrink_slab(unsigned long scanned, gfp_t gfp_mask,
- unsigned long lru_pages)
+ unsigned long lru_pages, struct zone *zone)
{
struct shrinker *shrinker;
unsigned long ret = 0;
@@ -227,6 +243,39 @@
shrinker->nr += total_scan;
}
up_read(&shrinker_rwsem);
+
+
+ /* Avoid dirtying cachelines */
+ if (!ret)
+ return 0;
+
+ /*
+ * "ret" doesnt really contain the freed object count. The shrinkers
+ * fake it. Gotta go with what we are getting though.
+ *
+ * Handling of the defrag_counter is also racy. If we get the
+ * wrong counts then we may unnecessarily do a defrag pass or defer
+ * one. "ret" is already faked. So this is just increasing
+ * the already existing fuzziness to get some notion as to when
+ * to initiate slab defrag which will hopefully be okay.
+ */
+ if (zone) {
+ /* balance_pgdat running on a zone so we only scan one node */
+ zone->slab_defrag_counter += ret;
+ if (zone->slab_defrag_counter > slab_defrag_limit &&
+ (gfp_mask & __GFP_FS)) {
+ zone->slab_defrag_counter = 0;
+ kmem_cache_defrag(zone_to_nid(zone));
+ }
+ } else {
+ /* Direct (and thus global) reclaim. Scan all nodes */
+ slab_defrag_counter += ret;
+ if (slab_defrag_counter > slab_defrag_limit &&
+ (gfp_mask & __GFP_FS)) {
+ slab_defrag_counter = 0;
+ kmem_cache_defrag(-1);
+ }
+ }
return ret;
}
@@ -1379,7 +1428,7 @@
* over limit cgroups
*/
if (scan_global_lru(sc)) {
- shrink_slab(sc->nr_scanned, sc->gfp_mask, lru_pages);
+ shrink_slab(sc->nr_scanned, sc->gfp_mask, lru_pages, NULL);
if (reclaim_state) {
nr_reclaimed += reclaim_state->reclaimed_slab;
reclaim_state->reclaimed_slab = 0;
@@ -1606,7 +1655,7 @@
nr_reclaimed += shrink_zone(priority, zone, &sc);
reclaim_state->reclaimed_slab = 0;
nr_slab = shrink_slab(sc.nr_scanned, GFP_KERNEL,
- lru_pages);
+ lru_pages, zone);
nr_reclaimed += reclaim_state->reclaimed_slab;
total_scanned += sc.nr_scanned;
if (zone_is_all_unreclaimable(zone))
@@ -1845,7 +1894,7 @@
/* If slab caches are huge, it's better to hit them first */
while (nr_slab >= lru_pages) {
reclaim_state.reclaimed_slab = 0;
- shrink_slab(nr_pages, sc.gfp_mask, lru_pages);
+ shrink_slab(nr_pages, sc.gfp_mask, lru_pages, NULL);
if (!reclaim_state.reclaimed_slab)
break;
@@ -1883,7 +1932,7 @@
reclaim_state.reclaimed_slab = 0;
shrink_slab(sc.nr_scanned, sc.gfp_mask,
- count_lru_pages());
+ count_lru_pages(), NULL);
ret += reclaim_state.reclaimed_slab;
if (ret >= nr_pages)
goto out;
@@ -1900,7 +1949,7 @@
if (!ret) {
do {
reclaim_state.reclaimed_slab = 0;
- shrink_slab(nr_pages, sc.gfp_mask, count_lru_pages());
+ shrink_slab(nr_pages, sc.gfp_mask, count_lru_pages(), NULL);
ret += reclaim_state.reclaimed_slab;
} while (ret < nr_pages && reclaim_state.reclaimed_slab > 0);
}
@@ -2062,7 +2111,8 @@
* Note that shrink_slab will free memory on all zones and may
* take a long time.
*/
- while (shrink_slab(sc.nr_scanned, gfp_mask, order) &&
+ while (shrink_slab(sc.nr_scanned, gfp_mask, order,
+ zone) &&
zone_page_state(zone, NR_SLAB_RECLAIMABLE) >
slab_reclaimable - nr_pages)
;
Index: linux-2.6/include/linux/mmzone.h
===================================================================
--- linux-2.6.orig/include/linux/mmzone.h 2008-07-31 12:18:56.000000000 -0500
+++ linux-2.6/include/linux/mmzone.h 2008-07-31 12:18:58.000000000 -0500
@@ -256,6 +256,7 @@
unsigned long nr_scan_active;
unsigned long nr_scan_inactive;
unsigned long pages_scanned; /* since last reclaim */
+ unsigned long slab_defrag_counter; /* since last defrag */
unsigned long flags; /* zone flags, see below */
/* Zone statistics */
Index: linux-2.6/include/linux/swap.h
===================================================================
--- linux-2.6.orig/include/linux/swap.h 2008-07-31 12:18:56.000000000 -0500
+++ linux-2.6/include/linux/swap.h 2008-07-31 12:18:58.000000000 -0500
@@ -188,6 +188,9 @@
extern int __isolate_lru_page(struct page *page, int mode);
extern unsigned long shrink_all_memory(unsigned long nr_pages);
extern int vm_swappiness;
+extern int slab_defrag_limit;
+extern int slab_defrag_counter;
+
extern int remove_mapping(struct address_space *mapping, struct page *page);
extern long vm_total_pages;
Index: linux-2.6/kernel/sysctl.c
===================================================================
--- linux-2.6.orig/kernel/sysctl.c 2008-07-31 12:18:56.000000000 -0500
+++ linux-2.6/kernel/sysctl.c 2008-07-31 12:18:58.000000000 -0500
@@ -1071,6 +1071,26 @@
.strategy = &sysctl_intvec,
.extra1 = &zero,
},
+ {
+ .ctl_name = CTL_UNNUMBERED,
+ .procname = "slab_defrag_limit",
+ .data = &slab_defrag_limit,
+ .maxlen = sizeof(slab_defrag_limit),
+ .mode = 0644,
+ .proc_handler = &proc_dointvec,
+ .strategy = &sysctl_intvec,
+ .extra1 = &one_hundred,
+ },
+ {
+ .ctl_name = CTL_UNNUMBERED,
+ .procname = "slab_defrag_count",
+ .data = &slab_defrag_counter,
+ .maxlen = sizeof(slab_defrag_counter),
+ .mode = 0444,
+ .proc_handler = &proc_dointvec,
+ .strategy = &sysctl_intvec,
+ .extra1 = &zero,
+ },
#ifdef HAVE_ARCH_PICK_MMAP_LAYOUT
{
.ctl_name = VM_LEGACY_VA_LAYOUT,
Index: linux-2.6/Documentation/sysctl/vm.txt
===================================================================
--- linux-2.6.orig/Documentation/sysctl/vm.txt 2008-07-31 12:18:56.000000000 -0500
+++ linux-2.6/Documentation/sysctl/vm.txt 2008-07-31 12:18:58.000000000 -0500
@@ -38,6 +38,7 @@
- numa_zonelist_order
- nr_hugepages
- nr_overcommit_hugepages
+- slab_defrag_limit
==============================================================
@@ -347,3 +348,14 @@
nr_hugepages + nr_overcommit_hugepages.
See Documentation/vm/hugetlbpage.txt
+
+==============================================================
+
+slab_defrag_limit
+
+Determines the frequency of calls from reclaim into slab defragmentation.
+Slab defrag reclaims objects from sparsely populated slab pages.
+The default is 1000. Increase if slab defragmentation occurs
+too frequently. Decrease if more slab defragmentation passes
+are needed. The slabinfo tool can report on the frequency of the callbacks.
+
Index: linux-2.6/mm/vmstat.c
===================================================================
--- linux-2.6.orig/mm/vmstat.c 2008-07-31 12:18:56.000000000 -0500
+++ linux-2.6/mm/vmstat.c 2008-07-31 12:18:58.000000000 -0500
@@ -714,10 +714,12 @@
#endif
}
seq_printf(m,
+ "\n slab_defrag_count: %lu"
"\n all_unreclaimable: %u"
"\n prev_priority: %i"
"\n start_pfn: %lu",
- zone_is_all_unreclaimable(zone),
+ zone->slab_defrag_counter,
+ zone_is_all_unreclaimable(zone),
zone->prev_priority,
zone->zone_start_pfn);
seq_putc(m, '\n');
--
^ permalink raw reply [flat|nested] 64+ messages in thread
* [patch 10/19] buffer heads: Support slab defrag
2008-05-10 2:21 [patch 00/19] Slab Fragmentation Reduction V13 Christoph Lameter
` (8 preceding siblings ...)
2008-05-10 2:21 ` [patch 09/19] slub: Trigger defragmentation from memory reclaim Christoph Lameter
@ 2008-05-10 2:21 ` Christoph Lameter
2008-05-10 2:21 ` [patch 11/19] inodes: Support generic defragmentation Christoph Lameter
` (9 subsequent siblings)
19 siblings, 0 replies; 64+ messages in thread
From: Christoph Lameter @ 2008-05-10 2:21 UTC (permalink / raw)
To: Pekka Enberg
Cc: akpm, Christoph Lameter, Christoph Lameter, linux-kernel,
linux-fsdevel, Mel Gorman, andi, Rik van Riel, mpm, Dave Chinner
[-- Attachment #1: 0024-Buffer-heads-Support-slab-defrag.patch --]
[-- Type: text/plain, Size: 3219 bytes --]
Defragmentation support for buffer heads. We convert the references to
buffers to struct page references and try to remove the buffers from
those pages. If the pages are dirty then trigger writeout so that the
buffer heads can be removed later.
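For orientation, the functions added below follow the get()/kick() callback
contract used throughout this series: get() pins the objects handed over in
v[] (clearing entries it cannot handle) while the slab allocator holds off
frees, and kick() runs later, without slab locks held, to try to get rid of
the pinned objects. A minimal sketch of that contract for a hypothetical
cache follows; only the callback signatures and kmem_cache_setup_defrag()
are taken from this series, struct foo and its refcounting are made up:

/* Sketch only: struct foo and foo_cachep are hypothetical. */
struct foo {
	atomic_t refcount;
	/* ... payload ... */
};

static struct kmem_cache *foo_cachep;

static void foo_release(struct foo *f)
{
	if (atomic_dec_and_test(&f->refcount))
		kmem_cache_free(foo_cachep, f);
}

/* get(): pin each object in v[]; NULL out entries that are going away. */
static void *foo_get(struct kmem_cache *s, int nr, void **v)
{
	int i;

	for (i = 0; i < nr; i++) {
		struct foo *f = v[i];

		if (!atomic_inc_not_zero(&f->refcount))
			v[i] = NULL;
	}
	return NULL;	/* no private state to pass on to kick() */
}

/* kick(): slab locks are dropped; drop the references taken in get(). */
static void foo_kick(struct kmem_cache *s, int nr, void **v, void *private)
{
	int i;

	for (i = 0; i < nr; i++)
		if (v[i])
			foo_release(v[i]);
}

/* Registration, analogous to the bh_cachep hookup at the end of this patch:
 *	kmem_cache_setup_defrag(foo_cachep, foo_get, foo_kick);
 */

A real kick() method, such as kick_buffers() below, additionally has to push
the objects toward being freeable (writeback, invalidation) before dropping
the references, since merely releasing a reference does not reclaim anything.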
Reviewed-by: Rik van Riel <riel@redhat.com>
Signed-off-by: Christoph Lameter <clameter@sgi.com>
Signed-off-by: Christoph Lameter <cl@linux-foundation.org>
---
fs/buffer.c | 99 ++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
1 file changed, 99 insertions(+)
Index: linux-2.6/fs/buffer.c
===================================================================
--- linux-2.6.orig/fs/buffer.c 2008-07-31 12:18:56.000000000 -0500
+++ linux-2.6/fs/buffer.c 2008-07-31 12:18:59.000000000 -0500
@@ -3316,6 +3316,104 @@
}
EXPORT_SYMBOL(bh_submit_read);
+/*
+ * Writeback a page to clean the dirty state
+ */
+static void trigger_write(struct page *page)
+{
+ struct address_space *mapping = page_mapping(page);
+ int rc;
+ struct writeback_control wbc = {
+ .sync_mode = WB_SYNC_NONE,
+ .nr_to_write = 1,
+ .range_start = 0,
+ .range_end = LLONG_MAX,
+ .nonblocking = 1,
+ .for_reclaim = 0
+ };
+
+ if (!mapping->a_ops->writepage)
+ /* No write method for the address space */
+ return;
+
+ if (!clear_page_dirty_for_io(page))
+ /* Someone else already triggered a write */
+ return;
+
+ rc = mapping->a_ops->writepage(page, &wbc);
+ if (rc < 0)
+ /* I/O Error writing */
+ return;
+
+ if (rc == AOP_WRITEPAGE_ACTIVATE)
+ unlock_page(page);
+}
+
+/*
+ * Get references on buffers.
+ *
+ * We obtain references on the page that uses the buffer. v[i] will point to
+ * the corresponding page after get_buffers() is through.
+ *
+ * We are safe from the underlying page being removed simply by doing
+ * a get_page_unless_zero. The buffer head removal may race at will.
+ * try_to_free_buffers will later take appropriate locks to remove the
+ * buffers if they are still there.
+ */
+static void *get_buffers(struct kmem_cache *s, int nr, void **v)
+{
+ struct page *page;
+ struct buffer_head *bh;
+ int i, j;
+ int n = 0;
+
+ for (i = 0; i < nr; i++) {
+ bh = v[i];
+ v[i] = NULL;
+
+ page = bh->b_page;
+
+ if (page && PagePrivate(page)) {
+ for (j = 0; j < n; j++)
+ if (page == v[j])
+ continue;
+ }
+
+ if (get_page_unless_zero(page))
+ v[n++] = page;
+ }
+ return NULL;
+}
+
+/*
+ * Despite its name: kick_buffers operates on a list of pointers to
+ * page structs that was set up by get_buffers().
+ */
+static void kick_buffers(struct kmem_cache *s, int nr, void **v,
+ void *private)
+{
+ struct page *page;
+ int i;
+
+ for (i = 0; i < nr; i++) {
+ page = v[i];
+
+ if (!page || PageWriteback(page))
+ continue;
+
+ if (!TestSetPageLocked(page)) {
+ if (PageDirty(page))
+ trigger_write(page);
+ else {
+ if (PagePrivate(page))
+ try_to_free_buffers(page);
+ unlock_page(page);
+ }
+ }
+ put_page(page);
+ }
+}
+
static void
init_buffer_head(void *data)
{
@@ -3334,6 +3432,7 @@
(SLAB_RECLAIM_ACCOUNT|SLAB_PANIC|
SLAB_MEM_SPREAD),
init_buffer_head);
+ kmem_cache_setup_defrag(bh_cachep, get_buffers, kick_buffers);
/*
* Limit the bh occupancy to 10% of ZONE_NORMAL
--
^ permalink raw reply [flat|nested] 64+ messages in thread
* [patch 11/19] inodes: Support generic defragmentation
2008-05-10 2:21 [patch 00/19] Slab Fragmentation Reduction V13 Christoph Lameter
` (9 preceding siblings ...)
2008-05-10 2:21 ` [patch 10/19] buffer heads: Support slab defrag Christoph Lameter
@ 2008-05-10 2:21 ` Christoph Lameter
2008-05-10 2:21 ` [patch 12/19] Filesystem: Ext2 filesystem defrag Christoph Lameter
` (8 subsequent siblings)
19 siblings, 0 replies; 64+ messages in thread
From: Christoph Lameter @ 2008-05-10 2:21 UTC (permalink / raw)
To: Pekka Enberg
Cc: akpm, Alexander Viro, Christoph Hellwig, Christoph Lameter,
Christoph Lameter, linux-kernel, linux-fsdevel, Mel Gorman, andi,
Rik van Riel, mpm, Dave Chinner
[-- Attachment #1: 0025-inodes-Support-generic-defragmentation.patch --]
[-- Type: text/plain, Size: 5124 bytes --]
This implements the ability to remove inodes in a particular slab
from inode caches. In order to remove an inode we may have to write out
the pages of the inode and the inode itself, and remove the dentries referring
to the inode.
Provide generic functionality that can be used by filesystems that have
their own inode caches to also tie into the defragmentation functions
that are made available here.
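To make the intended use concrete, here is a sketch of how a filesystem with
its own inode cache would tie in, mirroring the ext2/ext3/ext4 hookups later
in this series. Everything named foofs_* is made up for illustration;
fs_get_inodes(), kick_inodes() and kmem_cache_setup_defrag() are the
interfaces provided by this series:

/* Hypothetical filesystem inode with the VFS inode embedded in it. */
struct foofs_inode_info {
	/* ... fs private fields ... */
	struct inode vfs_inode;
};

static struct kmem_cache *foofs_inode_cachep;

static void foofs_init_once(void *obj)
{
	struct foofs_inode_info *ei = obj;

	inode_init_once(&ei->vfs_inode);
}

static void *foofs_get_inodes(struct kmem_cache *s, int nr, void **v)
{
	/* v[] points at foofs_inode_info objects; shift to the vfs_inode */
	return fs_get_inodes(s, nr, v,
			offsetof(struct foofs_inode_info, vfs_inode));
}

static int __init foofs_init_inodecache(void)
{
	foofs_inode_cachep = kmem_cache_create("foofs_inode_cache",
				sizeof(struct foofs_inode_info), 0,
				SLAB_RECLAIM_ACCOUNT | SLAB_MEM_SPREAD,
				foofs_init_once);
	if (!foofs_inode_cachep)
		return -ENOMEM;

	kmem_cache_setup_defrag(foofs_inode_cachep,
				foofs_get_inodes, kick_inodes);
	return 0;
}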
Cc: Alexander Viro <viro@ftp.linux.org.uk>
Cc: Christoph Hellwig <hch@infradead.org>
Reviewed-by: Rik van Riel <riel@redhat.com>
Signed-off-by: Christoph Lameter <clameter@sgi.com>
Signed-off-by: Christoph Lameter <cl@linux-foundation.org>
---
fs/inode.c | 123 +++++++++++++++++++++++++++++++++++++++++++++++++++++
include/linux/fs.h | 6 ++
2 files changed, 129 insertions(+)
Index: linux-2.6/fs/inode.c
===================================================================
--- linux-2.6.orig/fs/inode.c 2008-07-31 12:18:12.000000000 -0500
+++ linux-2.6/fs/inode.c 2008-07-31 12:18:15.000000000 -0500
@@ -1363,6 +1363,128 @@
__setup("ihash_entries=", set_ihash_entries);
/*
+ * Obtain a refcount on a list of struct inodes pointed to by v. If the
+ * inode is in the process of being freed then zap the v[] entry so that
+ * we skip the freeing attempts later.
+ *
+ * This is a generic function for the ->get slab defrag callback.
+ */
+void *get_inodes(struct kmem_cache *s, int nr, void **v)
+{
+ int i;
+
+ spin_lock(&inode_lock);
+ for (i = 0; i < nr; i++) {
+ struct inode *inode = v[i];
+
+ if (inode->i_state & (I_FREEING|I_CLEAR|I_WILL_FREE))
+ v[i] = NULL;
+ else
+ __iget(inode);
+ }
+ spin_unlock(&inode_lock);
+ return NULL;
+}
+EXPORT_SYMBOL(get_inodes);
+
+/*
+ * Function for filesystems that embed struct inode into their own
+ * fs inode. The offset is the offset of the struct inode in the fs inode.
+ *
+ * The function adds to the pointers in v[] in order to make them point to
+ * struct inode. Then get_inodes() is used to get the refcount.
+ * The converted v[] pointers can then also be passed to the kick() callback
+ * without further processing.
+ */
+void *fs_get_inodes(struct kmem_cache *s, int nr, void **v,
+ unsigned long offset)
+{
+ int i;
+
+ for (i = 0; i < nr; i++)
+ v[i] += offset;
+
+ return get_inodes(s, nr, v);
+}
+EXPORT_SYMBOL(fs_get_inodes);
+
+/*
+ * Generic callback function for slab defrag ->kick methods. Takes the
+ * array with inodes where we obtained refcounts using fs_get_inodes()
+ * or get_inodes() and tries to free them.
+ */
+void kick_inodes(struct kmem_cache *s, int nr, void **v, void *private)
+{
+ struct inode *inode;
+ int i;
+ int abort = 0;
+ LIST_HEAD(freeable);
+ int active;
+
+ for (i = 0; i < nr; i++) {
+ inode = v[i];
+ if (!inode)
+ continue;
+
+ if (inode_has_buffers(inode) || inode->i_data.nrpages) {
+ if (remove_inode_buffers(inode))
+ /*
+ * Should we really be doing this? Or
+ * limit the writeback here to only a few pages?
+ *
+ * Possibly an expensive operation but we
+ * cannot reclaim the inode if the pages
+ * are still present.
+ */
+ invalidate_mapping_pages(&inode->i_data,
+ 0, -1);
+ }
+
+ /* Invalidate children and dentry */
+ if (S_ISDIR(inode->i_mode)) {
+ struct dentry *d = d_find_alias(inode);
+
+ if (d) {
+ d_invalidate(d);
+ dput(d);
+ }
+ }
+
+ if (inode->i_state & I_DIRTY)
+ write_inode_now(inode, 1);
+
+ d_prune_aliases(inode);
+ }
+
+ mutex_lock(&iprune_mutex);
+ for (i = 0; i < nr; i++) {
+ inode = v[i];
+
+ if (!inode)
+ /* inode is already being freed */
+ continue;
+
+ active = inode->i_sb->s_flags & MS_ACTIVE;
+ iput(inode);
+ if (abort || !active)
+ continue;
+
+ spin_lock(&inode_lock);
+ abort = !can_unuse(inode);
+
+ if (!abort) {
+ list_move(&inode->i_list, &freeable);
+ inode->i_state |= I_FREEING;
+ inodes_stat.nr_unused--;
+ }
+ spin_unlock(&inode_lock);
+ }
+ dispose_list(&freeable);
+ mutex_unlock(&iprune_mutex);
+}
+EXPORT_SYMBOL(kick_inodes);
+
+/*
* Initialize the waitqueues and inode hash table.
*/
void __init inode_init_early(void)
@@ -1401,6 +1523,7 @@
SLAB_MEM_SPREAD),
init_once);
register_shrinker(&icache_shrinker);
+ kmem_cache_setup_defrag(inode_cachep, get_inodes, kick_inodes);
/* Hash may have been set up in inode_init_early */
if (!hashdist)
Index: linux-2.6/include/linux/fs.h
===================================================================
--- linux-2.6.orig/include/linux/fs.h 2008-07-31 12:18:12.000000000 -0500
+++ linux-2.6/include/linux/fs.h 2008-07-31 12:18:15.000000000 -0500
@@ -1844,6 +1844,12 @@
__insert_inode_hash(inode, inode->i_ino);
}
+/* Helper functions for inode defragmentation support in filesystems */
+extern void kick_inodes(struct kmem_cache *, int, void **, void *);
+extern void *get_inodes(struct kmem_cache *, int nr, void **);
+extern void *fs_get_inodes(struct kmem_cache *, int nr, void **,
+ unsigned long offset);
+
extern struct file * get_empty_filp(void);
extern void file_move(struct file *f, struct list_head *list);
extern void file_kill(struct file *f);
--
^ permalink raw reply [flat|nested] 64+ messages in thread
* [patch 12/19] Filesystem: Ext2 filesystem defrag
2008-05-10 2:21 [patch 00/19] Slab Fragmentation Reduction V13 Christoph Lameter
` (10 preceding siblings ...)
2008-05-10 2:21 ` [patch 11/19] inodes: Support generic defragmentation Christoph Lameter
@ 2008-05-10 2:21 ` Christoph Lameter
2008-05-10 2:21 ` [patch 13/19] Filesystem: Ext3 " Christoph Lameter
` (7 subsequent siblings)
19 siblings, 0 replies; 64+ messages in thread
From: Christoph Lameter @ 2008-05-10 2:21 UTC (permalink / raw)
To: Pekka Enberg
Cc: akpm, Christoph Lameter, Christoph Lameter, linux-kernel,
linux-fsdevel, Mel Gorman, andi, Rik van Riel, mpm, Dave Chinner
[-- Attachment #1: ext2-defrag --]
[-- Type: text/plain, Size: 1035 bytes --]
Support defragmentation for ext2 filesystem inodes
Reviewed-by: Rik van Riel <riel@redhat.com>
Signed-off-by: Christoph Lameter <clameter@sgi.com>
Signed-off-by: Christoph Lameter <cl@linux-foundation.org>
---
fs/ext2/super.c | 9 +++++++++
1 file changed, 9 insertions(+)
Index: linux-2.6/fs/ext2/super.c
===================================================================
--- linux-2.6.orig/fs/ext2/super.c 2008-07-31 12:18:12.000000000 -0500
+++ linux-2.6/fs/ext2/super.c 2008-07-31 12:18:15.000000000 -0500
@@ -171,6 +171,12 @@
inode_init_once(&ei->vfs_inode);
}
+static void *ext2_get_inodes(struct kmem_cache *s, int nr, void **v)
+{
+ return fs_get_inodes(s, nr, v,
+ offsetof(struct ext2_inode_info, vfs_inode));
+}
+
static int init_inodecache(void)
{
ext2_inode_cachep = kmem_cache_create("ext2_inode_cache",
@@ -180,6 +186,9 @@
init_once);
if (ext2_inode_cachep == NULL)
return -ENOMEM;
+
+ kmem_cache_setup_defrag(ext2_inode_cachep,
+ ext2_get_inodes, kick_inodes);
return 0;
}
--
^ permalink raw reply [flat|nested] 64+ messages in thread
* [patch 13/19] Filesystem: Ext3 filesystem defrag
2008-05-10 2:21 [patch 00/19] Slab Fragmentation Reduction V13 Christoph Lameter
` (11 preceding siblings ...)
2008-05-10 2:21 ` [patch 12/19] Filesystem: Ext2 filesystem defrag Christoph Lameter
@ 2008-05-10 2:21 ` Christoph Lameter
2008-05-10 2:21 ` [patch 14/19] Filesystem: Ext4 " Christoph Lameter
` (6 subsequent siblings)
19 siblings, 0 replies; 64+ messages in thread
From: Christoph Lameter @ 2008-05-10 2:21 UTC (permalink / raw)
To: Pekka Enberg
Cc: akpm, Christoph Lameter, Christoph Lameter, linux-kernel,
linux-fsdevel, Mel Gorman, andi, Rik van Riel, mpm, Dave Chinner
[-- Attachment #1: ext3-defrag --]
[-- Type: text/plain, Size: 1032 bytes --]
Support defragmentation for ext3 filesystem inodes
Reviewed-by: Rik van Riel <riel@redhat.com>
Signed-off-by: Christoph Lameter <clameter@sgi.com>
Signed-off-by: Christoph Lameter <cl@linux-foundation.org>
---
fs/ext3/super.c | 8 ++++++++
1 file changed, 8 insertions(+)
Index: linux-2.6/fs/ext3/super.c
===================================================================
--- linux-2.6.orig/fs/ext3/super.c 2008-07-31 12:18:12.000000000 -0500
+++ linux-2.6/fs/ext3/super.c 2008-07-31 12:18:15.000000000 -0500
@@ -484,6 +484,12 @@
inode_init_once(&ei->vfs_inode);
}
+static void *ext3_get_inodes(struct kmem_cache *s, int nr, void **v)
+{
+ return fs_get_inodes(s, nr, v,
+ offsetof(struct ext3_inode_info, vfs_inode));
+}
+
static int init_inodecache(void)
{
ext3_inode_cachep = kmem_cache_create("ext3_inode_cache",
@@ -493,6 +499,8 @@
init_once);
if (ext3_inode_cachep == NULL)
return -ENOMEM;
+ kmem_cache_setup_defrag(ext3_inode_cachep,
+ ext3_get_inodes, kick_inodes);
return 0;
}
--
^ permalink raw reply [flat|nested] 64+ messages in thread
* [patch 14/19] Filesystem: Ext4 filesystem defrag
2008-05-10 2:21 [patch 00/19] Slab Fragmentation Reduction V13 Christoph Lameter
` (12 preceding siblings ...)
2008-05-10 2:21 ` [patch 13/19] Filesystem: Ext3 " Christoph Lameter
@ 2008-05-10 2:21 ` Christoph Lameter
2008-08-03 1:54 ` Theodore Tso
2008-05-10 2:21 ` [patch 15/19] Filesystem: XFS slab defragmentation Christoph Lameter
` (5 subsequent siblings)
19 siblings, 1 reply; 64+ messages in thread
From: Christoph Lameter @ 2008-05-10 2:21 UTC (permalink / raw)
To: Pekka Enberg
Cc: akpm, Christoph Lameter, Christoph Lameter, linux-kernel,
linux-fsdevel, Mel Gorman, andi, Rik van Riel, mpm, Dave Chinner
[-- Attachment #1: ext4-defrag --]
[-- Type: text/plain, Size: 1032 bytes --]
Support defragmentation for extX filesystem inodes
Reviewed-by: Rik van Riel <riel@redhat.com>
Signed-off-by: Christoph Lameter <clameter@sgi.com>
Signed-off-by: Christoph Lameter <cl@linux-foundation.org>
---
fs/ext4/super.c | 8 ++++++++
1 file changed, 8 insertions(+)
Index: linux-2.6/fs/ext4/super.c
===================================================================
--- linux-2.6.orig/fs/ext4/super.c 2008-07-31 12:18:12.000000000 -0500
+++ linux-2.6/fs/ext4/super.c 2008-07-31 12:18:15.000000000 -0500
@@ -607,6 +607,12 @@
inode_init_once(&ei->vfs_inode);
}
+static void *ext4_get_inodes(struct kmem_cache *s, int nr, void **v)
+{
+ return fs_get_inodes(s, nr, v,
+ offsetof(struct ext4_inode_info, vfs_inode));
+}
+
static int init_inodecache(void)
{
ext4_inode_cachep = kmem_cache_create("ext4_inode_cache",
@@ -616,6 +622,8 @@
init_once);
if (ext4_inode_cachep == NULL)
return -ENOMEM;
+ kmem_cache_setup_defrag(ext4_inode_cachep,
+ ext4_get_inodes, kick_inodes);
return 0;
}
--
^ permalink raw reply [flat|nested] 64+ messages in thread
* [patch 15/19] Filesystem: XFS slab defragmentation
2008-05-10 2:21 [patch 00/19] Slab Fragmentation Reduction V13 Christoph Lameter
` (13 preceding siblings ...)
2008-05-10 2:21 ` [patch 14/19] Filesystem: Ext4 " Christoph Lameter
@ 2008-05-10 2:21 ` Christoph Lameter
2008-08-03 1:42 ` Dave Chinner
2008-05-10 2:21 ` [patch 16/19] Filesystem: /proc filesystem support for slab defrag Christoph Lameter
` (4 subsequent siblings)
19 siblings, 1 reply; 64+ messages in thread
From: Christoph Lameter @ 2008-05-10 2:21 UTC (permalink / raw)
To: Pekka Enberg
Cc: akpm, Christoph Lameter, Christoph Lameter, linux-kernel,
linux-fsdevel, Mel Gorman, andi, Rik van Riel, mpm, Dave Chinner
[-- Attachment #1: 0027-FS-XFS-slab-defragmentation.patch --]
[-- Type: text/plain, Size: 877 bytes --]
Support inode defragmentation for xfs
Reviewed-by: Rik van Riel <riel@redhat.com>
Signed-off-by: Christoph Lameter <clameter@sgi.com>
Signed-off-by: Christoph Lameter <cl@linux-foundation.org>
---
fs/xfs/linux-2.6/xfs_super.c | 1 +
1 files changed, 1 insertions(+), 0 deletions(-)
Index: linux-2.6/fs/xfs/linux-2.6/xfs_super.c
===================================================================
--- linux-2.6.orig/fs/xfs/linux-2.6/xfs_super.c 2008-07-31 12:18:12.000000000 -0500
+++ linux-2.6/fs/xfs/linux-2.6/xfs_super.c 2008-07-31 12:18:15.000000000 -0500
@@ -861,6 +861,7 @@
xfs_ioend_zone = kmem_zone_init(sizeof(xfs_ioend_t), "xfs_ioend");
if (!xfs_ioend_zone)
goto out_destroy_vnode_zone;
+ kmem_cache_setup_defrag(xfs_vnode_zone, get_inodes, kick_inodes);
xfs_ioend_pool = mempool_create_slab_pool(4 * MAX_BUF_PER_PAGE,
xfs_ioend_zone);
--
^ permalink raw reply [flat|nested] 64+ messages in thread
* [patch 16/19] Filesystem: /proc filesystem support for slab defrag
2008-05-10 2:21 [patch 00/19] Slab Fragmentation Reduction V13 Christoph Lameter
` (14 preceding siblings ...)
2008-05-10 2:21 ` [patch 15/19] Filesystem: XFS slab defragmentation Christoph Lameter
@ 2008-05-10 2:21 ` Christoph Lameter
2008-05-10 2:21 ` [patch 17/19] Filesystem: Slab defrag: Reiserfs support Christoph Lameter
` (3 subsequent siblings)
19 siblings, 0 replies; 64+ messages in thread
From: Christoph Lameter @ 2008-05-10 2:21 UTC (permalink / raw)
To: Pekka Enberg
Cc: akpm, Alexey Dobriyan, Christoph Lameter, Christoph Lameter,
linux-kernel, linux-fsdevel, Mel Gorman, andi, Rik van Riel, mpm,
Dave Chinner
[-- Attachment #1: 0028-FS-Proc-filesystem-support-for-slab-defrag.patch --]
[-- Type: text/plain, Size: 1096 bytes --]
Support procfs inode defragmentation
Cc: Alexey Dobriyan <adobriyan@sw.ru>
Reviewed-by: Rik van Riel <riel@redhat.com>
Signed-off-by: Christoph Lameter <clameter@sgi.com>
Signed-off-by: Christoph Lameter <cl@linux-foundation.org>
---
fs/proc/inode.c | 8 ++++++++
1 files changed, 8 insertions(+), 0 deletions(-)
Index: linux-2.6/fs/proc/inode.c
===================================================================
--- linux-2.6.orig/fs/proc/inode.c 2008-07-31 12:18:12.000000000 -0500
+++ linux-2.6/fs/proc/inode.c 2008-07-31 12:18:15.000000000 -0500
@@ -106,6 +106,12 @@
inode_init_once(&ei->vfs_inode);
}
+static void *proc_get_inodes(struct kmem_cache *s, int nr, void **v)
+{
+ return fs_get_inodes(s, nr, v,
+ offsetof(struct proc_inode, vfs_inode));
+};
+
int __init proc_init_inodecache(void)
{
proc_inode_cachep = kmem_cache_create("proc_inode_cache",
@@ -113,6 +119,8 @@
0, (SLAB_RECLAIM_ACCOUNT|
SLAB_MEM_SPREAD|SLAB_PANIC),
init_once);
+ kmem_cache_setup_defrag(proc_inode_cachep,
+ proc_get_inodes, kick_inodes);
return 0;
}
--
^ permalink raw reply [flat|nested] 64+ messages in thread
* [patch 17/19] Filesystem: Slab defrag: Reiserfs support
2008-05-10 2:21 [patch 00/19] Slab Fragmentation Reduction V13 Christoph Lameter
` (15 preceding siblings ...)
2008-05-10 2:21 ` [patch 16/19] Filesystem: /proc filesystem support for slab defrag Christoph Lameter
@ 2008-05-10 2:21 ` Christoph Lameter
2008-05-10 2:21 ` [patch 18/19] dentries: Add constructor Christoph Lameter
` (2 subsequent siblings)
19 siblings, 0 replies; 64+ messages in thread
From: Christoph Lameter @ 2008-05-10 2:21 UTC (permalink / raw)
To: Pekka Enberg
Cc: akpm, Christoph Lameter, Christoph Lameter, linux-kernel,
linux-fsdevel, Mel Gorman, andi, Rik van Riel, mpm, Dave Chinner
[-- Attachment #1: 0029-FS-Slab-defrag-Reiserfs-support.patch --]
[-- Type: text/plain, Size: 1073 bytes --]
Slab defragmentation: Support reiserfs inode defragmentation.
Reviewed-by: Rik van Riel <riel@redhat.com>
Signed-off-by: Christoph Lameter <clameter@sgi.com>
Signed-off-by: Christoph Lameter <cl@linux-foundation.org>
---
fs/reiserfs/super.c | 8 ++++++++
1 files changed, 8 insertions(+), 0 deletions(-)
Index: linux-2.6/fs/reiserfs/super.c
===================================================================
--- linux-2.6.orig/fs/reiserfs/super.c 2008-07-31 12:18:12.000000000 -0500
+++ linux-2.6/fs/reiserfs/super.c 2008-07-31 12:18:15.000000000 -0500
@@ -533,6 +533,12 @@
#endif
}
+static void *reiserfs_get_inodes(struct kmem_cache *s, int nr, void **v)
+{
+ return fs_get_inodes(s, nr, v,
+ offsetof(struct reiserfs_inode_info, vfs_inode));
+}
+
static int init_inodecache(void)
{
reiserfs_inode_cachep = kmem_cache_create("reiser_inode_cache",
@@ -543,6 +549,8 @@
init_once);
if (reiserfs_inode_cachep == NULL)
return -ENOMEM;
+ kmem_cache_setup_defrag(reiserfs_inode_cachep,
+ reiserfs_get_inodes, kick_inodes);
return 0;
}
--
^ permalink raw reply [flat|nested] 64+ messages in thread
* [patch 18/19] dentries: Add constructor
2008-05-10 2:21 [patch 00/19] Slab Fragmentation Reduction V13 Christoph Lameter
` (16 preceding siblings ...)
2008-05-10 2:21 ` [patch 17/19] Filesystem: Slab defrag: Reiserfs support Christoph Lameter
@ 2008-05-10 2:21 ` Christoph Lameter
2008-05-10 2:21 ` [patch 19/19] dentries: dentry defragmentation Christoph Lameter
2008-08-03 1:58 ` No, really, stop trying to delete slab until you've finished making slub perform as well Matthew Wilcox
19 siblings, 0 replies; 64+ messages in thread
From: Christoph Lameter @ 2008-05-10 2:21 UTC (permalink / raw)
To: Pekka Enberg
Cc: akpm, Alexander Viro, Christoph Hellwig, Christoph Lameter,
Christoph Lameter, linux-kernel, linux-fsdevel, Mel Gorman, andi,
Rik van Riel, mpm, Dave Chinner
[-- Attachment #1: 0031-dentries-Add-constructor.patch --]
[-- Type: text/plain, Size: 2156 bytes --]
In order to support defragmentation on the dentry cache we need to have
a determined object state at all times. Without a constructor the object
would have a random state after allocation.
So provide a constructor.
Cc: Alexander Viro <viro@ftp.linux.org.uk>
Cc: Christoph Hellwig <hch@infradead.org>
Reviewed-by: Rik van Riel <riel@redhat.com>
Signed-off-by: Christoph Lameter <clameter@sgi.com>
Signed-off-by: Christoph Lameter <cl@linux-foundation.org>
---
fs/dcache.c | 26 ++++++++++++++------------
1 file changed, 14 insertions(+), 12 deletions(-)
Index: linux-2.6/fs/dcache.c
===================================================================
--- linux-2.6.orig/fs/dcache.c 2008-07-31 12:18:11.000000000 -0500
+++ linux-2.6/fs/dcache.c 2008-07-31 12:18:27.000000000 -0500
@@ -899,6 +899,16 @@
.seeks = DEFAULT_SEEKS,
};
+static void dcache_ctor(void *p)
+{
+ struct dentry *dentry = p;
+
+ spin_lock_init(&dentry->d_lock);
+ dentry->d_inode = NULL;
+ INIT_LIST_HEAD(&dentry->d_lru);
+ INIT_LIST_HEAD(&dentry->d_alias);
+}
+
/**
* d_alloc - allocate a dcache entry
* @parent: parent of entry to allocate
@@ -936,8 +946,6 @@
atomic_set(&dentry->d_count, 1);
dentry->d_flags = DCACHE_UNHASHED;
- spin_lock_init(&dentry->d_lock);
- dentry->d_inode = NULL;
dentry->d_parent = NULL;
dentry->d_sb = NULL;
dentry->d_op = NULL;
@@ -947,9 +955,7 @@
dentry->d_cookie = NULL;
#endif
INIT_HLIST_NODE(&dentry->d_hash);
- INIT_LIST_HEAD(&dentry->d_lru);
INIT_LIST_HEAD(&dentry->d_subdirs);
- INIT_LIST_HEAD(&dentry->d_alias);
if (parent) {
dentry->d_parent = dget(parent);
@@ -2174,14 +2180,10 @@
{
int loop;
- /*
- * A constructor could be added for stable state like the lists,
- * but it is probably not worth it because of the cache nature
- * of the dcache.
- */
- dentry_cache = KMEM_CACHE(dentry,
- SLAB_RECLAIM_ACCOUNT|SLAB_PANIC|SLAB_MEM_SPREAD);
-
+ dentry_cache = kmem_cache_create("dentry_cache", sizeof(struct dentry),
+ 0, SLAB_RECLAIM_ACCOUNT|SLAB_PANIC|SLAB_MEM_SPREAD,
+ dcache_ctor);
+
register_shrinker(&dcache_shrinker);
/* Hash may have been set up in dcache_init_early */
--
^ permalink raw reply [flat|nested] 64+ messages in thread
* [patch 19/19] dentries: dentry defragmentation
2008-05-10 2:21 [patch 00/19] Slab Fragmentation Reduction V13 Christoph Lameter
` (17 preceding siblings ...)
2008-05-10 2:21 ` [patch 18/19] dentries: Add constructor Christoph Lameter
@ 2008-05-10 2:21 ` Christoph Lameter
2008-08-03 1:58 ` No, really, stop trying to delete slab until you've finished making slub perform as well Matthew Wilcox
19 siblings, 0 replies; 64+ messages in thread
From: Christoph Lameter @ 2008-05-10 2:21 UTC (permalink / raw)
To: Pekka Enberg
Cc: akpm, Alexander Viro, Christoph Hellwig, Christoph Lameter,
Christoph Lameter, linux-kernel, linux-fsdevel, Mel Gorman, andi,
Rik van Riel, mpm, Dave Chinner
[-- Attachment #1: 0032-dentries-dentry-defragmentation.patch --]
[-- Type: text/plain, Size: 4092 bytes --]
The dentry pruning for unused entries works in a straightforward way. It
could be made more aggressive if one would actually move dentries instead
of just reclaiming them.
Cc: Alexander Viro <viro@ftp.linux.org.uk>
Cc: Christoph Hellwig <hch@infradead.org>
Reviewed-by: Rik van Riel <riel@redhat.com>
Signed-off-by: Christoph Lameter <clameter@sgi.com>
Signed-off-by: Christoph Lameter <cl@linux-foundation.org>
---
fs/dcache.c | 101 +++++++++++++++++++++++++++++++++++++++++++++++++++++++++++-
1 file changed, 100 insertions(+), 1 deletion(-)
Index: linux-2.6/fs/dcache.c
===================================================================
--- linux-2.6.orig/fs/dcache.c 2008-07-31 12:18:15.000000000 -0500
+++ linux-2.6/fs/dcache.c 2008-07-31 12:18:15.000000000 -0500
@@ -32,6 +32,7 @@
#include <linux/seqlock.h>
#include <linux/swap.h>
#include <linux/bootmem.h>
+#include <linux/backing-dev.h>
#include "internal.h"
@@ -172,7 +173,10 @@
list_del(&dentry->d_u.d_child);
dentry_stat.nr_dentry--; /* For d_free, below */
- /*drops the locks, at that point nobody can reach this dentry */
+ /*
+ * drops the locks, at that point nobody (aside from defrag)
+ * can reach this dentry
+ */
dentry_iput(dentry);
parent = dentry->d_parent;
d_free(dentry);
@@ -2176,6 +2180,100 @@
INIT_HLIST_HEAD(&dentry_hashtable[loop]);
}
+/*
+ * The slab allocator is holding off frees. We can safely examine
+ * the object without the danger of it vanishing from under us.
+ */
+static void *get_dentries(struct kmem_cache *s, int nr, void **v)
+{
+ struct dentry *dentry;
+ int i;
+
+ spin_lock(&dcache_lock);
+ for (i = 0; i < nr; i++) {
+ dentry = v[i];
+
+ /*
+ * Three sorts of dentries cannot be reclaimed:
+ *
+ * 1. dentries that are in the process of being allocated
+ * or being freed. In that case the dentry is neither
+ * on the LRU nor hashed.
+ *
+ * 2. Fake hashed entries as used for anonymous dentries
+ * and pipe I/O. The fake hashed entries have d_flags
+ * set to indicate a hashed entry. However, the
+ * d_hash field indicates that the entry is not hashed.
+ *
+ * 3. dentries that have a backing store that is not
+ * writable. This is true for tmpfs and other in
+ * memory filesystems. Removing dentries from them
+ * would lose dentries for good.
+ */
+ if ((d_unhashed(dentry) && list_empty(&dentry->d_lru)) ||
+ (!d_unhashed(dentry) && hlist_unhashed(&dentry->d_hash)) ||
+ (dentry->d_inode &&
+ !mapping_cap_writeback_dirty(dentry->d_inode->i_mapping)))
+ /* Ignore this dentry */
+ v[i] = NULL;
+ else
+ /* dget_locked will remove the dentry from the LRU */
+ dget_locked(dentry);
+ }
+ spin_unlock(&dcache_lock);
+ return NULL;
+}
+
+/*
+ * Slab has dropped all the locks. Get rid of the refcount obtained
+ * earlier and also free the object.
+ */
+static void kick_dentries(struct kmem_cache *s,
+ int nr, void **v, void *private)
+{
+ struct dentry *dentry;
+ int i;
+
+ /*
+ * First invalidate the dentries without holding the dcache lock
+ */
+ for (i = 0; i < nr; i++) {
+ dentry = v[i];
+
+ if (dentry)
+ d_invalidate(dentry);
+ }
+
+ /*
+ * If we are the last one holding a reference then the dentries can
+ * be freed. We need the dcache_lock.
+ */
+ spin_lock(&dcache_lock);
+ for (i = 0; i < nr; i++) {
+ dentry = v[i];
+ if (!dentry)
+ continue;
+
+ spin_lock(&dentry->d_lock);
+ if (atomic_read(&dentry->d_count) > 1) {
+ spin_unlock(&dentry->d_lock);
+ spin_unlock(&dcache_lock);
+ dput(dentry);
+ spin_lock(&dcache_lock);
+ continue;
+ }
+
+ prune_one_dentry(dentry);
+ }
+ spin_unlock(&dcache_lock);
+
+ /*
+ * dentries are freed using RCU so we need to wait until RCU
+ * operations are complete.
+ */
+ synchronize_rcu();
+}
+
static void __init dcache_init(void)
{
int loop;
@@ -2185,6 +2283,7 @@
dcache_ctor);
register_shrinker(&dcache_shrinker);
+ kmem_cache_setup_defrag(dentry_cache, get_dentries, kick_dentries);
/* Hash may have been set up in dcache_init_early */
if (!hashdist)
--
^ permalink raw reply [flat|nested] 64+ messages in thread
* Re: [patch 15/19] Filesystem: XFS slab defragmentation
2008-05-10 2:21 ` [patch 15/19] Filesystem: XFS slab defragmentation Christoph Lameter
@ 2008-08-03 1:42 ` Dave Chinner
2008-08-04 13:36 ` Christoph Lameter
0 siblings, 1 reply; 64+ messages in thread
From: Dave Chinner @ 2008-08-03 1:42 UTC (permalink / raw)
To: Christoph Lameter
Cc: Pekka Enberg, akpm, Christoph Lameter, linux-kernel,
linux-fsdevel, Mel Gorman, andi, Rik van Riel, mpm
On Fri, May 09, 2008 at 07:21:16PM -0700, Christoph Lameter wrote:
> Support inode defragmentation for xfs
>
> Reviewed-by: Rik van Riel <riel@redhat.com>
> Signed-off-by: Christoph Lameter <clameter@sgi.com>
> Signed-off-by: Christoph Lameter <cl@linux-foundation.org>
>
> ---
> fs/xfs/linux-2.6/xfs_super.c | 1 +
> 1 files changed, 1 insertions(+), 0 deletions(-)
>
> Index: linux-2.6/fs/xfs/linux-2.6/xfs_super.c
> ===================================================================
> --- linux-2.6.orig/fs/xfs/linux-2.6/xfs_super.c 2008-07-31 12:18:12.000000000 -0500
> +++ linux-2.6/fs/xfs/linux-2.6/xfs_super.c 2008-07-31 12:18:15.000000000 -0500
> @@ -861,6 +861,7 @@
> xfs_ioend_zone = kmem_zone_init(sizeof(xfs_ioend_t), "xfs_ioend");
> if (!xfs_ioend_zone)
> goto out_destroy_vnode_zone;
> + kmem_cache_setup_defrag(xfs_vnode_zone, get_inodes, kick_inodes);
>
> xfs_ioend_pool = mempool_create_slab_pool(4 * MAX_BUF_PER_PAGE,
> xfs_ioend_zone);
I think that hunk is mis-applied. You're configuring the
xfs_vnode_zone defrag after allocating the xfs_ioend_zone. This
should be a few lines higher up, right?
Cheers,
Dave.
--
Dave Chinner
david@fromorbit.com
^ permalink raw reply [flat|nested] 64+ messages in thread
* Re: [patch 14/19] Filesystem: Ext4 filesystem defrag
2008-05-10 2:21 ` [patch 14/19] Filesystem: Ext4 " Christoph Lameter
@ 2008-08-03 1:54 ` Theodore Tso
2008-08-13 7:26 ` Pekka Enberg
0 siblings, 1 reply; 64+ messages in thread
From: Theodore Tso @ 2008-08-03 1:54 UTC (permalink / raw)
To: Christoph Lameter
Cc: Pekka Enberg, akpm, Christoph Lameter, linux-kernel,
linux-fsdevel, Mel Gorman, andi, Rik van Riel, mpm, Dave Chinner
On Fri, May 09, 2008 at 07:21:15PM -0700, Christoph Lameter wrote:
> Support defragmentation for extX filesystem inodes
You forgot to change "extX" to "ext4". :-)
> Reviewed-by: Rik van Riel <riel@redhat.com>
> Signed-off-by: Christoph Lameter <clameter@sgi.com>
> Signed-off-by: Christoph Lameter <cl@linux-foundation.org>
Acked-by: "Theodore Ts'o" <tytso@mit.edu>
- Ted
^ permalink raw reply [flat|nested] 64+ messages in thread
* No, really, stop trying to delete slab until you've finished making slub perform as well
2008-05-10 2:21 [patch 00/19] Slab Fragmentation Reduction V13 Christoph Lameter
` (18 preceding siblings ...)
2008-05-10 2:21 ` [patch 19/19] dentries: dentry defragmentation Christoph Lameter
@ 2008-08-03 1:58 ` Matthew Wilcox
2008-08-03 21:25 ` Pekka Enberg
2008-08-04 13:43 ` Christoph Lameter
19 siblings, 2 replies; 64+ messages in thread
From: Matthew Wilcox @ 2008-08-03 1:58 UTC (permalink / raw)
To: Christoph Lameter
Cc: Pekka Enberg, akpm, linux-kernel, linux-fsdevel, Mel Gorman, andi,
Rik van Riel
On Fri, May 09, 2008 at 07:21:01PM -0700, Christoph Lameter wrote:
> - Add a patch that obsoletes SLAB and explains why SLOB does not support
> defrag (Either of those could be theoretically equipped to support
> slab defrag in some way but it seems that Andrew/Linus want to reduce
> the number of slab allocators).
Do we have to once again explain that slab still outperforms slub on at
least one important benchmark? I hope Nick Piggin finds time to finish
tuning slqb; it already outperforms slub.
--
Intel are signing my paycheques ... these opinions are still mine
"Bill, look, we understand that you're interested in selling us this
operating system, but compare it to ours. We can't possibly take such
a retrograde step."
^ permalink raw reply [flat|nested] 64+ messages in thread
* Re: No, really, stop trying to delete slab until you've finished making slub perform as well
2008-08-03 1:58 ` No, really, stop trying to delete slab until you've finished making slub perform as well Matthew Wilcox
@ 2008-08-03 21:25 ` Pekka Enberg
2008-08-04 2:37 ` Rene Herman
2008-08-04 13:43 ` Christoph Lameter
1 sibling, 1 reply; 64+ messages in thread
From: Pekka Enberg @ 2008-08-03 21:25 UTC (permalink / raw)
To: Matthew Wilcox
Cc: Christoph Lameter, akpm, linux-kernel, linux-fsdevel, Mel Gorman,
andi, Rik van Riel
Hi Matthew,
Matthew Wilcox wrote:
> Do we have to once again explain that slab still outperforms slub on at
> least one important benchmark? I hope Nick Piggin finds time to finish
> tuning slqb; it already outperforms slub.
No, you don't have to. I haven't merged that patch nor do I intend to do
so until the regressions are fixed.
And yes, I'm still waiting to hear from you how we're now doing with
higher order page allocations...
Pekka
^ permalink raw reply [flat|nested] 64+ messages in thread
* Re: No, really, stop trying to delete slab until you've finished making slub perform as well
2008-08-03 21:25 ` Pekka Enberg
@ 2008-08-04 2:37 ` Rene Herman
2008-08-04 21:22 ` Pekka Enberg
0 siblings, 1 reply; 64+ messages in thread
From: Rene Herman @ 2008-08-04 2:37 UTC (permalink / raw)
To: Pekka Enberg
Cc: Matthew Wilcox, Christoph Lameter, akpm, linux-kernel,
linux-fsdevel, Mel Gorman, andi, Rik van Riel
On 03-08-08 23:25, Pekka Enberg wrote:
> Matthew Wilcox wrote:
>> Do we have to once again explain that slab still outperforms slub on at
>> least one important benchmark? I hope Nick Piggin finds time to finish
>> tuning slqb; it already outperforms slub.
>
> No, you don't have to. I haven't merged that patch nor do I intend to do
> so until the regressions are fixed.
>
> And yes, I'm still waiting to hear from you how we're now doing with
> higher order page allocations...
General interested question -- I recently "accidentally" read some of
slub and I believe that it doesn't feature the cache colouring support
that slab did? Is that true, and if so, wasn't it needed/useful?
Rene.
^ permalink raw reply [flat|nested] 64+ messages in thread
* Re: [patch 15/19] Filesystem: XFS slab defragmentation
2008-08-03 1:42 ` Dave Chinner
@ 2008-08-04 13:36 ` Christoph Lameter
0 siblings, 0 replies; 64+ messages in thread
From: Christoph Lameter @ 2008-08-04 13:36 UTC (permalink / raw)
To: Christoph Lameter, Pekka Enberg, akpm, Christoph Lameter,
linux-kernel, linu
Dave Chinner wrote:
> I think that hunk is mis-applied. You're configuring the
> xfs_vnode_zone defrag after allocating the xfs_ioend_zone. This
> should be afew lines higher up, right?
That would be nicer but it's not a bug to have the setup where it is right now.
Fix:
Subject: defrag/xfs: Move defrag setup directly after xfs_vnode_zone kmem
cache creation
Move the setup of the defrag directly after the creation of the xfs_vnode_zone
Signed-off-by: Christoph Lameter <cl@linux-foundation.org>
Index: linux-2.6/fs/xfs/linux-2.6/xfs_super.c
===================================================================
--- linux-2.6.orig/fs/xfs/linux-2.6/xfs_super.c 2008-08-04 08:27:09.000000000 -0500
+++ linux-2.6/fs/xfs/linux-2.6/xfs_super.c 2008-08-04 08:27:25.000000000 -0500
@@ -2021,11 +2021,11 @@
if (!xfs_vnode_zone)
goto out;
+ kmem_cache_setup_defrag(xfs_vnode_zone, get_inodes, kick_inodes);
+
xfs_ioend_zone = kmem_zone_init(sizeof(xfs_ioend_t), "xfs_ioend");
if (!xfs_ioend_zone)
goto out_destroy_vnode_zone;
- kmem_cache_setup_defrag(xfs_vnode_zone, get_inodes, kick_inodes);
-
xfs_ioend_pool = mempool_create_slab_pool(4 * MAX_BUF_PER_PAGE,
xfs_ioend_zone);
if (!xfs_ioend_pool)
^ permalink raw reply [flat|nested] 64+ messages in thread
* Re: No, really, stop trying to delete slab until you've finished making slub perform as well
2008-08-03 1:58 ` No, really, stop trying to delete slab until you've finished making slub perform as well Matthew Wilcox
2008-08-03 21:25 ` Pekka Enberg
@ 2008-08-04 13:43 ` Christoph Lameter
2008-08-04 14:48 ` Jamie Lokier
` (2 more replies)
1 sibling, 3 replies; 64+ messages in thread
From: Christoph Lameter @ 2008-08-04 13:43 UTC (permalink / raw)
To: Matthew Wilcox
Cc: Pekka Enberg, akpm, linux-kernel, linux-fsdevel, Mel Gorman, andi,
Rik van Riel
Matthew Wilcox wrote:
> On Fri, May 09, 2008 at 07:21:01PM -0700, Christoph Lameter wrote:
>> - Add a patch that obsoletes SLAB and explains why SLOB does not support
>> defrag (Either of those could be theoretically equipped to support
>> slab defrag in some way but it seems that Andrew/Linus want to reduce
>> the number of slab allocators).
>
> Do we have to once again explain that slab still outperforms slub on at
> least one important benchmark? I hope Nick Piggin finds time to finish
> tuning slqb; it already outperforms slub.
>
Uhh. I forgot to delete that statement. I did not include the patch in the series.
We have a fundamental design issue there. Queuing on free can result in
better performance as in SLAB. However, it limits concurrency (per node lock
taking) and causes latency spikes due to queue processing (f.e. one test load
had 118.65 vs. 34 usecs just by switching to SLUB).
Could you address the performance issues in different ways? F.e. try to free
when the object is hot or free from multiple processors? SLAB has to take the
list_lock rather frequently under high concurrent loads (depends on queue
size). That will not occur with SLUB. So you actually can free (and allocate)
concurrently with high performance.
^ permalink raw reply [flat|nested] 64+ messages in thread
* Re: No, really, stop trying to delete slab until you've finished making slub perform as well
2008-08-04 13:43 ` Christoph Lameter
@ 2008-08-04 14:48 ` Jamie Lokier
2008-08-04 15:21 ` Jamie Lokier
2008-08-04 15:11 ` Rik van Riel
2008-08-04 16:47 ` KOSAKI Motohiro
2 siblings, 1 reply; 64+ messages in thread
From: Jamie Lokier @ 2008-08-04 14:48 UTC (permalink / raw)
To: Christoph Lameter
Cc: Matthew Wilcox, Pekka Enberg, akpm, linux-kernel, linux-fsdevel,
Mel Gorman, andi, Rik van Riel
Christoph Lameter wrote:
> Matthew Wilcox wrote:
> > On Fri, May 09, 2008 at 07:21:01PM -0700, Christoph Lameter wrote:
> >> - Add a patch that obsoletes SLAB and explains why SLOB does not support
> >> defrag (Either of those could be theoretically equipped to support
> >> slab defrag in some way but it seems that Andrew/Linus want to reduce
> >> the number of slab allocators).
> >
> > Do we have to once again explain that slab still outperforms slub on at
> > least one important benchmark? I hope Nick Piggin finds time to finish
> > tuning slqb; it already outperforms slub.
> >
>
> Uhh. I forgot to delete that statement. I did not include the patch
> in the series.
>
> We have a fundamental issue design issue there. Queuing on free can result in
> better performance as in SLAB. However, it limits concurrency (per node lock
> taking) and causes latency spikes due to queue processing (f.e. one test load
> had 118.65 vs. 34 usecs just by switching to SLUB).
Vaguely on this topic, has anyone studied the effects of SLAB/SLUB
etc. on MMUless systems?
-- Jamie
^ permalink raw reply [flat|nested] 64+ messages in thread
* Re: No, really, stop trying to delete slab until you've finished making slub perform as well
2008-08-04 13:43 ` Christoph Lameter
2008-08-04 14:48 ` Jamie Lokier
@ 2008-08-04 15:11 ` Rik van Riel
2008-08-04 16:02 ` Christoph Lameter
2008-08-04 16:47 ` KOSAKI Motohiro
2 siblings, 1 reply; 64+ messages in thread
From: Rik van Riel @ 2008-08-04 15:11 UTC (permalink / raw)
To: Christoph Lameter
Cc: Matthew Wilcox, Pekka Enberg, akpm, linux-kernel, linux-fsdevel,
Mel Gorman, andi
On Mon, 04 Aug 2008 08:43:21 -0500
Christoph Lameter <cl@linux-foundation.org> wrote:
> Matthew Wilcox wrote:
> > On Fri, May 09, 2008 at 07:21:01PM -0700, Christoph Lameter wrote:
> >> - Add a patch that obsoletes SLAB and explains why SLOB does not support
> >> defrag (Either of those could be theoretically equipped to support
> >> slab defrag in some way but it seems that Andrew/Linus want to reduce
> >> the number of slab allocators).
> >
> > Do we have to once again explain that slab still outperforms slub on at
> > least one important benchmark? I hope Nick Piggin finds time to finish
> > tuning slqb; it already outperforms slub.
> >
>
> Uhh. I forgot to delete that statement. I did not include the patch in the series.
>
> We have a fundamental design issue there. Queuing on free can result in
> better performance as in SLAB. However, it limits concurrency (per node lock
> taking) and causes latency spikes due to queue processing (f.e. one test load
> had 118.65 vs. 34 usecs just by switching to SLUB).
>
> Could you address the performance issues in different ways? F.e. try to free
> when the object is hot or free from multiple processors? SLAB has to take the
> list_lock rather frequently under high concurrent loads (depends on queue
> size). That will not occur with SLUB. So you actually can free (and allocate)
> concurrently with high performance.
I guess you could bypass the queueing on free for objects that
come from a "local" SLUB page, only queueing objects that go
onto remote pages.
That way workloads that already perform well with SLUB should
keep the current performance, while workloads that currently
perform badly with SLUB should get an improvement.
--
All Rights Reversed
^ permalink raw reply [flat|nested] 64+ messages in thread
* Re: No, really, stop trying to delete slab until you've finished making slub perform as well
2008-08-04 14:48 ` Jamie Lokier
@ 2008-08-04 15:21 ` Jamie Lokier
2008-08-04 16:35 ` Christoph Lameter
0 siblings, 1 reply; 64+ messages in thread
From: Jamie Lokier @ 2008-08-04 15:21 UTC (permalink / raw)
To: Christoph Lameter
Cc: Matthew Wilcox, Pekka Enberg, akpm, linux-kernel, linux-fsdevel,
Mel Gorman, andi, Rik van Riel
Jamie Lokier wrote:
> Vaguely on this topic, has anyone studied the effects of SLAB/SLUB
> etc. on MMUless systems?
The reason is that MMU-less systems are extremely sensitive to
fragmentation. Every program started on those systems must allocate a
large contiguous block for the code and data, and every malloc >1 page
is the same. If memory is too fragmented, starting new programs fails.
The high-order page-allocator defragmentation lately should help with
that.
The different behaviours of SLAB/SLUB might result in different levels
of fragmentation, so I wonder if anyone has compared them on MMU-less
systems or fragmentation-sensitive workloads on general systems.
Thanks,
-- Jamie
^ permalink raw reply [flat|nested] 64+ messages in thread
* Re: No, really, stop trying to delete slab until you've finished making slub perform as well
2008-08-04 15:11 ` Rik van Riel
@ 2008-08-04 16:02 ` Christoph Lameter
0 siblings, 0 replies; 64+ messages in thread
From: Christoph Lameter @ 2008-08-04 16:02 UTC (permalink / raw)
To: Rik van Riel
Cc: Matthew Wilcox, Pekka Enberg, akpm, linux-kernel, linux-fsdevel,
Mel Gorman, andi
Rik van Riel wrote:
> I guess you could bypass the queueing on free for objects that
> come from a "local" SLUB page, only queueing objects that go
> onto remote pages.
Tried that already. The logic to decide if an object is local is creating
significant overhead. Plus you need queues for the remote nodes. Back to alien
queues?
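For reference, the locality test in question boils down to something like the
following sketch (not actual slub code); note that it already implies a page
struct lookup on every free:

/* Sketch only: is the object backed by a page on the local NUMA node? */
static bool object_is_local(void *object)
{
	struct page *page = virt_to_head_page(object);

	return page_to_nid(page) == numa_node_id();
}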
^ permalink raw reply [flat|nested] 64+ messages in thread
* Re: No, really, stop trying to delete slab until you've finished making slub perform as well
2008-08-04 15:21 ` Jamie Lokier
@ 2008-08-04 16:35 ` Christoph Lameter
0 siblings, 0 replies; 64+ messages in thread
From: Christoph Lameter @ 2008-08-04 16:35 UTC (permalink / raw)
To: Jamie Lokier
Cc: Matthew Wilcox, Pekka Enberg, akpm, linux-kernel, linux-fsdevel,
Mel Gorman, andi, Rik van Riel
Jamie Lokier wrote:
> The different behaviours of SLAB/SLUB might result in different levels
> of fragmentation, so I wonder if anyone has compared them on MMU-less
> systems or fragmentation-sensitive workloads on general systems.
Never heard of such a comparison.
MMU-less systems typically have a minimal number of processors. For that
configuration the page orders are roughly equivalent to SLAB's. Larger orders
only come into play with large numbers of processors.
^ permalink raw reply [flat|nested] 64+ messages in thread
* Re: No, really, stop trying to delete slab until you've finished making slub perform as well
2008-08-04 13:43 ` Christoph Lameter
2008-08-04 14:48 ` Jamie Lokier
2008-08-04 15:11 ` Rik van Riel
@ 2008-08-04 16:47 ` KOSAKI Motohiro
2008-08-04 17:13 ` Christoph Lameter
2008-08-04 17:19 ` Christoph Lameter
2 siblings, 2 replies; 64+ messages in thread
From: KOSAKI Motohiro @ 2008-08-04 16:47 UTC (permalink / raw)
To: Christoph Lameter
Cc: Matthew Wilcox, Pekka Enberg, akpm, linux-kernel, linux-fsdevel,
Mel Gorman, andi, Rik van Riel, kosaki.motohiro
Hi
> Could you address the performance issues in different ways? F.e. try to free
> when the object is hot or free from multiple processors? SLAB has to take the
> list_lock rather frequently under high concurrent loads (depends on queue
> size). That will not occur with SLUB. So you actually can free (and allocate)
> concurrently with high performance.
Just some information (off topic?):
When hackbench is running, SLUB consumes much more memory than SLAB.
Then SLAB often outperforms SLUB under memory starvation.
I don't know why the memory consumption differs.
Anyone know it?
^ permalink raw reply [flat|nested] 64+ messages in thread
* Re: No, really, stop trying to delete slab until you've finished making slub perform as well
2008-08-04 16:47 ` KOSAKI Motohiro
@ 2008-08-04 17:13 ` Christoph Lameter
2008-08-04 17:20 ` Pekka Enberg
2008-08-05 12:06 ` KOSAKI Motohiro
2008-08-04 17:19 ` Christoph Lameter
1 sibling, 2 replies; 64+ messages in thread
From: Christoph Lameter @ 2008-08-04 17:13 UTC (permalink / raw)
To: KOSAKI Motohiro
Cc: Matthew Wilcox, Pekka Enberg, akpm, linux-kernel, linux-fsdevel,
Mel Gorman, andi, Rik van Riel, kosaki.motohiro
KOSAKI Motohiro wrote:
> When hackbench is running, SLUB consumes much more memory than SLAB.
> Then SLAB often outperforms SLUB under memory starvation.
>
> I don't know why the memory consumption differs.
> Anyone know it?
Can you quantify the difference?
SLAB buffers objects in its queues. SLUB does rely more on the page allocator.
So SLAB may have its own reserves to fall back on.
^ permalink raw reply [flat|nested] 64+ messages in thread
* Re: No, really, stop trying to delete slab until you've finished making slub perform as well
2008-08-04 16:47 ` KOSAKI Motohiro
2008-08-04 17:13 ` Christoph Lameter
@ 2008-08-04 17:19 ` Christoph Lameter
1 sibling, 0 replies; 64+ messages in thread
From: Christoph Lameter @ 2008-08-04 17:19 UTC (permalink / raw)
To: KOSAKI Motohiro
Cc: Matthew Wilcox, Pekka Enberg, akpm, linux-kernel, linux-fsdevel,
Mel Gorman, andi, Rik van Riel, kosaki.motohiro
KOSAKI Motohiro wrote:
>
> When hackbench is running, SLUB consumes much more memory than SLAB.
> Then SLAB often outperforms SLUB under memory starvation.
Re memory use: If SLUB finds that there is lock contention on a slab page then
it will allocate a new one and dedicate it to a cpu in order to avoid future
contentions.
^ permalink raw reply [flat|nested] 64+ messages in thread
* Re: No, really, stop trying to delete slab until you've finished making slub perform as well
2008-08-04 17:13 ` Christoph Lameter
@ 2008-08-04 17:20 ` Pekka Enberg
2008-08-05 12:06 ` KOSAKI Motohiro
1 sibling, 0 replies; 64+ messages in thread
From: Pekka Enberg @ 2008-08-04 17:20 UTC (permalink / raw)
To: Christoph Lameter
Cc: KOSAKI Motohiro, Matthew Wilcox, akpm, linux-kernel,
linux-fsdevel, Mel Gorman, andi, Rik van Riel, kosaki.motohiro
On Mon, Aug 4, 2008 at 8:13 PM, Christoph Lameter
<cl@linux-foundation.org> wrote:
> KOSAKI Motohiro wrote:
>
>> When hackbench running, SLUB consume memory very largely than SLAB.
>> then, SLAB often outperform SLUB in memory stavation state.
>>
>> I don't know why memory comsumption different.
>> Anyone know it?
>
> Can you quantify the difference?
>
> SLAB buffers objects in its queues. SLUB does rely more on the page allocator.
> So SLAB may have its own reserves to fall back on.
Also, what kind of machine are we talking about here? If there are a
lot of CPUs, SLUB will allocate higher order pages more aggressively
than SLAB by default.
^ permalink raw reply [flat|nested] 64+ messages in thread
* Re: No, really, stop trying to delete slab until you've finished making slub perform as well
2008-08-04 2:37 ` Rene Herman
@ 2008-08-04 21:22 ` Pekka Enberg
2008-08-04 21:41 ` Christoph Lameter
0 siblings, 1 reply; 64+ messages in thread
From: Pekka Enberg @ 2008-08-04 21:22 UTC (permalink / raw)
To: Rene Herman
Cc: Matthew Wilcox, Christoph Lameter, akpm, linux-kernel,
linux-fsdevel, Mel Gorman, andi, Rik van Riel
Rene Herman wrote:
> On 03-08-08 23:25, Pekka Enberg wrote:
>
>> Matthew Wilcox wrote:
>
>>> Do we have to once again explain that slab still outperforms slub on at
>>> least one important benchmark? I hope Nick Piggin finds time to finish
>>> tuning slqb; it already outperforms slub.
>>
>> No, you don't have to. I haven't merged that patch nor do I intend to
>> do so until the regressions are fixed.
>>
>> And yes, I'm still waiting to hear from you how we're now doing with
>> higher order page allocations...
>
> General interested question -- I recently "accidentally" read some of
> slub and I believe that it doesn't feature the cache colouring support
> that slab did? Is that true, and if so, wasn't it needed/useful?
I don't know why Christoph decided not to implement it. Christoph?
^ permalink raw reply [flat|nested] 64+ messages in thread
* Re: No, really, stop trying to delete slab until you've finished making slub perform as well
2008-08-04 21:22 ` Pekka Enberg
@ 2008-08-04 21:41 ` Christoph Lameter
2008-08-04 23:09 ` Rene Herman
0 siblings, 1 reply; 64+ messages in thread
From: Christoph Lameter @ 2008-08-04 21:41 UTC (permalink / raw)
To: Pekka Enberg
Cc: Rene Herman, Matthew Wilcox, akpm, linux-kernel, linux-fsdevel,
Mel Gorman, andi, Rik van Riel
Pekka Enberg wrote:
>> General interested question -- I recently "accidentally" read some of
>> slub and I believe that it doesn't feature the cache colouring support
>> that slab did? Is that true, and if so, wasn't it needed/useful?
>
> I don't know why Christoph decided not to implement it. Christoph?
IMHO cache coloring issues seem to be mostly taken care of by newer more
associative cpu caching designs.
Note that the SLAB design origin is Solaris (See the paper by Jeff Bonwick in
1994 that is quoted in mm/slab.c). Logic for cache coloring is mostly avoided
today due to the complexity it would introduce. See also
http://en.wikipedia.org/wiki/CPU_cache.
What one could add to support cache coloring in SLUB is a prearrangement of
the object allocation order by constructing the initial freelist for
a page in a certain way. See mm/slub.c::new_slab().
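A rough sketch of that idea (assuming a hypothetical per-cache colour counter; SLUB has no such field and new_slab() does not do this today, and set_freepointer() is the usual SLUB freelist helper):

/*
 * Sketch only: build the initial freelist of a new slab starting at a
 * rotating "colour" index, so consecutive slabs hand out their first
 * objects at different offsets. 's->colour_next' is hypothetical.
 */
static void sketch_init_coloured_freelist(struct kmem_cache *s,
					  struct page *page, void *addr,
					  unsigned int objects)
{
	unsigned int start = s->colour_next % objects;	/* hypothetical field */
	unsigned int i;
	void *last = NULL;

	s->colour_next++;

	for (i = 0; i < objects; i++) {
		void *p = addr + ((start + i) % objects) * s->size;

		if (last)
			set_freepointer(s, last, p);	/* chain previous -> p */
		else
			page->freelist = p;		/* head of the freelist */
		last = p;
	}
	set_freepointer(s, last, NULL);			/* terminate the list */
}

Rotating the start object per slab spreads the first allocations of successive slabs over different cache lines, a cheap approximation of SLAB's colour_off logic that stays out of the fast paths.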
^ permalink raw reply [flat|nested] 64+ messages in thread
* Re: No, really, stop trying to delete slab until you've finished making slub perform as well
2008-08-04 21:41 ` Christoph Lameter
@ 2008-08-04 23:09 ` Rene Herman
0 siblings, 0 replies; 64+ messages in thread
From: Rene Herman @ 2008-08-04 23:09 UTC (permalink / raw)
To: Christoph Lameter
Cc: Pekka Enberg, Matthew Wilcox, akpm, linux-kernel, linux-fsdevel,
Mel Gorman, andi, Rik van Riel
On 04-08-08 23:41, Christoph Lameter wrote:
>>> General interested question -- I recently "accidentally" read some of
>>> slub and I believe that it doesn't feature the cache colouring support
>>> that slab did? Is that true, and if so, wasn't it needed/useful?
>> I don't know why Christoph decided not to implement it. Christoph?
>
> IMHO cache coloring issues seem to be mostly taken care of by newer more
> associative cpu caching designs.
I see. Just gathered a bit of data on this (from sandpile.org):
32-byte lines:
P54 : L1 I 8K, 2-Way
D 8K, 2-Way
L2 External
P55 : L1 I 16K, 4-Way
D 16K, 4-Way
L2 External
P2 : L1 I 16K 4-Way
D 16K 4-Way
L2 128K to 2MB 4-Way
P3 : L1 I 16K 4-Way
D 16K 4-Way
L2 128K to 2MB 4-Way or
256K to 2MB 8-Way
64-byte lines:
P4 : L1 I 12K uOP Trace (8-Way, 6 uOP line)
D 8K 4-Way or
16K 8-Way
L2 128K 2-Way or
128K, 256K 4-Way or
512K, 1M, 2M 8-Way
L3 512K 4-Way or
1M to 8M 8-Way or
2M to 16M 16-Way
Core: L1 I 32K 8-Way
D 32K 8-Way
L2 512K 2-Way or
1M 4-Way or
2M 8-Way or
3M 12-Way or
4M 16-Way
K7 : L1 I 64K 2-Way
D 64K 2-Way
L2 512, 1M, 2M 2-Way or
4M, 8M 1-Way or
64K, 256K, 512K 16-Way
K8 : L1 I 64K 2-Way
D 64K 2-Way
L2 128K to 1M 16-Way
The L1 on K7 and K8 especially still seems a bit of a worry here.
> Note that the SLAB design origin is Solaris (See the paper by Jeff Bonwick in
> 1994 that is quoted in mm/slab.c). Logic for cache coloring is mostly avoided
> today due to the complexity it would introduce. See also
> http://en.wikipedia.org/wiki/CPU_cache.
>
> What one could add to support cache coloring in SLUB is a prearrangement of
> the object allocation order by constructing the initial freelist for
> a page in a certain way. See mm/slub.c::new_slab().
<remains silent>
To me, colouring always seemed like a fairly promising thing but I won't
pretend to have any sort of data.
Rene.
^ permalink raw reply [flat|nested] 64+ messages in thread
* Re: No, really, stop trying to delete slab until you've finished making slub perform as well
2008-08-04 17:13 ` Christoph Lameter
2008-08-04 17:20 ` Pekka Enberg
@ 2008-08-05 12:06 ` KOSAKI Motohiro
2008-08-05 14:59 ` Christoph Lameter
1 sibling, 1 reply; 64+ messages in thread
From: KOSAKI Motohiro @ 2008-08-05 12:06 UTC (permalink / raw)
To: Christoph Lameter
Cc: kosaki.motohiro, KOSAKI Motohiro, Matthew Wilcox, Pekka Enberg,
akpm, linux-kernel, linux-fsdevel, Mel Gorman, andi, Rik van Riel
> KOSAKI Motohiro wrote:
>
> > When hackbench is running, SLUB consumes much more memory than SLAB.
> > SLAB then often outperforms SLUB under memory starvation.
> >
> > I don't know why the memory consumption differs.
> > Does anyone know?
>
> Can you quantify the difference?
machine spec:
CPU: IA64 x 8
MEM: 8G (4G x2node)
test method
1. echo 3 >/proc/sys/vm/drop_caches
2. % ./hackbench 90 process 1000 <- for fill pagetable cache
3. % ./hackbench 90 process 1000
vmstat result
<SLAB (without CONFIG_DEBUG_SLAB)>
procs -----------memory---------- ---swap-- -----io---- --system-- -----cpu------
r b swpd free buff cache si so bi bo in cs us sy id wa st
2 0 0 3223168 6016 38336 0 0 0 0 3181 4314 0 15 85 0 0
2039 2 0 2022144 6016 38336 0 0 0 0 2364 13622 0 49 51 0 0
634 0 0 2629824 6080 38336 0 0 0 64 83582 2538927 5 95 0 0 0
596 0 0 2842624 6080 38336 0 0 0 0 6864 675841 6 94 0 0 0
590 0 0 2993472 6080 38336 0 0 0 0 9514 456085 6 94 0 0 0
503 0 0 3138560 6080 38336 0 0 0 0 8042 276024 4 96 0 0 0
about 3G remain.
<SLUB>
procs -----------memory---------- ---swap-- -----io---- --system-- -----cpu------
r b swpd free buff cache si so bi bo in cs us sy id wa st
1066 0 0 323008 3584 18240 0 0 0 0 12037 47353 1 99 0 0 0
1101 0 0 324672 3584 18240 0 0 0 0 6029 25100 1 99 0 0 0
913 0 0 330240 3584 18240 0 0 0 0 9694 54951 2 98 0 0 0
about 300M remain.
So, about 2.5G - 3G difference in 8G mem.
^ permalink raw reply [flat|nested] 64+ messages in thread
* Re: No, really, stop trying to delete slab until you've finished making slub perform as well
2008-08-05 12:06 ` KOSAKI Motohiro
@ 2008-08-05 14:59 ` Christoph Lameter
2008-08-06 12:36 ` KOSAKI Motohiro
2008-08-13 10:46 ` KOSAKI Motohiro
0 siblings, 2 replies; 64+ messages in thread
From: Christoph Lameter @ 2008-08-05 14:59 UTC (permalink / raw)
To: KOSAKI Motohiro
Cc: KOSAKI Motohiro, Matthew Wilcox, Pekka Enberg, akpm, linux-kernel,
linux-fsdevel, Mel Gorman, andi, Rik van Riel
KOSAKI Motohiro wrote:
>> Can you quantify the difference?
>
> machine spec:
> CPU: IA64 x 8
> MEM: 8G (4G x2node)
16k or 64k page size?
> procs -----------memory---------- ---swap-- -----io---- --system-- -----cpu------
> r b swpd free buff cache si so bi bo in cs us sy id wa st
> 2 0 0 3223168 6016 38336 0 0 0 0 3181 4314 0 15 85 0 0
> 2039 2 0 2022144 6016 38336 0 0 0 0 2364 13622 0 49 51 0 0
> 634 0 0 2629824 6080 38336 0 0 0 64 83582 2538927 5 95 0 0 0
> 596 0 0 2842624 6080 38336 0 0 0 0 6864 675841 6 94 0 0 0
> 590 0 0 2993472 6080 38336 0 0 0 0 9514 456085 6 94 0 0 0
> 503 0 0 3138560 6080 38336 0 0 0 0 8042 276024 4 96 0 0 0
>
> about 3G remain.
>
> <SLUB>
> procs -----------memory---------- ---swap-- -----io---- --system-- -----cpu------
> r b swpd free buff cache si so bi bo in cs us sy id wa st
> 1066 0 0 323008 3584 18240 0 0 0 0 12037 47353 1 99 0 0 0
> 1101 0 0 324672 3584 18240 0 0 0 0 6029 25100 1 99 0 0 0
> 913 0 0 330240 3584 18240 0 0 0 0 9694 54951 2 98 0 0 0
>
> about 300M remain.
>
>
> So, about 2.5G - 3G difference in 8G mem.
Well not sure if that tells us much. Please show us the output of
/proc/meminfo after each run. The slab counters indicate how much memory is
used by the slabs.
It would also be interesting to see the output of the slabinfo command after
the slub run?
^ permalink raw reply [flat|nested] 64+ messages in thread
* Re: No, really, stop trying to delete slab until you've finished making slub perform as well
2008-08-05 14:59 ` Christoph Lameter
@ 2008-08-06 12:36 ` KOSAKI Motohiro
2008-08-06 14:24 ` Christoph Lameter
2008-08-13 10:46 ` KOSAKI Motohiro
1 sibling, 1 reply; 64+ messages in thread
From: KOSAKI Motohiro @ 2008-08-06 12:36 UTC (permalink / raw)
To: Christoph Lameter
Cc: Matthew Wilcox, Pekka Enberg, akpm, linux-kernel, linux-fsdevel,
Mel Gorman, andi, Rik van Riel
>>> Can you quantify the difference?
>>
>> machine spec:
>> CPU: IA64 x 8
>> MEM: 8G (4G x2node)
>
> 16k or 64k page size?
64k.
>> So, about 2.5G - 3G difference in 8G mem.
>
> Well not sure if that tells us much. Please show us the output of
> /proc/meminfo after each run. The slab counters indicate how much memory is
> used by the slabs.
>
> It would also be interesting to see the output of the slabinfo command after
> the slub run?
ok.
but I can't do that this week,
so I'll do it next week.
Honestly, I don't know how to use the slabinfo command :-)
^ permalink raw reply [flat|nested] 64+ messages in thread
* Re: No, really, stop trying to delete slab until you've finished making slub perform as well
2008-08-06 12:36 ` KOSAKI Motohiro
@ 2008-08-06 14:24 ` Christoph Lameter
0 siblings, 0 replies; 64+ messages in thread
From: Christoph Lameter @ 2008-08-06 14:24 UTC (permalink / raw)
To: KOSAKI Motohiro
Cc: Matthew Wilcox, Pekka Enberg, akpm, linux-kernel, linux-fsdevel,
Mel Gorman, andi, Rik van Riel
KOSAKI Motohiro wrote:
>>>> Can you quantify the difference?
>>> machine spec:
>>> CPU: IA64 x 8
>>> MEM: 8G (4G x2node)
>> 16k or 64k page size?
>
> 64k.
>
>
>>> So, about 2.5G - 3G difference in 8G mem.
>> Well not sure if that tells us much. Please show us the output of
>> /proc/meminfo after each run. The slab counters indicate how much memory is
>> used by the slabs.
>>
>> It would also be interesting to see the output of the slabinfo command after
>> the slub run?
>
> ok.
> but I can't do that this week,
> so I'll do it next week.
>
> Honestly, I don't know how to use the slabinfo command :-)
It's in linux/Documentation/vm/slabinfo.c
Do
gcc -o slabinfo Documentation/vm/slabinfo.c
./slabinfo
(./slabinfo -h if you are curious and want to use more advanced options)
^ permalink raw reply [flat|nested] 64+ messages in thread
* [patch 11/19] inodes: Support generic defragmentation
2008-08-11 15:06 [patch 00/19] Slab Fragmentation Reduction V14 Christoph Lameter
@ 2008-08-11 15:06 ` Christoph Lameter
0 siblings, 0 replies; 64+ messages in thread
From: Christoph Lameter @ 2008-08-11 15:06 UTC (permalink / raw)
To: Pekka Enberg
Cc: akpm, Alexander Viro, Christoph Hellwig, Christoph Lameter,
Christoph Lameter, linux-kernel, linux-fsdevel, Mel Gorman, andi,
Rik van Riel, mpm, Dave Chinner
[-- Attachment #1: 0025-inodes-Support-generic-defragmentation.patch --]
[-- Type: text/plain, Size: 5241 bytes --]
This implements the ability to remove inodes in a particular slab
from inode caches. In order to remove an inode we may have to write out
the pages of the inode, write out the inode itself, and remove the dentries
referring to the inode.
Provide generic functionality that can be used by filesystems that have
their own inode caches to also tie into the defragmentation functions
that are made available here.
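As an illustration of that interface, a filesystem that embeds struct inode in its own inode would hook in roughly as below (a sketch; 'myfs_inode_info', 'myfs_inode_cachep' and 'myfs_enable_defrag' are placeholder names, and kmem_cache_setup_defrag() comes from the earlier patches in this series):

/* Sketch: wiring a filesystem's own inode cache into slab defrag. */
static void *myfs_get_inodes(struct kmem_cache *s, int nr, void **v)
{
	/* Shift the pointers from the fs inode to the embedded struct
	 * inode, then take references via get_inodes(). */
	return fs_get_inodes(s, nr, v,
			     offsetof(struct myfs_inode_info, vfs_inode));
}

static void myfs_enable_defrag(void)
{
	/* Call this after kmem_cache_create() of myfs_inode_cachep. */
	kmem_cache_setup_defrag(myfs_inode_cachep,
				myfs_get_inodes, kick_inodes);
}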
Cc: Alexander Viro <viro@ftp.linux.org.uk>
Cc: Christoph Hellwig <hch@infradead.org>
Reviewed-by: Rik van Riel <riel@redhat.com>
Signed-off-by: Christoph Lameter <clameter@sgi.com>
Signed-off-by: Christoph Lameter <cl@linux-foundation.org>
---
fs/inode.c | 123 +++++++++++++++++++++++++++++++++++++++++++++++++++++
include/linux/fs.h | 6 ++
2 files changed, 129 insertions(+)
Index: linux-next/fs/inode.c
===================================================================
--- linux-next.orig/fs/inode.c 2008-08-11 07:42:10.738607937 -0700
+++ linux-next/fs/inode.c 2008-08-11 07:47:04.342348902 -0700
@@ -1363,6 +1363,128 @@ static int __init set_ihash_entries(char
__setup("ihash_entries=", set_ihash_entries);
/*
+ * Obtain a refcount on a list of struct inodes pointed to by v. If the
+ * inode is in the process of being freed then zap the v[] entry so that
+ * we skip the freeing attempts later.
+ *
+ * This is a generic function for the ->get slab defrag callback.
+ */
+void *get_inodes(struct kmem_cache *s, int nr, void **v)
+{
+ int i;
+
+ spin_lock(&inode_lock);
+ for (i = 0; i < nr; i++) {
+ struct inode *inode = v[i];
+
+ if (inode->i_state & (I_FREEING|I_CLEAR|I_WILL_FREE))
+ v[i] = NULL;
+ else
+ __iget(inode);
+ }
+ spin_unlock(&inode_lock);
+ return NULL;
+}
+EXPORT_SYMBOL(get_inodes);
+
+/*
+ * Function for filesystems that embed struct inode into their own
+ * fs inode. The offset is the offset of the struct inode in the fs inode.
+ *
+ * The function adds to the pointers in v[] in order to make them point to
+ * struct inode. Then get_inodes() is used to get the refcount.
+ * The converted v[] pointers can then also be passed to the kick() callback
+ * without further processing.
+ */
+void *fs_get_inodes(struct kmem_cache *s, int nr, void **v,
+ unsigned long offset)
+{
+ int i;
+
+ for (i = 0; i < nr; i++)
+ v[i] += offset;
+
+ return get_inodes(s, nr, v);
+}
+EXPORT_SYMBOL(fs_get_inodes);
+
+/*
+ * Generic callback function for slab defrag ->kick methods. Takes the
+ * array with inodes where we obtained refcounts using fs_get_inodes()
+ * or get_inodes() and tries to free them.
+ */
+void kick_inodes(struct kmem_cache *s, int nr, void **v, void *private)
+{
+ struct inode *inode;
+ int i;
+ int abort = 0;
+ LIST_HEAD(freeable);
+ int active;
+
+ for (i = 0; i < nr; i++) {
+ inode = v[i];
+ if (!inode)
+ continue;
+
+ if (inode_has_buffers(inode) || inode->i_data.nrpages) {
+ if (remove_inode_buffers(inode))
+ /*
+ * Should we really be doing this? Or
+ * limit the writeback here to only a few pages?
+ *
+ * Possibly an expensive operation but we
+ * cannot reclaim the inode if the pages
+ * are still present.
+ */
+ invalidate_mapping_pages(&inode->i_data,
+ 0, -1);
+ }
+
+ /* Invalidate children and dentry */
+ if (S_ISDIR(inode->i_mode)) {
+ struct dentry *d = d_find_alias(inode);
+
+ if (d) {
+ d_invalidate(d);
+ dput(d);
+ }
+ }
+
+ if (inode->i_state & I_DIRTY)
+ write_inode_now(inode, 1);
+
+ d_prune_aliases(inode);
+ }
+
+ mutex_lock(&iprune_mutex);
+ for (i = 0; i < nr; i++) {
+ inode = v[i];
+
+ if (!inode)
+ /* inode is already being freed */
+ continue;
+
+ active = inode->i_sb->s_flags & MS_ACTIVE;
+ iput(inode);
+ if (abort || !active)
+ continue;
+
+ spin_lock(&inode_lock);
+ abort = !can_unuse(inode);
+
+ if (!abort) {
+ list_move(&inode->i_list, &freeable);
+ inode->i_state |= I_FREEING;
+ inodes_stat.nr_unused--;
+ }
+ spin_unlock(&inode_lock);
+ }
+ dispose_list(&freeable);
+ mutex_unlock(&iprune_mutex);
+}
+EXPORT_SYMBOL(kick_inodes);
+
+/*
* Initialize the waitqueues and inode hash table.
*/
void __init inode_init_early(void)
@@ -1401,6 +1523,7 @@ void __init inode_init(void)
SLAB_MEM_SPREAD),
init_once);
register_shrinker(&icache_shrinker);
+ kmem_cache_setup_defrag(inode_cachep, get_inodes, kick_inodes);
/* Hash may have been set up in inode_init_early */
if (!hashdist)
Index: linux-next/include/linux/fs.h
===================================================================
--- linux-next.orig/include/linux/fs.h 2008-08-11 07:42:30.598607988 -0700
+++ linux-next/include/linux/fs.h 2008-08-11 07:47:05.012377598 -0700
@@ -1846,6 +1846,12 @@ static inline void insert_inode_hash(str
__insert_inode_hash(inode, inode->i_ino);
}
+/* Helper functions for inode defragmentation support in filesystems */
+extern void kick_inodes(struct kmem_cache *, int, void **, void *);
+extern void *get_inodes(struct kmem_cache *, int nr, void **);
+extern void *fs_get_inodes(struct kmem_cache *, int nr, void **,
+ unsigned long offset);
+
extern struct file * get_empty_filp(void);
extern void file_move(struct file *f, struct list_head *list);
extern void file_kill(struct file *f);
--
^ permalink raw reply [flat|nested] 64+ messages in thread
* Re: [patch 14/19] Filesystem: Ext4 filesystem defrag
2008-08-03 1:54 ` Theodore Tso
@ 2008-08-13 7:26 ` Pekka Enberg
0 siblings, 0 replies; 64+ messages in thread
From: Pekka Enberg @ 2008-08-13 7:26 UTC (permalink / raw)
To: Theodore Tso, Christoph Lameter, Pekka Enberg, akpm,
Christoph Lameter, lin
Theodore Tso wrote:
> On Fri, May 09, 2008 at 07:21:15PM -0700, Christoph Lameter wrote:
>> Support defragmentation for extX filesystem inodes
>
> You forgot to change "extX" to "ext4". :-)
Fixed that up now.
>> Reviewed-by: Rik van Riel <riel@redhat.com>
>> Signed-off-by: Christoph Lameter <clameter@sgi.com>
>> Signed-off-by: Christoph Lameter <cl@linux-foundation.org>
> Acked-by: "Theodore Ts'o" <tytso@mit.edu>
Thanks!
^ permalink raw reply [flat|nested] 64+ messages in thread
* Re: No, really, stop trying to delete slab until you've finished making slub perform as well
2008-08-05 14:59 ` Christoph Lameter
2008-08-06 12:36 ` KOSAKI Motohiro
@ 2008-08-13 10:46 ` KOSAKI Motohiro
2008-08-13 13:10 ` Christoph Lameter
1 sibling, 1 reply; 64+ messages in thread
From: KOSAKI Motohiro @ 2008-08-13 10:46 UTC (permalink / raw)
To: Christoph Lameter
Cc: kosaki.motohiro, KOSAKI Motohiro, Matthew Wilcox, Pekka Enberg,
akpm, linux-kernel, linux-fsdevel, Mel Gorman, andi, Rik van Riel
> Well not sure if that tells us much. Please show us the output of
> /proc/meminfo after each run. The slab counters indicate how much memory is
> used by the slabs.
>
> It would also be interesting to see the output of the slabinfo command after
> the slub run?
Sorry for the late response.
SLAB uses 123M vs. SLUB's 1.5G.
Thoughts?
<slab>
% cat /proc/meminfo
MemTotal: 7701760 kB
MemFree: 5940096 kB
Buffers: 6400 kB
Cached: 27712 kB
SwapCached: 52544 kB
Active: 51520 kB
Inactive: 53248 kB
Active(anon): 26752 kB
Inactive(anon): 41792 kB
Active(file): 24768 kB
Inactive(file): 11456 kB
Unevictable: 0 kB
Mlocked: 0 kB
SwapTotal: 2031488 kB
SwapFree: 1958400 kB
Dirty: 192 kB
Writeback: 0 kB
AnonPages: 38400 kB
Mapped: 23232 kB
Slab: 123840 kB
SReclaimable: 30272 kB
SUnreclaim: 93568 kB
PageTables: 10688 kB
NFS_Unstable: 0 kB
Bounce: 0 kB
WritebackTmp: 0 kB
CommitLimit: 5882368 kB
Committed_AS: 397568 kB
VmallocTotal: 17592177655808 kB
VmallocUsed: 29184 kB
VmallocChunk: 17592177626240 kB
HugePages_Total: 0
HugePages_Free: 0
HugePages_Rsvd: 0
HugePages_Surp: 0
Hugepagesize: 262144 kB
% cat /proc/slabinfo
slabinfo - version: 2.1
# name <active_objs> <num_objs> <objsize> <objperslab> <pagesperslab> : tunables <limit> <batchcount> <sharedfactor> : slabdata <active_slabs> <num_slabs> <sharedavail>
dm_mpath_io 0 0 40 1488 1 : tunables 120 60 8 : slabdata 0 0 0
dm_snap_tracked_chunk 0 0 24 2338 1 : tunables 120 60 8 : slabdata 0 0 0
dm_snap_pending_exception 0 0 112 564 1 : tunables 120 60 8 : slabdata 0 0 0
dm_snap_exception 0 0 32 1818 1 : tunables 120 60 8 : slabdata 0 0 0
kcopyd_job 0 0 408 158 1 : tunables 54 27 8 : slabdata 0 0 0
dm_target_io 515 2338 24 2338 1 : tunables 120 60 8 : slabdata 1 1 0
dm_io 515 1818 32 1818 1 : tunables 120 60 8 : slabdata 1 1 0
scsi_sense_cache 26 496 128 496 1 : tunables 120 60 8 : slabdata 1 1 0
scsi_cmd_cache 26 168 384 168 1 : tunables 54 27 8 : slabdata 1 1 0
uhci_urb_priv 0 0 56 1091 1 : tunables 120 60 8 : slabdata 0 0 0
flow_cache 0 0 96 654 1 : tunables 120 60 8 : slabdata 0 0 0
cfq_io_context 48 760 168 380 1 : tunables 120 60 8 : slabdata 2 2 0
cfq_queue 41 934 136 467 1 : tunables 120 60 8 : slabdata 2 2 0
mqueue_inode_cache 1 56 1152 56 1 : tunables 24 12 8 : slabdata 1 1 0
fat_inode_cache 1 77 840 77 1 : tunables 54 27 8 : slabdata 1 1 0
fat_cache 0 0 32 1818 1 : tunables 120 60 8 : slabdata 0 0 0
hugetlbfs_inode_cache 1 83 776 83 1 : tunables 54 27 8 : slabdata 1 1 0
ext2_inode_cache 0 0 1024 63 1 : tunables 54 27 8 : slabdata 0 0 0
ext2_xattr 0 0 88 711 1 : tunables 120 60 8 : slabdata 0 0 0
jbd2_journal_handle 0 0 24 2338 1 : tunables 120 60 8 : slabdata 0 0 0
jbd2_journal_head 0 0 96 654 1 : tunables 120 60 8 : slabdata 0 0 0
jbd2_revoke_table 0 0 16 3274 1 : tunables 120 60 8 : slabdata 0 0 0
jbd2_revoke_record 0 0 32 1818 1 : tunables 120 60 8 : slabdata 0 0 0
journal_handle 48 4676 24 2338 1 : tunables 120 60 8 : slabdata 2 2 0
journal_head 41 1308 96 654 1 : tunables 120 60 8 : slabdata 2 2 0
revoke_table 4 3274 16 3274 1 : tunables 120 60 8 : slabdata 1 1 0
revoke_record 0 0 32 1818 1 : tunables 120 60 8 : slabdata 0 0 0
ext4_inode_cache 0 0 1192 54 1 : tunables 24 12 8 : slabdata 0 0 0
ext4_xattr 0 0 88 711 1 : tunables 120 60 8 : slabdata 0 0 0
ext4_alloc_context 0 0 168 380 1 : tunables 120 60 8 : slabdata 0 0 0
ext4_prealloc_space 0 0 120 528 1 : tunables 120 60 8 : slabdata 0 0 0
ext3_inode_cache 367 5696 1016 64 1 : tunables 54 27 8 : slabdata 89 89 0
ext3_xattr 99 1422 88 711 1 : tunables 120 60 8 : slabdata 2 2 0
dnotify_cache 1 1488 40 1488 1 : tunables 120 60 8 : slabdata 1 1 0
kioctx 0 0 384 168 1 : tunables 54 27 8 : slabdata 0 0 0
kiocb 0 0 256 251 1 : tunables 120 60 8 : slabdata 0 0 0
inotify_event_cache 0 0 40 1488 1 : tunables 120 60 8 : slabdata 0 0 0
inotify_watch_cache 1 861 72 861 1 : tunables 120 60 8 : slabdata 1 1 0
fasync_cache 0 0 24 2338 1 : tunables 120 60 8 : slabdata 0 0 0
shmem_inode_cache 864 1105 1000 65 1 : tunables 54 27 8 : slabdata 17 17 0
pid_namespace 0 0 184 348 1 : tunables 120 60 8 : slabdata 0 0 0
nsproxy 0 0 56 1091 1 : tunables 120 60 8 : slabdata 0 0 0
posix_timers_cache 0 0 184 348 1 : tunables 120 60 8 : slabdata 0 0 0
uid_cache 6 502 256 251 1 : tunables 120 60 8 : slabdata 2 2 0
ia64_partial_page_cache 0 0 48 1259 1 : tunables 120 60 8 : slabdata 0 0 0
UNIX 32 126 1024 63 1 : tunables 54 27 8 : slabdata 2 2 0
UDP-Lite 0 0 1024 63 1 : tunables 54 27 8 : slabdata 0 0 0
tcp_bind_bucket 4 1924 64 962 1 : tunables 120 60 8 : slabdata 2 2 0
inet_peer_cache 0 0 64 962 1 : tunables 120 60 8 : slabdata 0 0 0
secpath_cache 0 0 64 962 1 : tunables 120 60 8 : slabdata 0 0 0
xfrm_dst_cache 0 0 384 168 1 : tunables 54 27 8 : slabdata 0 0 0
ip_fib_alias 3 1818 32 1818 1 : tunables 120 60 8 : slabdata 1 1 0
ip_fib_hash 15 1722 72 861 1 : tunables 120 60 8 : slabdata 2 2 0
ip_dst_cache 50 336 384 168 1 : tunables 54 27 8 : slabdata 2 2 0
arp_cache 1 251 256 251 1 : tunables 120 60 8 : slabdata 1 1 0
RAW 129 216 896 72 1 : tunables 54 27 8 : slabdata 3 3 0
UDP 9 126 1024 63 1 : tunables 54 27 8 : slabdata 2 2 0
tw_sock_TCP 0 0 256 251 1 : tunables 120 60 8 : slabdata 0 0 0
request_sock_TCP 0 0 128 496 1 : tunables 120 60 8 : slabdata 0 0 0
TCP 5 72 1792 36 1 : tunables 24 12 8 : slabdata 2 2 0
eventpoll_pwq 0 0 72 861 1 : tunables 120 60 8 : slabdata 0 0 0
eventpoll_epi 0 0 128 496 1 : tunables 120 60 8 : slabdata 0 0 0
sgpool-128 2 30 4096 15 1 : tunables 24 12 8 : slabdata 2 2 0
sgpool-64 2 62 2048 31 1 : tunables 24 12 8 : slabdata 2 2 0
sgpool-32 2 126 1024 63 1 : tunables 54 27 8 : slabdata 2 2 0
sgpool-16 2 252 512 126 1 : tunables 54 27 8 : slabdata 2 2 0
sgpool-8 18 502 256 251 1 : tunables 120 60 8 : slabdata 2 2 0
scsi_data_buffer 0 0 24 2338 1 : tunables 120 60 8 : slabdata 0 0 0
scsi_io_context 0 0 112 564 1 : tunables 120 60 8 : slabdata 0 0 0
blkdev_queue 26 70 1864 35 1 : tunables 24 12 8 : slabdata 2 2 0
blkdev_requests 44 212 304 212 1 : tunables 54 27 8 : slabdata 1 1 0
blkdev_ioc 38 1308 96 654 1 : tunables 120 60 8 : slabdata 2 2 0
biovec-256 34 60 4096 15 1 : tunables 24 12 8 : slabdata 4 4 0
biovec-128 34 93 2048 31 1 : tunables 24 12 8 : slabdata 3 3 0
biovec-64 34 126 1024 63 1 : tunables 54 27 8 : slabdata 2 2 0
biovec-16 34 502 256 251 1 : tunables 120 60 8 : slabdata 2 2 0
biovec-4 34 1924 64 962 1 : tunables 120 60 8 : slabdata 2 2 0
biovec-1 37 6548 16 3274 1 : tunables 120 60 8 : slabdata 2 2 0
bio 37 992 128 496 1 : tunables 120 60 8 : slabdata 2 2 0
sock_inode_cache 188 288 896 72 1 : tunables 54 27 8 : slabdata 4 4 0
skbuff_fclone_cache 16 126 512 126 1 : tunables 54 27 8 : slabdata 1 1 0
skbuff_head_cache 1812 11546 256 251 1 : tunables 120 60 8 : slabdata 46 46 0
file_lock_cache 4 668 192 334 1 : tunables 120 60 8 : slabdata 2 2 0
Acpi-Operand 24947 26691 72 861 1 : tunables 120 60 8 : slabdata 31 31 0
Acpi-ParseExt 0 0 72 861 1 : tunables 120 60 8 : slabdata 0 0 0
Acpi-Parse 0 0 48 1259 1 : tunables 120 60 8 : slabdata 0 0 0
Acpi-State 0 0 80 779 1 : tunables 120 60 8 : slabdata 0 0 0
Acpi-Namespace 18877 21816 32 1818 1 : tunables 120 60 8 : slabdata 12 12 0
page_cgroup 1183 142848 40 1488 1 : tunables 120 60 8 : slabdata 96 96 0
proc_inode_cache 197 902 792 82 1 : tunables 54 27 8 : slabdata 11 11 0
sigqueue 0 0 160 399 1 : tunables 120 60 8 : slabdata 0 0 0
radix_tree_node 719 7254 552 117 1 : tunables 54 27 8 : slabdata 62 62 0
bdev_cache 30 126 1024 63 1 : tunables 54 27 8 : slabdata 2 2 0
sysfs_dir_cache 11089 12464 80 779 1 : tunables 120 60 8 : slabdata 16 16 0
mnt_cache 24 502 256 251 1 : tunables 120 60 8 : slabdata 2 2 0
inode_cache 54 696 744 87 1 : tunables 54 27 8 : slabdata 8 8 0
dentry 1577 17794 224 287 1 : tunables 120 60 8 : slabdata 62 62 0
filp 706 3765 256 251 1 : tunables 120 60 8 : slabdata 15 15 0
names_cache 46 105 4096 15 1 : tunables 24 12 8 : slabdata 7 7 0
buffer_head 3557 125442 104 606 1 : tunables 120 60 8 : slabdata 207 207 0
mm_struct 76 288 896 72 1 : tunables 54 27 8 : slabdata 4 4 0
vm_area_struct 1340 2178 176 363 1 : tunables 120 60 8 : slabdata 6 6 36
fs_cache 61 992 128 496 1 : tunables 120 60 8 : slabdata 2 2 0
files_cache 62 336 768 84 1 : tunables 54 27 8 : slabdata 4 4 0
signal_cache 161 588 768 84 1 : tunables 54 27 8 : slabdata 7 7 0
sighand_cache 157 390 1664 39 1 : tunables 24 12 8 : slabdata 10 10 0
anon_vma 657 2976 40 1488 1 : tunables 120 60 8 : slabdata 2 2 0
pid 160 992 128 496 1 : tunables 120 60 8 : slabdata 2 2 0
shared_policy_node 0 0 48 1259 1 : tunables 120 60 8 : slabdata 0 0 0
numa_policy 7 244 264 244 1 : tunables 54 27 8 : slabdata 1 1 0
idr_layer_cache 150 476 544 119 1 : tunables 54 27 8 : slabdata 4 4 0
size-33554432(DMA) 0 0 33554432 1 512 : tunables 1 1 0 : slabdata 0 0 0
size-33554432 0 0 33554432 1 512 : tunables 1 1 0 : slabdata 0 0 0
size-16777216(DMA) 0 0 16777216 1 256 : tunables 1 1 0 : slabdata 0 0 0
size-16777216 0 0 16777216 1 256 : tunables 1 1 0 : slabdata 0 0 0
size-8388608(DMA) 0 0 8388608 1 128 : tunables 1 1 0 : slabdata 0 0 0
size-8388608 0 0 8388608 1 128 : tunables 1 1 0 : slabdata 0 0 0
size-4194304(DMA) 0 0 4194304 1 64 : tunables 1 1 0 : slabdata 0 0 0
size-4194304 0 0 4194304 1 64 : tunables 1 1 0 : slabdata 0 0 0
size-2097152(DMA) 0 0 2097152 1 32 : tunables 1 1 0 : slabdata 0 0 0
size-2097152 0 0 2097152 1 32 : tunables 1 1 0 : slabdata 0 0 0
size-1048576(DMA) 0 0 1048576 1 16 : tunables 1 1 0 : slabdata 0 0 0
size-1048576 0 0 1048576 1 16 : tunables 1 1 0 : slabdata 0 0 0
size-524288(DMA) 0 0 524288 1 8 : tunables 1 1 0 : slabdata 0 0 0
size-524288 0 0 524288 1 8 : tunables 1 1 0 : slabdata 0 0 0
size-262144(DMA) 0 0 262144 1 4 : tunables 1 1 0 : slabdata 0 0 0
size-262144 0 0 262144 1 4 : tunables 1 1 0 : slabdata 0 0 0
size-131072(DMA) 0 0 131072 1 2 : tunables 8 4 0 : slabdata 0 0 0
size-131072 1 1 131072 1 2 : tunables 8 4 0 : slabdata 1 1 0
size-65536(DMA) 0 0 65536 1 1 : tunables 24 12 8 : slabdata 0 0 0
size-65536 4 4 65536 1 1 : tunables 24 12 8 : slabdata 4 4 0
size-32768(DMA) 0 0 32768 2 1 : tunables 24 12 8 : slabdata 0 0 0
size-32768 12 14 32768 2 1 : tunables 24 12 8 : slabdata 7 7 0
size-16384(DMA) 0 0 16384 4 1 : tunables 24 12 8 : slabdata 0 0 0
size-16384 15 28 16384 4 1 : tunables 24 12 8 : slabdata 7 7 0
size-8192(DMA) 0 0 8192 8 1 : tunables 24 12 8 : slabdata 0 0 0
size-8192 2455 2472 8192 8 1 : tunables 24 12 8 : slabdata 309 309 0
size-4096(DMA) 0 0 4096 15 1 : tunables 24 12 8 : slabdata 0 0 0
size-4096 1607 1665 4096 15 1 : tunables 24 12 8 : slabdata 111 111 0
size-2048(DMA) 0 0 2048 31 1 : tunables 24 12 8 : slabdata 0 0 0
size-2048 2706 2914 2048 31 1 : tunables 24 12 8 : slabdata 94 94 0
size-1024(DMA) 0 0 1024 63 1 : tunables 54 27 8 : slabdata 0 0 0
size-1024 2414 2583 1024 63 1 : tunables 54 27 8 : slabdata 41 41 0
size-512(DMA) 0 0 512 126 1 : tunables 54 27 8 : slabdata 0 0 0
size-512 1805 2142 512 126 1 : tunables 54 27 8 : slabdata 17 17 0
size-256(DMA) 0 0 256 251 1 : tunables 120 60 8 : slabdata 0 0 0
size-256 44889 48945 256 251 1 : tunables 120 60 8 : slabdata 195 195 0
size-128(DMA) 0 0 128 496 1 : tunables 120 60 8 : slabdata 0 0 0
size-64(DMA) 0 0 64 962 1 : tunables 120 60 8 : slabdata 0 0 0
size-128 28119 30256 128 496 1 : tunables 120 60 8 : slabdata 61 61 0
size-64 14597 22126 64 962 1 : tunables 120 60 8 : slabdata 23 23 0
kmem_cache 151 155 12416 5 1 : tunables 24 12 8 : slabdata 31 31 0
<SLUB>
% cat /proc/meminfo
MemTotal: 7701376 kB
MemFree: 4740928 kB
Buffers: 4544 kB
Cached: 35584 kB
SwapCached: 0 kB
Active: 119104 kB
Inactive: 9920 kB
Active(anon): 90240 kB
Inactive(anon): 0 kB
Active(file): 28864 kB
Inactive(file): 9920 kB
Unevictable: 0 kB
Mlocked: 0 kB
SwapTotal: 2031488 kB
SwapFree: 2031488 kB
Dirty: 64 kB
Writeback: 0 kB
AnonPages: 89152 kB
Mapped: 31232 kB
Slab: 1591680 kB
SReclaimable: 12608 kB
SUnreclaim: 1579072 kB
PageTables: 11904 kB
NFS_Unstable: 0 kB
Bounce: 0 kB
WritebackTmp: 0 kB
CommitLimit: 5882176 kB
Committed_AS: 446848 kB
VmallocTotal: 17592177655808 kB
VmallocUsed: 29056 kB
VmallocChunk: 17592177626432 kB
HugePages_Total: 0
HugePages_Free: 0
HugePages_Rsvd: 0
HugePages_Surp: 0
Hugepagesize: 262144 kB
% cat /proc/slabinfo
slabinfo - version: 2.1
# name <active_objs> <num_objs> <objsize> <objperslab> <pagesperslab> : tunables <limit> <batchcount> <sharedfactor> : slabdata <active_slabs> <num_slabs> <sharedavail>
kcopyd_job 0 0 408 160 1 : tunables 0 0 0 : slabdata 0 0 0
cfq_io_context 3120 3120 168 390 1 : tunables 0 0 0 : slabdata 8 8 0
cfq_queue 3848 3848 136 481 1 : tunables 0 0 0 : slabdata 8 8 0
mqueue_inode_cache 56 56 1152 56 1 : tunables 0 0 0 : slabdata 1 1 0
fat_inode_cache 77 77 848 77 1 : tunables 0 0 0 : slabdata 1 1 0
fat_cache 0 0 40 1638 1 : tunables 0 0 0 : slabdata 0 0 0
hugetlbfs_inode_cache 83 83 784 83 1 : tunables 0 0 0 : slabdata 1 1 0
ext2_inode_cache 0 0 1032 63 1 : tunables 0 0 0 : slabdata 0 0 0
journal_handle 21840 21840 24 2730 1 : tunables 0 0 0 : slabdata 8 8 0
journal_head 4774 4774 96 682 1 : tunables 0 0 0 : slabdata 7 7 0
revoke_table 4096 4096 16 4096 1 : tunables 0 0 0 : slabdata 1 1 0
revoke_record 2048 2048 32 2048 1 : tunables 0 0 0 : slabdata 1 1 0
ext4_inode_cache 0 0 1200 54 1 : tunables 0 0 0 : slabdata 0 0 0
ext4_alloc_context 0 0 168 390 1 : tunables 0 0 0 : slabdata 0 0 0
ext4_prealloc_space 0 0 120 546 1 : tunables 0 0 0 : slabdata 0 0 0
ext3_inode_cache 750 2624 1024 64 1 : tunables 0 0 0 : slabdata 41 41 0
ext3_xattr 4464 4464 88 744 1 : tunables 0 0 0 : slabdata 6 6 0
shmem_inode_cache 1256 1365 1008 65 1 : tunables 0 0 0 : slabdata 21 21 0
nsproxy 0 0 56 1170 1 : tunables 0 0 0 : slabdata 0 0 0
posix_timers_cache 0 0 184 356 1 : tunables 0 0 0 : slabdata 0 0 0
ip_dst_cache 1360 1360 384 170 1 : tunables 0 0 0 : slabdata 8 8 0
TCP 180 180 1792 36 1 : tunables 0 0 0 : slabdata 5 5 0
scsi_data_buffer 21840 21840 24 2730 1 : tunables 0 0 0 : slabdata 8 8 0
scsi_io_context 0 0 112 585 1 : tunables 0 0 0 : slabdata 0 0 0
blkdev_queue 140 140 1864 70 2 : tunables 0 0 0 : slabdata 2 2 0
blkdev_requests 1720 1720 304 215 1 : tunables 0 0 0 : slabdata 8 8 0
sock_inode_cache 758 949 896 73 1 : tunables 0 0 0 : slabdata 13 13 0
file_lock_cache 2289 2289 200 327 1 : tunables 0 0 0 : slabdata 7 7 0
Acpi-ParseExt 29117 29120 72 910 1 : tunables 0 0 0 : slabdata 32 32 0
page_cgroup 14660 24570 40 1638 1 : tunables 0 0 0 : slabdata 15 15 0
proc_inode_cache 732 810 800 81 1 : tunables 0 0 0 : slabdata 10 10 0
sigqueue 3272 3272 160 409 1 : tunables 0 0 0 : slabdata 8 8 0
radix_tree_node 1200 1755 560 117 1 : tunables 0 0 0 : slabdata 15 15 0
bdev_cache 256 256 1024 64 1 : tunables 0 0 0 : slabdata 4 4 0
sysfs_dir_cache 16376 16380 80 819 1 : tunables 0 0 0 : slabdata 20 20 0
inode_cache 707 957 752 87 1 : tunables 0 0 0 : slabdata 11 11 0
dentry 3503 11096 224 292 1 : tunables 0 0 0 : slabdata 38 38 0
buffer_head 6920 23985 112 585 1 : tunables 0 0 0 : slabdata 41 41 0
mm_struct 741 1022 896 73 1 : tunables 0 0 0 : slabdata 14 14 0
vm_area_struct 4015 5208 176 372 1 : tunables 0 0 0 : slabdata 14 14 0
signal_cache 801 1020 768 85 1 : tunables 0 0 0 : slabdata 12 12 0
sighand_cache 433 546 1664 39 1 : tunables 0 0 0 : slabdata 14 14 0
anon_vma 10920 10920 48 1365 1 : tunables 0 0 0 : slabdata 8 8 0
shared_policy_node 5460 5460 48 1365 1 : tunables 0 0 0 : slabdata 4 4 0
numa_policy 248 248 264 248 1 : tunables 0 0 0 : slabdata 1 1 0
idr_layer_cache 944 944 552 118 1 : tunables 0 0 0 : slabdata 8 8 0
kmalloc-65536 32 32 65536 4 4 : tunables 0 0 0 : slabdata 8 8 0
kmalloc-32768 128 128 32768 16 8 : tunables 0 0 0 : slabdata 8 8 0
kmalloc-16384 160 160 16384 32 8 : tunables 0 0 0 : slabdata 5 5 0
kmalloc-8192 448 448 8192 64 8 : tunables 0 0 0 : slabdata 7 7 0
kmalloc-4096 819 14336 4096 64 4 : tunables 0 0 0 : slabdata 224 224 0
kmalloc-2048 2409 8384 2048 64 2 : tunables 0 0 0 : slabdata 131 131 0
kmalloc-1024 1848 14912 1024 64 1 : tunables 0 0 0 : slabdata 233 233 0
kmalloc-512 2306 2432 512 128 1 : tunables 0 0 0 : slabdata 19 19 0
kmalloc-256 13919 123904 256 256 1 : tunables 0 0 0 : slabdata 484 484 0
kmalloc-128 28739 10747904 128 512 1 : tunables 0 0 0 : slabdata 20992 20992 0
kmalloc-64 10224 10240 64 1024 1 : tunables 0 0 0 : slabdata 10 10 0
kmalloc-32 34806 34816 32 2048 1 : tunables 0 0 0 : slabdata 17 17 0
kmalloc-16 32768 32768 16 4096 1 : tunables 0 0 0 : slabdata 8 8 0
kmalloc-8 65536 65536 8 8192 1 : tunables 0 0 0 : slabdata 8 8 0
kmalloc-192 4609 447051 192 341 1 : tunables 0 0 0 : slabdata 1311 1311 0
kmalloc-96 5456 5456 96 682 1 : tunables 0 0 0 : slabdata 8 8 0
kmem_cache_node 3276 3276 80 819 1 : tunables 0 0 0 : slabdata 4 4 0
% slabinfo
Name Objects Objsize Space Slabs/Part/Cpu O/S O %Fr %Ef Flg
:at-0000016 4096 16 65.5K 0/0/1 4096 0 0 100 *a
:at-0000024 21840 24 524.2K 0/0/8 2730 0 0 99 *a
:at-0000032 2048 32 65.5K 0/0/1 2048 0 0 100 *Aa
:at-0000088 4464 88 393.2K 0/0/6 744 0 0 99 *a
:at-0000096 4774 96 458.7K 0/0/7 682 0 0 99 *a
:t-0000016 32768 16 524.2K 0/0/8 4096 0 0 100 *
:t-0000024 21840 24 524.2K 0/0/8 2730 0 0 99 *
:t-0000032 34806 32 1.1M 9/1/8 2048 0 5 99 *
:t-0000040 14660 40 983.0K 7/7/8 1638 0 46 59 *
:t-0000048 5460 48 262.1K 0/0/4 1365 0 0 99 *
:t-0000064 10224 64 655.3K 2/1/8 1024 0 10 99 *
:t-0000072 29117 72 2.0M 26/2/6 910 0 6 99 *
:t-0000080 16376 80 1.3M 12/1/8 819 0 5 99 *
:t-0000096 5456 96 524.2K 0/0/8 682 0 0 99 *
:t-0000128 28739 128 1.3G 20984/20984/8 512 0 99 0 *
:t-0000256 15285 256 31.7M 476/438/8 256 0 90 12 *
:t-0000384 1360 352 524.2K 0/0/8 170 0 0 91 *A
:t-0000512 2306 512 1.2M 11/3/8 128 0 15 94 *
:t-0000768 801 768 786.4K 4/4/8 85 0 33 78 *A
:t-0000896 741 880 917.5K 6/5/8 73 0 35 71 *A
:t-0001024 1848 1024 15.2M 225/214/8 64 0 91 12 *
:t-0002048 2406 2048 17.1M 123/115/8 64 1 87 28 *
:t-0004096 819 4096 58.7M 216/216/8 64 2 96 5 *
anon_vma 10920 40 524.2K 0/0/8 1365 0 0 83
bdev_cache 256 1008 262.1K 0/0/4 64 0 0 98 Aa
blkdev_queue 140 1864 262.1K 0/0/2 70 1 0 99
blkdev_requests 1720 304 524.2K 0/0/8 215 0 0 99
buffer_head 7493 104 2.6M 33/32/8 585 0 78 29 a
cfq_io_context 3120 168 524.2K 0/0/8 390 0 0 99
cfq_queue 3848 136 524.2K 0/0/8 481 0 0 99
dentry 3793 224 2.4M 30/29/8 292 0 76 34 a
ext3_inode_cache 750 1016 2.6M 33/33/8 64 0 80 28 a
fat_inode_cache 77 840 65.5K 0/0/1 77 0 0 98 a
file_lock_cache 2289 192 458.7K 0/0/7 327 0 0 95
hugetlbfs_inode_cache 83 776 65.5K 0/0/1 83 0 0 98
idr_layer_cache 944 544 524.2K 0/0/8 118 0 0 97
inode_cache 1044 744 786.4K 4/0/8 87 0 0 98 a
kmalloc-16384 160 16384 2.6M 0/0/5 32 3 0 100
kmalloc-192 4609 192 85.9M 1303/1303/8 341 0 99 1
kmalloc-32768 128 32768 4.1M 0/0/8 16 3 0 100
kmalloc-65536 32 65536 2.0M 0/0/8 4 2 0 100
kmalloc-8 65536 8 524.2K 0/0/8 8192 0 0 100
kmalloc-8192 448 8192 3.6M 0/0/7 64 3 0 100
kmem_cache_node 3276 80 262.1K 0/0/4 819 0 0 99 *
mqueue_inode_cache 56 1064 65.5K 0/0/1 56 0 0 90 A
numa_policy 248 264 65.5K 0/0/1 248 0 0 99
proc_inode_cache 732 792 655.3K 2/1/8 81 0 10 88 a
radix_tree_node 1200 552 983.0K 7/7/8 117 0 46 67 a
shmem_inode_cache 1256 1000 1.3M 13/4/8 65 0 19 91
sighand_cache 433 1608 917.5K 6/4/8 39 0 28 75 A
sigqueue 3272 160 524.2K 0/0/8 409 0 0 99
sock_inode_cache 758 832 851.9K 5/4/8 73 0 30 74 Aa
TCP 180 1712 327.6K 0/0/5 36 0 0 94 A
vm_area_struct 4015 176 917.5K 6/6/8 372 0 42 77
^ permalink raw reply [flat|nested] 64+ messages in thread
* Re: No, really, stop trying to delete slab until you've finished making slub perform as well
2008-08-13 10:46 ` KOSAKI Motohiro
@ 2008-08-13 13:10 ` Christoph Lameter
2008-08-13 14:14 ` KOSAKI Motohiro
2008-08-14 7:15 ` Pekka Enberg
0 siblings, 2 replies; 64+ messages in thread
From: Christoph Lameter @ 2008-08-13 13:10 UTC (permalink / raw)
To: KOSAKI Motohiro
Cc: KOSAKI Motohiro, Matthew Wilcox, Pekka Enberg, akpm, linux-kernel,
linux-fsdevel, Mel Gorman, andi, Rik van Riel
KOSAKI Motohiro wrote:
> <SLUB>
>
> % cat /proc/meminfo
>
> Slab: 1591680 kB
> SReclaimable: 12608 kB
> SUnreclaim: 1579072 kB
Unreclaimable grew very big.
> :t-0000128 28739 128 1.3G 20984/20984/8 512 0 99 0 *
Argh. Most slabs contain a single object. Probably due to the conflict resolution.
> kmalloc-192 4609 192 85.9M 1303/1303/8 341 0 99 1
And a similar but not so severe issue here.
The obvious fix is to avoid allocating another slab on conflict but how will
this impact performance?
Index: linux-2.6/mm/slub.c
===================================================================
--- linux-2.6.orig/mm/slub.c 2008-08-13 08:06:00.000000000 -0500
+++ linux-2.6/mm/slub.c 2008-08-13 08:07:59.000000000 -0500
@@ -1253,13 +1253,11 @@
static inline int lock_and_freeze_slab(struct kmem_cache_node *n,
struct page *page)
{
- if (slab_trylock(page)) {
- list_del(&page->lru);
- n->nr_partial--;
- __SetPageSlubFrozen(page);
- return 1;
- }
- return 0;
+ slab_lock(page);
+ list_del(&page->lru);
+ n->nr_partial--;
+ __SetPageSlubFrozen(page);
+ return 1;
}
^ permalink raw reply [flat|nested] 64+ messages in thread
* Re: No, really, stop trying to delete slab until you've finished making slub perform as well
2008-08-13 13:10 ` Christoph Lameter
@ 2008-08-13 14:14 ` KOSAKI Motohiro
2008-08-13 14:16 ` Pekka Enberg
2008-08-13 14:31 ` Christoph Lameter
2008-08-14 7:15 ` Pekka Enberg
1 sibling, 2 replies; 64+ messages in thread
From: KOSAKI Motohiro @ 2008-08-13 14:14 UTC (permalink / raw)
To: Christoph Lameter
Cc: Matthew Wilcox, Pekka Enberg, akpm, linux-kernel, linux-fsdevel,
Mel Gorman, andi, Rik van Riel
>> :t-0000128 28739 128 1.3G 20984/20984/8 512 0 99 0 *
>
> Argh. Most slabs contain a single object. Probably due to the conflict resolution.
Agreed, the issue exists in the lock contention code.
> The obvious fix is to avoid allocating another slab on conflict but how will
> this impact performance?
>
>
> Index: linux-2.6/mm/slub.c
> ===================================================================
> --- linux-2.6.orig/mm/slub.c 2008-08-13 08:06:00.000000000 -0500
> +++ linux-2.6/mm/slub.c 2008-08-13 08:07:59.000000000 -0500
> @@ -1253,13 +1253,11 @@
> static inline int lock_and_freeze_slab(struct kmem_cache_node *n,
> struct page *page)
> {
> - if (slab_trylock(page)) {
> - list_del(&page->lru);
> - n->nr_partial--;
> - __SetPageSlubFrozen(page);
> - return 1;
> - }
> - return 0;
> + slab_lock(page);
> + list_del(&page->lru);
> + n->nr_partial--;
> + __SetPageSlubFrozen(page);
> + return 1;
> }
I haven't measured it yet. I don't like this patch;
it may hurt other typical benchmarks.
So I think a better way is:
1. slab_trylock(), if success goto 10.
2. check fragmentation ratio, if low goto 10
3. slab_lock()
10. return func
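Roughly, as a sketch against lock_and_freeze_slab() (the sparseness check and its 25% threshold are made up here; only the control flow matters):

/* Hypothetical sparseness check: is the slab mostly empty?
 * Racy unlocked read, which is fine for a heuristic. */
static inline int partial_is_sparse(struct page *page)
{
	return page->inuse * 4 < page->objects;	/* made-up threshold */
}

static inline int lock_and_freeze_slab(struct kmem_cache_node *n,
					struct page *page)
{
	if (!slab_trylock(page)) {
		/* Contended: only wait for badly fragmented slabs,
		 * otherwise let the caller look at the next one. */
		if (!partial_is_sparse(page))
			return 0;
		slab_lock(page);
	}
	list_del(&page->lru);
	n->nr_partial--;
	__SetPageSlubFrozen(page);
	return 1;
}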
I think this way doesn't cause a performance regression,
because high fragmentation causes defragmentation and compaction later on.
So preventing fragmentation often increases performance.
Thoughts?
^ permalink raw reply [flat|nested] 64+ messages in thread
* Re: No, really, stop trying to delete slab until you've finished making slub perform as well
2008-08-13 14:14 ` KOSAKI Motohiro
@ 2008-08-13 14:16 ` Pekka Enberg
2008-08-13 14:31 ` Christoph Lameter
1 sibling, 0 replies; 64+ messages in thread
From: Pekka Enberg @ 2008-08-13 14:16 UTC (permalink / raw)
To: KOSAKI Motohiro
Cc: Christoph Lameter, Matthew Wilcox, akpm, linux-kernel,
linux-fsdevel, Mel Gorman, andi, Rik van Riel
On Wed, 2008-08-13 at 23:14 +0900, KOSAKI Motohiro wrote:
> >> :t-0000128 28739 128 1.3G 20984/20984/8 512 0 99 0 *
> >
> > Argh. Most slabs contain a single object. Probably due to the conflict resolution.
>
> agreed with the issue exist in lock contention code.
>
>
> > The obvious fix is to avoid allocating another slab on conflict but how will
> > this impact performance?
> >
> >
> > Index: linux-2.6/mm/slub.c
> > ===================================================================
> > --- linux-2.6.orig/mm/slub.c 2008-08-13 08:06:00.000000000 -0500
> > +++ linux-2.6/mm/slub.c 2008-08-13 08:07:59.000000000 -0500
> > @@ -1253,13 +1253,11 @@
> > static inline int lock_and_freeze_slab(struct kmem_cache_node *n,
> > struct page *page)
> > {
> > - if (slab_trylock(page)) {
> > - list_del(&page->lru);
> > - n->nr_partial--;
> > - __SetPageSlubFrozen(page);
> > - return 1;
> > - }
> > - return 0;
> > + slab_lock(page);
> > + list_del(&page->lru);
> > + n->nr_partial--;
> > + __SetPageSlubFrozen(page);
> > + return 1;
> > }
>
> I haven't measured it yet. I don't like this patch;
> it may hurt other typical benchmarks.
>
> So I think a better way is:
>
> 1. slab_trylock(), if success goto 10.
> 2. check fragmentation ratio, if low goto 10
> 3. slab_lock()
> 10. return func
>
> I think this way doesn't cause a performance regression,
> because high fragmentation causes defragmentation and compaction later on.
> So preventing fragmentation often increases performance.
>
> Thoughts?
I guess that would work. But how exactly would you quantify
"fragmentation ratio?"
^ permalink raw reply [flat|nested] 64+ messages in thread
* Re: No, really, stop trying to delete slab until you've finished making slub perform as well
2008-08-13 14:14 ` KOSAKI Motohiro
2008-08-13 14:16 ` Pekka Enberg
@ 2008-08-13 14:31 ` Christoph Lameter
2008-08-13 15:05 ` KOSAKI Motohiro
1 sibling, 1 reply; 64+ messages in thread
From: Christoph Lameter @ 2008-08-13 14:31 UTC (permalink / raw)
To: KOSAKI Motohiro
Cc: Matthew Wilcox, Pekka Enberg, akpm, linux-kernel, linux-fsdevel,
Mel Gorman, andi, Rik van Riel
KOSAKI Motohiro wrote:
>
> I haven't measured it yet. I don't like this patch;
> it may hurt other typical benchmarks.
Yes but running with this patch would allow us to verify that we understand
what is causing the problem. There are other solutions like skipping to the
next partial slab on the list that could fix performance issues that the patch
may cause. A test will give us:
1. Confirmation that the memory use is caused by the trylock.
2. Some performance numbers. If these show a regression then we have some
markers that we can measure other solutions against.
^ permalink raw reply [flat|nested] 64+ messages in thread
* Re: No, really, stop trying to delete slab until you've finished making slub perform as well
2008-08-13 14:31 ` Christoph Lameter
@ 2008-08-13 15:05 ` KOSAKI Motohiro
2008-08-14 19:44 ` Christoph Lameter
0 siblings, 1 reply; 64+ messages in thread
From: KOSAKI Motohiro @ 2008-08-13 15:05 UTC (permalink / raw)
To: Christoph Lameter
Cc: Matthew Wilcox, Pekka Enberg, akpm, linux-kernel, linux-fsdevel,
Mel Gorman, andi, Rik van Riel
> Yes but running with this patch would allow us to verify that we understand
> what is causing the problem. There are other solutions like skipping to the
> next partial slab on the list that could fix performance issues that the patch
> may cause. A test will give us:
>
> 1. Confirmation that the memory use is caused by the trylock.
>
> 2. Some performance numbers. If these show a regression then we have some
> markers that we can measure other solutions against.
Okay.
I will verify the patch next week.
(Unfortunately, my company is closed for the rest of this week.)
Thanks.
^ permalink raw reply [flat|nested] 64+ messages in thread
* Re: No, really, stop trying to delete slab until you've finished making slub perform as well
2008-08-13 13:10 ` Christoph Lameter
2008-08-13 14:14 ` KOSAKI Motohiro
@ 2008-08-14 7:15 ` Pekka Enberg
2008-08-14 14:45 ` Christoph Lameter
2008-08-14 15:06 ` Christoph Lameter
1 sibling, 2 replies; 64+ messages in thread
From: Pekka Enberg @ 2008-08-14 7:15 UTC (permalink / raw)
To: Christoph Lameter
Cc: KOSAKI Motohiro, KOSAKI Motohiro, Matthew Wilcox, akpm,
linux-kernel, linux-fsdevel, Mel Gorman, andi, Rik van Riel
Hi Christoph,
Christoph Lameter wrote:
> The obvious fix is to avoid allocating another slab on conflict but how will
> this impact performance?
>
>
> Index: linux-2.6/mm/slub.c
> ===================================================================
> --- linux-2.6.orig/mm/slub.c 2008-08-13 08:06:00.000000000 -0500
> +++ linux-2.6/mm/slub.c 2008-08-13 08:07:59.000000000 -0500
> @@ -1253,13 +1253,11 @@
> static inline int lock_and_freeze_slab(struct kmem_cache_node *n,
> struct page *page)
> {
> - if (slab_trylock(page)) {
> - list_del(&page->lru);
> - n->nr_partial--;
> - __SetPageSlubFrozen(page);
> - return 1;
> - }
> - return 0;
> + slab_lock(page);
> + list_del(&page->lru);
> + n->nr_partial--;
> + __SetPageSlubFrozen(page);
> + return 1;
> }
This patch hard locks on my 2-way 64-bit x86 machine (sysrq doesn't
respond) when I run hackbench.
^ permalink raw reply [flat|nested] 64+ messages in thread
* Re: No, really, stop trying to delete slab until you've finished making slub perform as well
2008-08-14 7:15 ` Pekka Enberg
@ 2008-08-14 14:45 ` Christoph Lameter
2008-08-14 15:06 ` Christoph Lameter
1 sibling, 0 replies; 64+ messages in thread
From: Christoph Lameter @ 2008-08-14 14:45 UTC (permalink / raw)
To: Pekka Enberg
Cc: KOSAKI Motohiro, KOSAKI Motohiro, Matthew Wilcox, akpm,
linux-kernel, linux-fsdevel, Mel Gorman, andi, Rik van Riel
Pekka Enberg wrote:
> This patch hard locks on my 2-way 64-bit x86 machine (sysrq doesn't
> respond) when I run hackbench.
Hmmm.. Then the issue may be different from what we thought. The lock may be
taken recursively in some situations.
Can you enable lockdep?
^ permalink raw reply [flat|nested] 64+ messages in thread
* Re: No, really, stop trying to delete slab until you've finished making slub perform as well
2008-08-14 7:15 ` Pekka Enberg
2008-08-14 14:45 ` Christoph Lameter
@ 2008-08-14 15:06 ` Christoph Lameter
1 sibling, 0 replies; 64+ messages in thread
From: Christoph Lameter @ 2008-08-14 15:06 UTC (permalink / raw)
To: Pekka Enberg
Cc: KOSAKI Motohiro, KOSAKI Motohiro, Matthew Wilcox, akpm,
linux-kernel, linux-fsdevel, Mel Gorman, andi, Rik van Riel
Pekka Enberg wrote:
>
> This patch hard locks on my 2-way 64-bit x86 machine (sysrq doesn't
> respond) when I run hackbench.
At that point we take the list_lock and then the slab lock, which is a
lock inversion if we do not use a trylock here. Crap.
Hmmm.. The code already goes to the next slab if an earlier one is
already locked. So I do not see how the large partial lists could be
generated.
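For reference, the two orderings that collide look roughly like this (a simplified comment sketch of the mm/slub.c paths of that time):

/*
 * Simplified sketch of the inversion:
 *
 *   CPU A: get_partial_node()          CPU B: deactivate_slab()
 *     spin_lock(&n->list_lock);          slab_lock(page);
 *     slab_lock(page);    <- waits       unfreeze_slab() -> add_partial():
 *                                          spin_lock(&n->list_lock); <- waits
 *
 * Each side holds one lock and waits for the other. The slab_trylock()
 * in lock_and_freeze_slab() breaks the cycle by giving up instead of
 * blocking on the slab lock.
 */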
^ permalink raw reply [flat|nested] 64+ messages in thread
* Re: No, really, stop trying to delete slab until you've finished making slub perform as well
2008-08-13 15:05 ` KOSAKI Motohiro
@ 2008-08-14 19:44 ` Christoph Lameter
2008-08-15 16:44 ` KOSAKI Motohiro
0 siblings, 1 reply; 64+ messages in thread
From: Christoph Lameter @ 2008-08-14 19:44 UTC (permalink / raw)
To: KOSAKI Motohiro
Cc: Matthew Wilcox, Pekka Enberg, akpm, linux-kernel, linux-fsdevel,
Mel Gorman, andi, Rik van Riel
This is a NUMA system, right? Then we have another mechanism that will avoid
off-node memory references by allocating new slabs. Can you set the
node_defrag parameter to 0? (Noted by Adrian.)
^ permalink raw reply [flat|nested] 64+ messages in thread
* Re: No, really, stop trying to delete slab until you've finished making slub perform as well
2008-08-14 19:44 ` Christoph Lameter
@ 2008-08-15 16:44 ` KOSAKI Motohiro
2008-08-15 18:24 ` Christoph Lameter
0 siblings, 1 reply; 64+ messages in thread
From: KOSAKI Motohiro @ 2008-08-15 16:44 UTC (permalink / raw)
To: Christoph Lameter
Cc: Matthew Wilcox, Pekka Enberg, akpm, linux-kernel, linux-fsdevel,
Mel Gorman, andi, Rik van Riel
> This is a NUMA system, right?
True.
My system is:
CPU: ia64 x8
MEM: 8G (4G x 2node)
> Then we have another mechanism that will avoid
> off node memory references by allocating new slabs. Can you set the
> node_defrag parameter to 0? (Noted by Adrian).
Please let me know how to do that.
^ permalink raw reply [flat|nested] 64+ messages in thread
* Re: No, really, stop trying to delete slab until you've finished making slub perform as well
2008-08-15 16:44 ` KOSAKI Motohiro
@ 2008-08-15 18:24 ` Christoph Lameter
2008-08-15 19:42 ` Christoph Lameter
0 siblings, 1 reply; 64+ messages in thread
From: Christoph Lameter @ 2008-08-15 18:24 UTC (permalink / raw)
To: KOSAKI Motohiro
Cc: Matthew Wilcox, Pekka Enberg, akpm, linux-kernel, linux-fsdevel,
Mel Gorman, andi, Rik van Riel
KOSAKI Motohiro wrote:
>> Then we have another mechanism that will avoid
>> off node memory references by allocating new slabs. Can you set the
>> node_defrag parameter to 0? (Noted by Adrian).
>
> Please let me know how to do that.
The control over the preference for node-local vs. remote defrag occurs
via /sys/kernel/slab/<slabcache>/remote_node_defrag_ratio. The default is 10%.
Comments in get_any_partial() explain the operations.
The default setting means that in 9 out of 10 cases SLUB will prefer creating
a new slab over taking one from the remote node (so the memory is node
local, probably not important in your 2-node case). It will therefore waste
memory, because local memory may be more efficient to use.
Setting remote_node_defrag_ratio to 100 will make slub always take the remote
slab instead of allocating a new one.
/*
* The defrag ratio allows a configuration of the tradeoffs between
* inter node defragmentation and node local allocations. A lower
* defrag_ratio increases the tendency to do local allocations
* instead of attempting to obtain partial slabs from other nodes.
*
* If the defrag_ratio is set to 0 then kmalloc() always
* returns node local objects. If the ratio is higher then kmalloc()
* may return off node objects because partial slabs are obtained
* from other nodes and filled up.
*
* If /sys/kernel/slab/xx/defrag_ratio is set to 100 (which makes
* defrag_ratio = 1000) then every (well almost) allocation will
* first attempt to defrag slab caches on other nodes. This means
* scanning over all nodes to look for partial slabs which may be
* expensive if we do it every time we are trying to find a slab
* with available objects.
*/
if (!s->remote_node_defrag_ratio ||
get_cycles() % 1024 > s->remote_node_defrag_ratio)
return NULL;
^ permalink raw reply [flat|nested] 64+ messages in thread
* Re: No, really, stop trying to delete slab until you've finished making slub perform as well
2008-08-15 18:24 ` Christoph Lameter
@ 2008-08-15 19:42 ` Christoph Lameter
2008-08-18 10:08 ` KOSAKI Motohiro
0 siblings, 1 reply; 64+ messages in thread
From: Christoph Lameter @ 2008-08-15 19:42 UTC (permalink / raw)
To: KOSAKI Motohiro
Cc: Matthew Wilcox, Pekka Enberg, akpm, linux-kernel, linux-fsdevel,
Mel Gorman, andi, Rik van Riel
Christoph Lameter wrote:
> Setting remote_node_defrag_ratio to 100 will make slub always take the remote
> slab instead of allocating a new one.
As pointed out by Adrian D. off list:
The max remote_node_defrag_ratio is 99.
Maybe we need to change the comparison in remote_node_defrag_ratio_store() to
allow 100 to switch off any node local allocs?
^ permalink raw reply [flat|nested] 64+ messages in thread
* Re: No, really, stop trying to delete slab until you've finished making slub perform as well
2008-08-15 19:42 ` Christoph Lameter
@ 2008-08-18 10:08 ` KOSAKI Motohiro
2008-08-18 10:34 ` KOSAKI Motohiro
0 siblings, 1 reply; 64+ messages in thread
From: KOSAKI Motohiro @ 2008-08-18 10:08 UTC (permalink / raw)
To: Christoph Lameter
Cc: kosaki.motohiro, Matthew Wilcox, Pekka Enberg, akpm, linux-kernel,
linux-fsdevel, Mel Gorman, andi, Rik van Riel
> Christoph Lameter wrote:
>
> > Setting remote_node_defrag_ratio to 100 will make slub always take the remote
> > slab instead of allocating a new one.
>
> As pointed out by Adrian D. off list:
>
> The max remote_node_defrag_ratio is 99.
>
> Maybe we need to change the comparison in remote_node_defrag_ratio_store() to
> allow 100 to switch off any node local allocs?
Hmmm,
it doesn't change the behavior.
Here is what I did:
1. slub code change (see below)
Index: b/mm/slub.c
===================================================================
--- a/mm/slub.c
+++ b/mm/slub.c
@@ -4056,7 +4056,7 @@ static ssize_t remote_node_defrag_ratio_
if (err)
return err;
- if (ratio < 100)
+ if (ratio <= 100)
s->remote_node_defrag_ratio = ratio * 10;
return length;
2. change remote defrag ratio
# echo 100 > /sys/kernel/slab/:t-0000128/remote_node_defrag_ratio
# cat /sys/kernel/slab/:t-0000128/remote_node_defrag_ratio
100
3. ran hackbench
4. ./slabinfo
Name Objects Objsize Space Slabs/Part/Cpu O/S O %Fr %Ef Flg
:at-0000016 4096 16 65.5K 0/0/1 4096 0 0 100 *a
:at-0000024 21840 24 524.2K 0/0/8 2730 0 0 99 *a
:at-0000032 2048 32 65.5K 0/0/1 2048 0 0 100 *Aa
:at-0000088 4464 88 393.2K 0/0/6 744 0 0 99 *a
:at-0000096 5456 96 524.2K 0/0/8 682 0 0 99 *a
:t-0000016 32768 16 524.2K 0/0/8 4096 0 0 100 *
:t-0000024 21840 24 524.2K 0/0/8 2730 0 0 99 *
:t-0000032 34806 32 1.1M 9/1/8 2048 0 5 99 *
:t-0000040 14417 40 917.5K 6/6/8 1638 0 42 62 *
:t-0000048 5460 48 262.1K 0/0/4 1365 0 0 99 *
:t-0000064 10224 64 655.3K 2/1/8 1024 0 10 99 *
:t-0000072 29120 72 2.0M 26/0/6 910 0 0 99 *
:t-0000080 16376 80 1.3M 12/1/8 819 0 5 99 *
:t-0000096 5456 96 524.2K 0/0/8 682 0 0 99 *
:t-0000128 28917 128 1.3G 21041/21041/8 512 0 99 0 *
:t-0000256 15280 256 31.4M 472/436/8 256 0 90 12 *
:t-0000384 1360 352 524.2K 0/0/8 170 0 0 91 *A
:t-0000512 2388 512 1.3M 12/4/8 128 0 20 93 *
:t-0000768 851 768 851.9K 5/5/8 85 0 38 76 *A
:t-0000896 742 880 851.9K 5/4/8 73 0 30 76 *A
:t-0001024 1819 1024 15.1M 223/211/8 64 0 91 12 *
:t-0002048 2641 2048 17.9M 129/116/8 64 1 84 30 *
:t-0004096 817 4096 57.1M 210/210/8 64 2 96 5 *
anon_vma 10920 40 524.2K 0/0/8 1365 0 0 83
bdev_cache 256 1008 262.1K 0/0/4 64 0 0 98 Aa
blkdev_queue 140 1864 262.1K 0/0/2 70 1 0 99
blkdev_requests 1720 304 524.2K 0/0/8 215 0 0 99
buffer_head 7284 104 2.5M 31/30/8 585 0 76 29 a
cfq_io_context 3120 168 524.2K 0/0/8 390 0 0 99
cfq_queue 3848 136 524.2K 0/0/8 481 0 0 99
dentry 3775 224 2.5M 31/29/8 292 0 74 33 a
ext3_inode_cache 740 1016 2.4M 30/30/8 64 0 78 30 a
fat_inode_cache 77 840 65.5K 0/0/1 77 0 0 98 a
file_lock_cache 2616 192 524.2K 0/0/8 327 0 0 95
hugetlbfs_inode_cache 83 776 65.5K 0/0/1 83 0 0 98
idr_layer_cache 944 544 524.2K 0/0/8 118 0 0 97
inode_cache 1050 744 851.9K 5/1/8 87 0 7 91 a
kmalloc-16384 160 16384 2.6M 0/0/5 32 3 0 100
kmalloc-192 4578 192 87.5M 1328/1328/8 341 0 99 1
kmalloc-32768 128 32768 4.1M 0/0/8 16 3 0 100
kmalloc-65536 32 65536 2.0M 0/0/8 4 2 0 100
kmalloc-8 65536 8 524.2K 0/0/8 8192 0 0 100
kmalloc-8192 512 8192 4.1M 0/0/8 64 3 0 100
kmem_cache_node 3276 80 262.1K 0/0/4 819 0 0 99 *
mqueue_inode_cache 56 1064 65.5K 0/0/1 56 0 0 90 A
numa_policy 248 264 65.5K 0/0/1 248 0 0 99
proc_inode_cache 655 792 720.8K 3/3/8 81 0 27 71 a
radix_tree_node 1142 552 917.5K 6/6/8 117 0 42 68 a
shmem_inode_cache 1230 1000 1.3M 12/3/8 65 0 15 93
sighand_cache 434 1608 917.5K 6/4/8 39 0 28 76 A
sigqueue 3272 160 524.2K 0/0/8 409 0 0 99
sock_inode_cache 774 832 851.9K 5/3/8 73 0 23 75 Aa
TCP 144 1712 262.1K 0/0/4 36 0 0 94 A
vm_area_struct 4034 176 851.9K 5/5/8 372 0 38 83
^ permalink raw reply [flat|nested] 64+ messages in thread
* Re: No, really, stop trying to delete slab until you've finished making slub perform as well
2008-08-18 10:08 ` KOSAKI Motohiro
@ 2008-08-18 10:34 ` KOSAKI Motohiro
2008-08-18 14:08 ` Christoph Lameter
0 siblings, 1 reply; 64+ messages in thread
From: KOSAKI Motohiro @ 2008-08-18 10:34 UTC (permalink / raw)
To: Christoph Lameter
Cc: kosaki.motohiro, Matthew Wilcox, Pekka Enberg, akpm, linux-kernel,
linux-fsdevel, Mel Gorman, andi, Rik van Riel
> > Christoph Lameter wrote:
> >
> > > Setting remote_node_defrag_ratio to 100 will make slub always take the remote
> > > slab instead of allocating a new one.
> >
> > As pointed out by Adrian D. off list:
> >
> > The max remote_node_defrag_ratio is 99.
> >
> > Maybe we need to change the comparison in remote_node_defrag_ratio_store() to
> > allow 100 to switch off any node local allocs?
>
> Hmmm,
> it doesn't change the behavior at all.
Ah, OK.
I made a mistake.
The new patch is below.
Index: b/mm/slub.c
===================================================================
--- a/mm/slub.c
+++ b/mm/slub.c
@@ -1326,9 +1326,11 @@ static struct page *get_any_partial(stru
* expensive if we do it every time we are trying to find a slab
* with available objects.
*/
+#if 0
if (!s->remote_node_defrag_ratio ||
get_cycles() % 1024 > s->remote_node_defrag_ratio)
return NULL;
+#endif
zonelist = node_zonelist(slab_node(current->mempolicy), flags);
for_each_zone_zonelist(zone, z, zonelist, high_zoneidx) {
The new results are below.
% cat /proc/meminfo
MemTotal: 7701504 kB
MemFree: 5986432 kB
Buffers: 7872 kB
Cached: 38208 kB
SwapCached: 0 kB
Active: 120256 kB
Inactive: 14656 kB
Active(anon): 90304 kB
Inactive(anon): 0 kB
Active(file): 29952 kB
Inactive(file): 14656 kB
Unevictable: 0 kB
Mlocked: 0 kB
SwapTotal: 2031488 kB
SwapFree: 2031488 kB
Dirty: 448 kB
Writeback: 0 kB
AnonPages: 89088 kB
Mapped: 31360 kB
Slab: 69952 kB
SReclaimable: 13376 kB
SUnreclaim: 56576 kB
PageTables: 11648 kB
NFS_Unstable: 0 kB
Bounce: 0 kB
WritebackTmp: 0 kB
CommitLimit: 5882240 kB
Committed_AS: 453440 kB
VmallocTotal: 17592177655808 kB
VmallocUsed: 29312 kB
VmallocChunk: 17592177626112 kB
HugePages_Total: 0
HugePages_Free: 0
HugePages_Rsvd: 0
HugePages_Surp: 0
Hugepagesize: 262144 kB
% slabinfo
Name Objects Objsize Space Slabs/Part/Cpu O/S O %Fr %Ef Flg
:at-0000016 4096 16 65.5K 0/0/1 4096 0 0 100 *a
:at-0000024 21840 24 524.2K 0/0/8 2730 0 0 99 *a
:at-0000032 2048 32 65.5K 0/0/1 2048 0 0 100 *Aa
:at-0000088 2976 88 262.1K 0/0/4 744 0 0 99 *a
:at-0000096 4774 96 458.7K 0/0/7 682 0 0 99 *a
:t-0000016 32768 16 524.2K 0/0/8 4096 0 0 100 *
:t-0000024 21840 24 524.2K 0/0/8 2730 0 0 99 *
:t-0000032 34806 32 1.1M 9/1/8 2048 0 5 99 *
:t-0000040 14279 40 851.9K 5/5/8 1638 0 38 67 *
:t-0000048 5460 48 262.1K 0/0/4 1365 0 0 99 *
:t-0000064 10224 64 655.3K 2/1/8 1024 0 10 99 *
:t-0000072 29109 72 2.0M 26/4/6 910 0 12 99 *
:t-0000080 16379 80 1.3M 12/1/8 819 0 5 99 *
:t-0000096 5456 96 524.2K 0/0/8 682 0 0 99 *
:t-0000128 27831 128 3.6M 48/8/8 512 0 14 97 *
:t-0000256 15401 256 9.8M 143/96/8 256 0 63 39 *
:t-0000384 1360 352 524.2K 0/0/8 170 0 0 91 *A
:t-0000512 2307 512 1.2M 11/3/8 128 0 15 94 *
:t-0000768 755 768 720.8K 3/3/8 85 0 27 80 *A
:t-0000896 728 880 851.9K 5/4/8 73 0 30 75 *A
:t-0001024 1810 1024 1.9M 21/4/8 64 0 13 97 *
:t-0002048 2621 2048 5.5M 34/15/8 64 1 35 97 *
:t-0004096 775 4096 3.4M 5/2/8 64 2 15 93 *
anon_vma 10920 40 524.2K 0/0/8 1365 0 0 83
bdev_cache 192 1008 196.6K 0/0/3 64 0 0 98 Aa
blkdev_queue 140 1864 262.1K 0/0/2 70 1 0 99
blkdev_requests 1720 304 524.2K 0/0/8 215 0 0 99
buffer_head 8020 104 2.7M 34/32/8 585 0 76 30 a
cfq_io_context 3120 168 524.2K 0/0/8 390 0 0 99
cfq_queue 3848 136 524.2K 0/0/8 481 0 0 99
dentry 3798 224 2.5M 31/30/8 292 0 76 33 a
ext3_inode_cache 1127 1016 2.7M 34/34/8 64 0 80 41 a
fat_inode_cache 77 840 65.5K 0/0/1 77 0 0 98 a
file_lock_cache 2289 192 458.7K 0/0/7 327 0 0 95
hugetlbfs_inode_cache 83 776 65.5K 0/0/1 83 0 0 98
idr_layer_cache 944 544 524.2K 0/0/8 118 0 0 97
inode_cache 1044 744 786.4K 4/0/8 87 0 0 98 a
kmalloc-16384 160 16384 2.6M 0/0/5 32 3 0 100
kmalloc-192 3883 192 1.0M 8/8/8 341 0 50 71
kmalloc-32768 128 32768 4.1M 0/0/8 16 3 0 100
kmalloc-65536 32 65536 2.0M 0/0/8 4 2 0 100
kmalloc-8 65536 8 524.2K 0/0/8 8192 0 0 100
kmalloc-8192 512 8192 4.1M 0/0/8 64 3 0 100
kmem_cache_node 3276 80 262.1K 0/0/4 819 0 0 99 *
mqueue_inode_cache 56 1064 65.5K 0/0/1 56 0 0 90 A
numa_policy 248 264 65.5K 0/0/1 248 0 0 99
proc_inode_cache 653 792 655.3K 2/2/8 81 0 20 78 a
radix_tree_node 1221 552 983.0K 7/7/8 117 0 46 68 a
shmem_inode_cache 1218 1000 1.3M 12/3/8 65 0 15 92
sighand_cache 416 1608 851.9K 5/3/8 39 0 23 78 A
sigqueue 3272 160 524.2K 0/0/8 409 0 0 99
sock_inode_cache 758 832 786.4K 4/3/8 73 0 25 80 Aa
TCP 180 1712 327.6K 0/0/5 36 0 0 94 A
vm_area_struct 4054 176 851.9K 5/5/8 372 0 38 83
* Re: No, really, stop trying to delete slab until you've finished making slub perform as well
2008-08-18 10:34 ` KOSAKI Motohiro
@ 2008-08-18 14:08 ` Christoph Lameter
2008-08-19 10:34 ` KOSAKI Motohiro
0 siblings, 1 reply; 64+ messages in thread
From: Christoph Lameter @ 2008-08-18 14:08 UTC (permalink / raw)
To: KOSAKI Motohiro
Cc: Matthew Wilcox, Pekka Enberg, akpm, linux-kernel, linux-fsdevel,
Mel Gorman, andi, Rik van Riel
KOSAKI Motohiro wrote:
> The new patch is below.
>
> Index: b/mm/slub.c
> ===================================================================
> --- a/mm/slub.c
> +++ b/mm/slub.c
> @@ -1326,9 +1326,11 @@ static struct page *get_any_partial(stru
> * expensive if we do it every time we are trying to find a slab
> * with available objects.
> */
> +#if 0
> if (!s->remote_node_defrag_ratio ||
> get_cycles() % 1024 > s->remote_node_defrag_ratio)
> return NULL;
> +#endif
>
> zonelist = node_zonelist(slab_node(current->mempolicy), flags);
> for_each_zone_zonelist(zone, z, zonelist, high_zoneidx) {
Hmmm... so always taking from the partial lists works? That is the same effect
that setting remote_node_defrag_ratio to 100 should have had (it is multiplied
by 10 when storing it).
So it's a NUMA-only phenomenon. How is performance affected?
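For illustration, the arithmetic behind this equivalence can be sketched in a few
lines of user-space C. This is only a sketch of the throttle quoted above, not
kernel code: rand() stands in for get_cycles(), and the helper names below are
made up. It shows that a sysfs write of 100 (stored internally as 1000) lets
almost every get_any_partial() call go on to scan remote partial lists, which is
effectively what the #if 0 patch does unconditionally.

/*
 * Standalone sketch (not kernel code): approximate how often the
 * get_any_partial() throttle would allow a remote partial list scan
 * for a given remote_node_defrag_ratio sysfs setting.  The kernel
 * bails out when get_cycles() % 1024 exceeds the stored ratio;
 * rand() is used here as a stand-in for get_cycles().
 */
#include <stdio.h>
#include <stdlib.h>

static unsigned int store_ratio(unsigned int sysfs_value)
{
	/* Mirrors remote_node_defrag_ratio_store(): 0..100 -> 0..1000. */
	return sysfs_value * 10;
}

static double remote_scan_fraction(unsigned int stored_ratio)
{
	unsigned long hits = 0, trials = 1000000, i;

	for (i = 0; i < trials; i++)
		if (stored_ratio && (unsigned int)(rand() % 1024) <= stored_ratio)
			hits++;		/* would fall through to the zonelist scan */
	return (double)hits / trials;
}

int main(void)
{
	unsigned int sysfs_values[] = { 0, 10, 50, 99, 100 };
	unsigned int i;

	for (i = 0; i < sizeof(sysfs_values) / sizeof(sysfs_values[0]); i++) {
		unsigned int stored = store_ratio(sysfs_values[i]);

		printf("sysfs %3u -> internal %4u -> remote partial scan ~%4.1f%%\n",
		       sysfs_values[i], stored,
		       100.0 * remote_scan_fraction(stored));
	}
	return 0;
}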
* Re: No, really, stop trying to delete slab until you've finished making slub perform as well
2008-08-18 14:08 ` Christoph Lameter
@ 2008-08-19 10:34 ` KOSAKI Motohiro
2008-08-19 13:51 ` Christoph Lameter
0 siblings, 1 reply; 64+ messages in thread
From: KOSAKI Motohiro @ 2008-08-19 10:34 UTC (permalink / raw)
To: Christoph Lameter
Cc: kosaki.motohiro, Matthew Wilcox, Pekka Enberg, akpm, linux-kernel,
linux-fsdevel, Mel Gorman, andi, Rik van Riel
> > +#if 0
> > if (!s->remote_node_defrag_ratio ||
> > get_cycles() % 1024 > s->remote_node_defrag_ratio)
> > return NULL;
> > +#endif
> >
> > zonelist = node_zonelist(slab_node(current->mempolicy), flags);
> > for_each_zone_zonelist(zone, z, zonelist, high_zoneidx) {
>
> Hmmm... so always taking from the partial lists works? That is the same effect
> that setting remote_node_defrag_ratio to 100 should have had (it is multiplied
> by 10 when storing it).
Sorry, I don't know the reason yet.
OK, I'll dig into it more.
> So it's a NUMA-only phenomenon. How is performance affected?
Unfortunately, I can't measure it, because:
- Fujitsu servers can access remote nodes faster than a typical NUMA server,
  so my performance numbers often aren't representative.
- My box (4GB x 2 nodes) is a very small NUMA machine, while this mechanism
  mainly benefits large servers.
IOW, my box didn't show a performance regression, but I don't think that is
typical.
* Re: No, really, stop trying to delete slab until you've finished making slub perform as well
2008-08-19 10:34 ` KOSAKI Motohiro
@ 2008-08-19 13:51 ` Christoph Lameter
2008-08-20 11:46 ` KOSAKI Motohiro
0 siblings, 1 reply; 64+ messages in thread
From: Christoph Lameter @ 2008-08-19 13:51 UTC (permalink / raw)
To: KOSAKI Motohiro
Cc: Matthew Wilcox, Pekka Enberg, akpm, linux-kernel, linux-fsdevel,
Mel Gorman, andi, Rik van Riel
KOSAKI Motohiro wrote:
> IOW, my box didn't show a performance regression, but I don't think that is
> typical.
Well, that is typical for a small NUMA system. Maybe this patch will fix it
for now? Large systems can be tuned by setting the ratio lower.
Subject: slub/NUMA: Disable remote node defragmentation by default
Switch remote node defragmentation off by default. The current settings can
cause excessive node local allocations with hackbench. (Note that this feature
is not related to slab defragmentation).
Signed-off-by: Christoph Lameter <cl@linux-foundation.org>
---
mm/slub.c | 4 ++--
1 file changed, 2 insertions(+), 2 deletions(-)
Index: linux-2.6/mm/slub.c
===================================================================
--- linux-2.6.orig/mm/slub.c 2008-08-19 06:45:54.732348449 -0700
+++ linux-2.6/mm/slub.c 2008-08-19 06:46:12.442348249 -0700
@@ -2312,7 +2312,7 @@ static int kmem_cache_open(struct kmem_c
s->refcount = 1;
#ifdef CONFIG_NUMA
- s->remote_node_defrag_ratio = 100;
+ s->remote_node_defrag_ratio = 1000;
#endif
if (!init_kmem_cache_nodes(s, gfpflags & ~SLUB_DMA))
goto error;
@@ -4058,7 +4058,7 @@ static ssize_t remote_node_defrag_ratio_
if (err)
return err;
- if (ratio < 100)
+ if (ratio <= 100)
s->remote_node_defrag_ratio = ratio * 10;
return length;
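To make the combined effect of the two hunks concrete, the following is a
minimal user-space model, not kernel code; struct toy_cache and the toy_*
functions are made-up stand-ins. After cache creation the internal ratio now
defaults to 1000, and a sysfs write of 0..100 stores ten times that value, so
writing 100 merely restores the default while smaller values make a large
system fall back to node-local slab allocation more often.

/* Toy model of the patched defaults -- made-up names, not kernel code. */
#include <stdio.h>

struct toy_cache {
	unsigned int remote_node_defrag_ratio;	/* internal value, 0..1000 */
};

static void toy_cache_open(struct toy_cache *s)
{
	/* Patched kmem_cache_open() default: was 100, now 1000. */
	s->remote_node_defrag_ratio = 1000;
}

static void toy_store_ratio(struct toy_cache *s, unsigned int sysfs_value)
{
	/* Patched store: "<= 100" now accepts 100; larger writes are ignored. */
	if (sysfs_value <= 100)
		s->remote_node_defrag_ratio = sysfs_value * 10;
}

int main(void)
{
	struct toy_cache s;

	toy_cache_open(&s);
	printf("default internal ratio: %u\n", s.remote_node_defrag_ratio);

	toy_store_ratio(&s, 100);	/* echo 100 > .../remote_node_defrag_ratio */
	printf("after writing 100:      %u\n", s.remote_node_defrag_ratio);

	toy_store_ratio(&s, 10);	/* a large system choosing a lower ratio */
	printf("after writing 10:       %u\n", s.remote_node_defrag_ratio);
	return 0;
}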
* Re: No, really, stop trying to delete slab until you've finished making slub perform as well
2008-08-19 13:51 ` Christoph Lameter
@ 2008-08-20 11:46 ` KOSAKI Motohiro
0 siblings, 0 replies; 64+ messages in thread
From: KOSAKI Motohiro @ 2008-08-20 11:46 UTC (permalink / raw)
To: Christoph Lameter
Cc: kosaki.motohiro, Matthew Wilcox, Pekka Enberg, akpm, linux-kernel,
linux-fsdevel, Mel Gorman, andi, Rik van Riel
> KOSAKI Motohiro wrote:
>
> > IOW, my box didn't show a performance regression, but I don't think that is
> > typical.
>
> Well, that is typical for a small NUMA system. Maybe this patch will fix it
> for now? Large systems can be tuned by setting the ratio lower.
>
>
> Subject: slub/NUMA: Disable remote node defragmentation by default
>
> Switch remote node defragmentation off by default. The current settings can
> cause excessive node local allocations with hackbench. (Note that this feature
> is not related to slab defragmentation).
OK.
I confirmed this patch works well.
Tested-by: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
>
> Signed-off-by: Christoph Lameter <cl@linux-foundation.org>
>
> ---
> mm/slub.c | 4 ++--
> 1 file changed, 2 insertions(+), 2 deletions(-)
>
> Index: linux-2.6/mm/slub.c
> ===================================================================
> --- linux-2.6.orig/mm/slub.c 2008-08-19 06:45:54.732348449 -0700
> +++ linux-2.6/mm/slub.c 2008-08-19 06:46:12.442348249 -0700
> @@ -2312,7 +2312,7 @@ static int kmem_cache_open(struct kmem_c
>
> s->refcount = 1;
> #ifdef CONFIG_NUMA
> - s->remote_node_defrag_ratio = 100;
> + s->remote_node_defrag_ratio = 1000;
> #endif
> if (!init_kmem_cache_nodes(s, gfpflags & ~SLUB_DMA))
> goto error;
> @@ -4058,7 +4058,7 @@ static ssize_t remote_node_defrag_ratio_
> if (err)
> return err;
>
> - if (ratio < 100)
> + if (ratio <= 100)
> s->remote_node_defrag_ratio = ratio * 10;
>
> return length;
end of thread, newest message: 2008-08-20 11:47 UTC
Thread overview: 64+ messages
2008-05-10 2:21 [patch 00/19] Slab Fragmentation Reduction V13 Christoph Lameter
2008-05-10 2:21 ` [patch 01/19] slub: Add defrag_ratio field and sysfs support Christoph Lameter
2008-05-10 2:21 ` [patch 02/19] slub: Replace ctor field with ops field in /sys/slab/* Christoph Lameter
2008-05-10 2:21 ` [patch 03/19] slub: Add get() and kick() methods Christoph Lameter
2008-05-10 2:21 ` [patch 04/19] slub: Sort slab cache list and establish maximum objects for defrag slabs Christoph Lameter
2008-05-10 2:21 ` [patch 05/19] slub: Slab defrag core Christoph Lameter
2008-05-10 2:21 ` [patch 06/19] slub: Add KICKABLE to avoid repeated kick() attempts Christoph Lameter
2008-05-10 2:21 ` [patch 07/19] slub: Extend slabinfo to support -D and -F options Christoph Lameter
2008-05-10 2:21 ` [patch 08/19] slub/slabinfo: add defrag statistics Christoph Lameter
2008-05-10 2:21 ` [patch 09/19] slub: Trigger defragmentation from memory reclaim Christoph Lameter
2008-05-10 2:21 ` [patch 10/19] buffer heads: Support slab defrag Christoph Lameter
2008-05-10 2:21 ` [patch 11/19] inodes: Support generic defragmentation Christoph Lameter
2008-05-10 2:21 ` [patch 12/19] Filesystem: Ext2 filesystem defrag Christoph Lameter
2008-05-10 2:21 ` [patch 13/19] Filesystem: Ext3 " Christoph Lameter
2008-05-10 2:21 ` [patch 14/19] Filesystem: Ext4 " Christoph Lameter
2008-08-03 1:54 ` Theodore Tso
2008-08-13 7:26 ` Pekka Enberg
2008-05-10 2:21 ` [patch 15/19] Filesystem: XFS slab defragmentation Christoph Lameter
2008-08-03 1:42 ` Dave Chinner
2008-08-04 13:36 ` Christoph Lameter
2008-05-10 2:21 ` [patch 16/19] Filesystem: /proc filesystem support for slab defrag Christoph Lameter
2008-05-10 2:21 ` [patch 17/19] Filesystem: Slab defrag: Reiserfs support Christoph Lameter
2008-05-10 2:21 ` [patch 18/19] dentries: Add constructor Christoph Lameter
2008-05-10 2:21 ` [patch 19/19] dentries: dentry defragmentation Christoph Lameter
2008-08-03 1:58 ` No, really, stop trying to delete slab until you've finished making slub perform as well Matthew Wilcox
2008-08-03 21:25 ` Pekka Enberg
2008-08-04 2:37 ` Rene Herman
2008-08-04 21:22 ` Pekka Enberg
2008-08-04 21:41 ` Christoph Lameter
2008-08-04 23:09 ` Rene Herman
2008-08-04 13:43 ` Christoph Lameter
2008-08-04 14:48 ` Jamie Lokier
2008-08-04 15:21 ` Jamie Lokier
2008-08-04 16:35 ` Christoph Lameter
2008-08-04 15:11 ` Rik van Riel
2008-08-04 16:02 ` Christoph Lameter
2008-08-04 16:47 ` KOSAKI Motohiro
2008-08-04 17:13 ` Christoph Lameter
2008-08-04 17:20 ` Pekka Enberg
2008-08-05 12:06 ` KOSAKI Motohiro
2008-08-05 14:59 ` Christoph Lameter
2008-08-06 12:36 ` KOSAKI Motohiro
2008-08-06 14:24 ` Christoph Lameter
2008-08-13 10:46 ` KOSAKI Motohiro
2008-08-13 13:10 ` Christoph Lameter
2008-08-13 14:14 ` KOSAKI Motohiro
2008-08-13 14:16 ` Pekka Enberg
2008-08-13 14:31 ` Christoph Lameter
2008-08-13 15:05 ` KOSAKI Motohiro
2008-08-14 19:44 ` Christoph Lameter
2008-08-15 16:44 ` KOSAKI Motohiro
2008-08-15 18:24 ` Christoph Lameter
2008-08-15 19:42 ` Christoph Lameter
2008-08-18 10:08 ` KOSAKI Motohiro
2008-08-18 10:34 ` KOSAKI Motohiro
2008-08-18 14:08 ` Christoph Lameter
2008-08-19 10:34 ` KOSAKI Motohiro
2008-08-19 13:51 ` Christoph Lameter
2008-08-20 11:46 ` KOSAKI Motohiro
2008-08-14 7:15 ` Pekka Enberg
2008-08-14 14:45 ` Christoph Lameter
2008-08-14 15:06 ` Christoph Lameter
2008-08-04 17:19 ` Christoph Lameter
-- strict thread matches above, loose matches on Subject: below --
2008-08-11 15:06 [patch 00/19] Slab Fragmentation Reduction V14 Christoph Lameter
2008-08-11 15:06 ` [patch 11/19] inodes: Support generic defragmentation Christoph Lameter