linux-fsdevel.vger.kernel.org archive mirror
* [RFC, PATCH 00/19] Numa aware LRU lists and shrinkers
@ 2012-11-27 23:14 Dave Chinner
  2012-11-27 23:14 ` [PATCH 01/19] dcache: convert dentry_stat.nr_unused to per-cpu counters Dave Chinner
                   ` (21 more replies)
  0 siblings, 22 replies; 57+ messages in thread
From: Dave Chinner @ 2012-11-27 23:14 UTC (permalink / raw)
  To: glommer; +Cc: linux-kernel, linux-fsdevel, linux-mm, xfs

Hi Glauber,

Here's a working version of my patchset for generic LRU lists and
NUMA-aware shrinkers.

There are several parts to this patch set. The NUMA aware shrinkers
are based on having a generic node-based LRU list implementation,
and there are subsystems that need to be converted to use these
lists as part of the process. There is also a long overdue change to
the shrinker API to give it separate object count and object scan
callbacks, getting rid of the magic and confusing "nr_to_scan = 0"
semantics.

First of all, the patch set is based on a current 3.7-rc7 tree with
the current xfs dev tree merged into it [can be found at
git://oss.sgi.com/xfs/xfs]. That's because there are lots of XFS
changes in the patch set, and there's no way I'm going to write them
a second time in a couple of weeks when the current dev tree is
merged into 3.8-rc1....

So, here's what the patches do:

[PATCH 01/19] dcache: convert dentry_stat.nr_unused to per-cpu counters
[PATCH 02/19] dentry: move to per-sb LRU locks
[PATCH 03/19] dcache: remove dentries from LRU before putting on dispose list

These three patches prepare the dcache for the move to the generic
LRU list API. They basically get rid of the global dentry LRU lock,
and in doing so have to remove several creative abuses of LRU list
membership detection that allowed dentries on shrink lists to still
magically appear to be on the LRU list. The main change here is that
dentries on the shrink lists *must* now have the DCACHE_SHRINK_LIST
flag set and be entirely removed from the LRU before being disposed
of.

This is probably a good cleanup to do regardless of the rest of the
patch set because it removes a couple of landmines in
shrink_dentry_list() that took me a while to work out...

[PATCH 04/19] mm: new shrinker API
[PATCH 05/19] shrinker: convert superblock shrinkers to new API

These introduce the count/scan shrinker API, and for testing
purposes convert the superblock shrinker to use it before any other
changes are made. This gives a clean separation of counting the
number of objects in a cache for pressure calculations, and the act
of scanning objects in an attempt to free memory. Indeed, the
scan_objects() callback now returns the number of objects freed by
the scan instead of having to try to work out whether any progress
was made by comparing absolute counts.

This is also more efficient as we don't have to count all the
objects in a cache on every scan pass. It is now done once per
shrink_slab() invocation to calculate how much to scan, and we get
direct feedback on how much gets reclaimed in that pass. i.e. we get
reliable, accurate feedback on shrinker progress.
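
To illustrate, a shrinker converted to the new API ends up looking
roughly like this (a minimal sketch based on the struct shrinker
definition in patch 04 - the my_cache_* names are made up for the
example):

#include <linux/shrinker.h>

static long
my_cache_count(struct shrinker *shrink, struct shrink_control *sc)
{
	/* cheap count of freeable objects - no scanning, no deadlock checks */
	return my_cache_nr_freeable();
}

static long
my_cache_scan(struct shrinker *shrink, struct shrink_control *sc)
{
	if (!(sc->gfp_mask & __GFP_FS))
		return -1;	/* can't make progress in this context */

	/* free up to sc->nr_to_scan objects, return how many were freed */
	return my_cache_free_some(sc->nr_to_scan);
}

static struct shrinker my_cache_shrinker = {
	.count_objects	= my_cache_count,
	.scan_objects	= my_cache_scan,
	.seeks		= DEFAULT_SEEKS,
};

register_shrinker() is unchanged - shrink_slab() calls
->count_objects() once to work out the pressure and then calls
->scan_objects() in batch sized chunks until the scan target is met.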

[PATCH 06/19] list: add a new LRU list type
[PATCH 07/19] inode: convert inode lru list to generic lru list
[PATCH 08/19] dcache: convert to use new lru list infrastructure

These add the generic LRU list API and infrastructure and convert
the inode and dentry caches to use it. This is still just a single
global list per LRU at this point, so it's really only changing
where the LRU implementation lives rather than the fundamental
algorithm. It does, however, introduce a new method of walking the
LRU lists and building the dispose list of items for shrinkers, but
because we are still dealing with a global list the algorithmic
changes are minimal.
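
As a rough sketch of how a cache ends up using that API (based on
the list_lru code in patch 06; my_isolate and my_prune are made-up
names - the real conversions are in patches 07 and 08):

#include <linux/list.h>
#include <linux/list_lru.h>

static struct list_lru my_lru;	/* list_lru_init(&my_lru) at cache init */

/* isolate callback: move one unused object onto the private dispose list */
static int
my_isolate(struct list_head *item, spinlock_t *lru_lock, void *arg)
{
	struct list_head *dispose = arg;

	list_move(item, dispose);
	return 0;	/* 0 == removed from the LRU */
}

static long
my_prune(long nr_to_scan)
{
	LIST_HEAD(dispose);
	long freed;

	freed = list_lru_walk(&my_lru, my_isolate, &dispose, nr_to_scan);
	/* ... tear down everything on the private dispose list here ... */
	return freed;
}

Objects go onto the LRU with list_lru_add() when they become unused
and come off it with list_lru_del() when they are referenced again;
both return whether they actually modified the list, which is what
the callers use to keep their counters in sync.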

[PATCH 09/19] list_lru: per-node list infrastructure

This makes the generic LRU list much more scalable by changing it to
a {list,lock,count} tuple per node. There are no external API
changes in this changeover, so it is transparent to current users.
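
In other words, the structure goes from the single {lock, list,
count} of patch 06 to something shaped roughly like this (a sketch
of the shape described above, not the actual patch - the field names
are illustrative):

struct list_lru_node {
	spinlock_t		lock;
	struct list_head	list;
	long			nr_items;
} ____cacheline_aligned_in_smp;

struct list_lru {
	struct list_lru_node	node[MAX_NUMNODES];
	nodemask_t		active_nodes;	/* nodes with items on them */
};

list_lru_count() then becomes a sum over the per-node counts, and
the add/del/walk operations only take the lock of the node they are
operating on.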

[PATCH 10/19] shrinker: add node awareness
[PATCH 11/19] fs: convert inode and dentry shrinking to be node

Adds a nodemask to struct shrink_control so that callers of
shrink_slab() can set it appropriately for their reclaim context. This
nodemask is then passed by the inode and dentry cache reclaim code
to the generic LRU list code to implement node aware shrinking.
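
i.e. the shrink_control from patch 04 grows a nodemask along these
lines (sketch only - the field name is illustrative, not necessarily
what the patch uses):

struct shrink_control {
	gfp_t		gfp_mask;

	/* How many slab objects shrinker() should scan and try to reclaim */
	long		nr_to_scan;

	/* only scan objects on these nodes */
	nodemask_t	nodes_to_scan;
};

Page reclaim sets the node it is currently scanning (e.g.
node_set(zone_to_nid(zone), sc.nodes_to_scan)) and the LRU walks
then only touch the matching per-node lists.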

What this doesn't do is convert the internal shrink_slab() algorithm
to be node aware. I'm not sure what the best approach is here, but
it strikes me that it should really be calculating and keeping track
of scan counts and pressure on a per-node basis. The current code
seems to work OK at the moment, though.

[PATCH 12/19] xfs: convert buftarg LRU to generic code
[PATCH 13/19] xfs: Node aware direct inode reclaim
[PATCH 14/19] xfs: use generic AG walk for background inode reclaim
[PATCH 15/19] xfs: convert dquot cache lru to list_lru

These patches convert all the XFS LRUs and shrinkers to be node
aware. This gets rid of a lot of hairy, special case code in the
inode cache shrinker for avoiding concurrent shrinker contention and
to throttle direct reclaim to prevent premature OOM conditions.
Removing this code greatly simplifies inode cache reclaim whilst
reducing overhead and improving performance. In all, it converts
three separate caches and shrinkers to use the generic LRU lists and
pass nodemasks around appropriately.

This is how I've really tested the code - lots of interesting
filesystem workloads that generate simultaneous slab and page cache
pressure on VMs with and without fake_numa configs....

[PATCH 16/19] fs: convert fs shrinkers to new scan/count API
[PATCH 17/19] drivers: convert shrinkers to new count/scan API
[PATCH 18/19] shrinker: convert remaining shrinkers to count/scan
[PATCH 19/19] shrinker: Kill old ->shrink API.

These last three patches convert all the other shrinker
implementations to the new count/scan API.  The fs, android and dm
shrinkers are pretty well behaved and are implemented as expected
for their intended purposes. The driver and staging code, however,
is basically a bunch of hacks to try to do something resembling
reclaim when a shrinker tells it there is memory pressure. Looking
at all the driver and staging code is not an exercise I recommend if
you value your eyes and/or your sanity.

I haven't even tried to compile this on a CONFIG_SMP=n
configuration, nor have I done extensive allmod style build tests
and it's only been built and tested on x86-64. That said, apart from
CONFIG_SMP=n, I don't see there being any major problems here.

There's still a bunch of cleanup work needed. e.g. the LRU list
walk/isolation code needs to use enums for the isolate callback
return code, there needs to be a generic list_lru_for_each() style
function for walking all the objects in the cache (which will allow
the list_lru structures to be used for things like the per-sb inode
list). Indeed, even the name "list_lru" is probably something that
should be changed - I think the list has become more of a general
per-node list than the scalable LRU list it started out as, and I
can see uses for it outside of LRUs...
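
For the current single-list form in patch 06, such a walker would be
something as simple as this (sketch only, not part of this series):

typedef void (*list_lru_for_each_cb)(struct list_head *item, void *cb_arg);

static void
list_lru_for_each(struct list_lru *lru, list_lru_for_each_cb cb, void *cb_arg)
{
	struct list_head *item, *n;

	spin_lock(&lru->lock);
	list_for_each_safe(item, n, &lru->list)
		cb(item, cb_arg);
	spin_unlock(&lru->lock);
}

though a per-node version would obviously have to iterate all the
active nodes, and callbacks that can block would need the same sort
of lock-dropping protocol that list_lru_walk() uses.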

Comments, thoughts and flames all welcome.

Cheers,

Dave.


* [PATCH 01/19] dcache: convert dentry_stat.nr_unused to per-cpu counters
  2012-11-27 23:14 [RFC, PATCH 00/19] Numa aware LRU lists and shrinkers Dave Chinner
@ 2012-11-27 23:14 ` Dave Chinner
  2012-11-27 23:14 ` [PATCH 02/19] dentry: move to per-sb LRU locks Dave Chinner
                   ` (20 subsequent siblings)
  21 siblings, 0 replies; 57+ messages in thread
From: Dave Chinner @ 2012-11-27 23:14 UTC (permalink / raw)
  To: glommer; +Cc: linux-kernel, linux-fsdevel, linux-mm, xfs

From: Dave Chinner <dchinner@redhat.com>

Before we split up the dcache_lru_lock, the unused dentry counter
needs to be made independent of the global dcache_lru_lock. Convert
it to per-cpu counters to do this.

Signed-off-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
---
 fs/dcache.c |   17 ++++++++++++++---
 1 file changed, 14 insertions(+), 3 deletions(-)

diff --git a/fs/dcache.c b/fs/dcache.c
index 3a463d0..2fc0daa 100644
--- a/fs/dcache.c
+++ b/fs/dcache.c
@@ -118,6 +118,7 @@ struct dentry_stat_t dentry_stat = {
 };
 
 static DEFINE_PER_CPU(unsigned int, nr_dentry);
+static DEFINE_PER_CPU(unsigned int, nr_dentry_unused);
 
 #if defined(CONFIG_SYSCTL) && defined(CONFIG_PROC_FS)
 static int get_nr_dentry(void)
@@ -129,10 +130,20 @@ static int get_nr_dentry(void)
 	return sum < 0 ? 0 : sum;
 }
 
+static int get_nr_dentry_unused(void)
+{
+	int i;
+	int sum = 0;
+	for_each_possible_cpu(i)
+		sum += per_cpu(nr_dentry_unused, i);
+	return sum < 0 ? 0 : sum;
+}
+
 int proc_nr_dentry(ctl_table *table, int write, void __user *buffer,
 		   size_t *lenp, loff_t *ppos)
 {
 	dentry_stat.nr_dentry = get_nr_dentry();
+	dentry_stat.nr_unused = get_nr_dentry_unused();
 	return proc_dointvec(table, write, buffer, lenp, ppos);
 }
 #endif
@@ -312,7 +323,7 @@ static void dentry_lru_add(struct dentry *dentry)
 		spin_lock(&dcache_lru_lock);
 		list_add(&dentry->d_lru, &dentry->d_sb->s_dentry_lru);
 		dentry->d_sb->s_nr_dentry_unused++;
-		dentry_stat.nr_unused++;
+		this_cpu_inc(nr_dentry_unused);
 		spin_unlock(&dcache_lru_lock);
 	}
 }
@@ -322,7 +333,7 @@ static void __dentry_lru_del(struct dentry *dentry)
 	list_del_init(&dentry->d_lru);
 	dentry->d_flags &= ~DCACHE_SHRINK_LIST;
 	dentry->d_sb->s_nr_dentry_unused--;
-	dentry_stat.nr_unused--;
+	this_cpu_dec(nr_dentry_unused);
 }
 
 /*
@@ -360,7 +371,7 @@ static void dentry_lru_move_list(struct dentry *dentry, struct list_head *list)
 	if (list_empty(&dentry->d_lru)) {
 		list_add_tail(&dentry->d_lru, list);
 		dentry->d_sb->s_nr_dentry_unused++;
-		dentry_stat.nr_unused++;
+		this_cpu_inc(nr_dentry_unused);
 	} else {
 		list_move_tail(&dentry->d_lru, list);
 	}
-- 
1.7.10


* [PATCH 02/19] dentry: move to per-sb LRU locks
  2012-11-27 23:14 [RFC, PATCH 00/19] Numa aware LRU lists and shrinkers Dave Chinner
  2012-11-27 23:14 ` [PATCH 01/19] dcache: convert dentry_stat.nr_unused to per-cpu counters Dave Chinner
@ 2012-11-27 23:14 ` Dave Chinner
  2012-11-27 23:14 ` [PATCH 03/19] dcache: remove dentries from LRU before putting on dispose list Dave Chinner
                   ` (19 subsequent siblings)
  21 siblings, 0 replies; 57+ messages in thread
From: Dave Chinner @ 2012-11-27 23:14 UTC (permalink / raw)
  To: glommer; +Cc: linux-kernel, linux-fsdevel, linux-mm, xfs

From: Dave Chinner <dchinner@redhat.com>

With the dentry LRUs being per-sb structures, there is no real need
for a global dentry_lru_lock. The locking can be made more
fine-grained by moving to a per-sb LRU lock, isolating the LRU
operations of different filesystems completely from each other.

Signed-off-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
---
 fs/dcache.c        |   37 ++++++++++++++++++-------------------
 fs/super.c         |    1 +
 include/linux/fs.h |    4 +++-
 3 files changed, 22 insertions(+), 20 deletions(-)

diff --git a/fs/dcache.c b/fs/dcache.c
index 2fc0daa..e0c97fe 100644
--- a/fs/dcache.c
+++ b/fs/dcache.c
@@ -48,7 +48,7 @@
  *   - the dcache hash table
  * s_anon bl list spinlock protects:
  *   - the s_anon list (see __d_drop)
- * dcache_lru_lock protects:
+ * dentry->d_sb->s_dentry_lru_lock protects:
  *   - the dcache lru lists and counters
  * d_lock protects:
  *   - d_flags
@@ -63,7 +63,7 @@
  * Ordering:
  * dentry->d_inode->i_lock
  *   dentry->d_lock
- *     dcache_lru_lock
+ *     dentry->d_sb->s_dentry_lru_lock
  *     dcache_hash_bucket lock
  *     s_anon lock
  *
@@ -81,7 +81,6 @@
 int sysctl_vfs_cache_pressure __read_mostly = 100;
 EXPORT_SYMBOL_GPL(sysctl_vfs_cache_pressure);
 
-static __cacheline_aligned_in_smp DEFINE_SPINLOCK(dcache_lru_lock);
 __cacheline_aligned_in_smp DEFINE_SEQLOCK(rename_lock);
 
 EXPORT_SYMBOL(rename_lock);
@@ -320,11 +319,11 @@ static void dentry_unlink_inode(struct dentry * dentry)
 static void dentry_lru_add(struct dentry *dentry)
 {
 	if (list_empty(&dentry->d_lru)) {
-		spin_lock(&dcache_lru_lock);
+		spin_lock(&dentry->d_sb->s_dentry_lru_lock);
 		list_add(&dentry->d_lru, &dentry->d_sb->s_dentry_lru);
 		dentry->d_sb->s_nr_dentry_unused++;
 		this_cpu_inc(nr_dentry_unused);
-		spin_unlock(&dcache_lru_lock);
+		spin_unlock(&dentry->d_sb->s_dentry_lru_lock);
 	}
 }
 
@@ -342,9 +341,9 @@ static void __dentry_lru_del(struct dentry *dentry)
 static void dentry_lru_del(struct dentry *dentry)
 {
 	if (!list_empty(&dentry->d_lru)) {
-		spin_lock(&dcache_lru_lock);
+		spin_lock(&dentry->d_sb->s_dentry_lru_lock);
 		__dentry_lru_del(dentry);
-		spin_unlock(&dcache_lru_lock);
+		spin_unlock(&dentry->d_sb->s_dentry_lru_lock);
 	}
 }
 
@@ -359,15 +358,15 @@ static void dentry_lru_prune(struct dentry *dentry)
 		if (dentry->d_flags & DCACHE_OP_PRUNE)
 			dentry->d_op->d_prune(dentry);
 
-		spin_lock(&dcache_lru_lock);
+		spin_lock(&dentry->d_sb->s_dentry_lru_lock);
 		__dentry_lru_del(dentry);
-		spin_unlock(&dcache_lru_lock);
+		spin_unlock(&dentry->d_sb->s_dentry_lru_lock);
 	}
 }
 
 static void dentry_lru_move_list(struct dentry *dentry, struct list_head *list)
 {
-	spin_lock(&dcache_lru_lock);
+	spin_lock(&dentry->d_sb->s_dentry_lru_lock);
 	if (list_empty(&dentry->d_lru)) {
 		list_add_tail(&dentry->d_lru, list);
 		dentry->d_sb->s_nr_dentry_unused++;
@@ -375,7 +374,7 @@ static void dentry_lru_move_list(struct dentry *dentry, struct list_head *list)
 	} else {
 		list_move_tail(&dentry->d_lru, list);
 	}
-	spin_unlock(&dcache_lru_lock);
+	spin_unlock(&dentry->d_sb->s_dentry_lru_lock);
 }
 
 /**
@@ -879,14 +878,14 @@ void prune_dcache_sb(struct super_block *sb, int count)
 	LIST_HEAD(tmp);
 
 relock:
-	spin_lock(&dcache_lru_lock);
+	spin_lock(&sb->s_dentry_lru_lock);
 	while (!list_empty(&sb->s_dentry_lru)) {
 		dentry = list_entry(sb->s_dentry_lru.prev,
 				struct dentry, d_lru);
 		BUG_ON(dentry->d_sb != sb);
 
 		if (!spin_trylock(&dentry->d_lock)) {
-			spin_unlock(&dcache_lru_lock);
+			spin_unlock(&sb->s_dentry_lru_lock);
 			cpu_relax();
 			goto relock;
 		}
@@ -902,11 +901,11 @@ relock:
 			if (!--count)
 				break;
 		}
-		cond_resched_lock(&dcache_lru_lock);
+		cond_resched_lock(&sb->s_dentry_lru_lock);
 	}
 	if (!list_empty(&referenced))
 		list_splice(&referenced, &sb->s_dentry_lru);
-	spin_unlock(&dcache_lru_lock);
+	spin_unlock(&sb->s_dentry_lru_lock);
 
 	shrink_dentry_list(&tmp);
 }
@@ -922,14 +921,14 @@ void shrink_dcache_sb(struct super_block *sb)
 {
 	LIST_HEAD(tmp);
 
-	spin_lock(&dcache_lru_lock);
+	spin_lock(&sb->s_dentry_lru_lock);
 	while (!list_empty(&sb->s_dentry_lru)) {
 		list_splice_init(&sb->s_dentry_lru, &tmp);
-		spin_unlock(&dcache_lru_lock);
+		spin_unlock(&sb->s_dentry_lru_lock);
 		shrink_dentry_list(&tmp);
-		spin_lock(&dcache_lru_lock);
+		spin_lock(&sb->s_dentry_lru_lock);
 	}
-	spin_unlock(&dcache_lru_lock);
+	spin_unlock(&sb->s_dentry_lru_lock);
 }
 EXPORT_SYMBOL(shrink_dcache_sb);
 
diff --git a/fs/super.c b/fs/super.c
index 12f1237..21abf02 100644
--- a/fs/super.c
+++ b/fs/super.c
@@ -182,6 +182,7 @@ static struct super_block *alloc_super(struct file_system_type *type, int flags)
 		INIT_HLIST_BL_HEAD(&s->s_anon);
 		INIT_LIST_HEAD(&s->s_inodes);
 		INIT_LIST_HEAD(&s->s_dentry_lru);
+		spin_lock_init(&s->s_dentry_lru_lock);
 		INIT_LIST_HEAD(&s->s_inode_lru);
 		spin_lock_init(&s->s_inode_lru_lock);
 		INIT_LIST_HEAD(&s->s_mounts);
diff --git a/include/linux/fs.h b/include/linux/fs.h
index b33cfc9..0845283 100644
--- a/include/linux/fs.h
+++ b/include/linux/fs.h
@@ -1262,7 +1262,9 @@ struct super_block {
 	struct list_head	s_files;
 #endif
 	struct list_head	s_mounts;	/* list of mounts; _not_ for fs use */
-	/* s_dentry_lru, s_nr_dentry_unused protected by dcache.c lru locks */
+
+	/* s_dentry_lru_lock protects s_dentry_lru and s_nr_dentry_unused */
+	spinlock_t		s_dentry_lru_lock ____cacheline_aligned_in_smp;
 	struct list_head	s_dentry_lru;	/* unused dentry lru */
 	int			s_nr_dentry_unused;	/* # of dentry on lru */
 
-- 
1.7.10


* [PATCH 03/19] dcache: remove dentries from LRU before putting on dispose list
  2012-11-27 23:14 [RFC, PATCH 00/19] Numa aware LRU lists and shrinkers Dave Chinner
  2012-11-27 23:14 ` [PATCH 01/19] dcache: convert dentry_stat.nr_unused to per-cpu counters Dave Chinner
  2012-11-27 23:14 ` [PATCH 02/19] dentry: move to per-sb LRU locks Dave Chinner
@ 2012-11-27 23:14 ` Dave Chinner
  2012-11-27 23:14 ` [PATCH 04/19] mm: new shrinker API Dave Chinner
                   ` (18 subsequent siblings)
  21 siblings, 0 replies; 57+ messages in thread
From: Dave Chinner @ 2012-11-27 23:14 UTC (permalink / raw)
  To: glommer; +Cc: linux-kernel, linux-fsdevel, linux-mm, xfs

From: Dave Chinner <dchinner@redhat.com>

One of the big problems with modifying the way the dcache shrinker
and LRU implementation works is that the LRU is abused in several
ways. One of these is shrink_dentry_list().

Basically, we can move a dentry off the LRU onto a different list
without doing any accounting changes, and then use dentry_lru_prune()
to remove it from whatever list it is now on to do the LRU
accounting at that point.

This makes it -really hard- to change the LRU implementation. The
use of the per-sb LRU lock serialises movement of the dentries
between the different lists and the removal of them, and this is the
only reason that it works. If we want to break up the dentry LRU
lock and lists into, say, per-node lists, we remove the only
serialisation that allows this lru list/dispose list abuse to work.

To make this work effectively, the dispose list has to be isolated
from the LRU list - dentries have to be removed from the LRU
*before* being placed on the dispose list. This means that the LRU
accounting and isolation is completed before disposal is started,
and that means we can change the LRU implementation freely in
future.

This means that dentries *must* be marked with DCACHE_SHRINK_LIST
when they are placed on the dispose list so that we don't think that
parent dentries found in try_prune_one_dentry() are on the LRU when
they are actually on the dispose list. This would result in
accounting the dentry to the LRU a second time. Hence
dentry_lru_prune() has to handle the DCACHE_SHRINK_LIST case
differently because the dentry isn't on the LRU list.

Signed-off-by: Dave Chinner <dchinner@redhat.com>
---
 fs/dcache.c |   73 +++++++++++++++++++++++++++++++++++++++++++++++++++--------
 1 file changed, 63 insertions(+), 10 deletions(-)

diff --git a/fs/dcache.c b/fs/dcache.c
index e0c97fe..0124a84 100644
--- a/fs/dcache.c
+++ b/fs/dcache.c
@@ -330,7 +330,6 @@ static void dentry_lru_add(struct dentry *dentry)
 static void __dentry_lru_del(struct dentry *dentry)
 {
 	list_del_init(&dentry->d_lru);
-	dentry->d_flags &= ~DCACHE_SHRINK_LIST;
 	dentry->d_sb->s_nr_dentry_unused--;
 	this_cpu_dec(nr_dentry_unused);
 }
@@ -340,6 +339,8 @@ static void __dentry_lru_del(struct dentry *dentry)
  */
 static void dentry_lru_del(struct dentry *dentry)
 {
+	BUG_ON(dentry->d_flags & DCACHE_SHRINK_LIST);
+
 	if (!list_empty(&dentry->d_lru)) {
 		spin_lock(&dentry->d_sb->s_dentry_lru_lock);
 		__dentry_lru_del(dentry);
@@ -351,28 +352,42 @@ static void dentry_lru_del(struct dentry *dentry)
  * Remove a dentry that is unreferenced and about to be pruned
  * (unhashed and destroyed) from the LRU, and inform the file system.
  * This wrapper should be called _prior_ to unhashing a victim dentry.
+ *
+ * Check that the dentry really is on the LRU as it may be on a private dispose
+ * list and in that case we do not want to call the generic LRU removal
+ * functions. This typically happens when shrink_dcache_sb() clears the LRU in
+ * one go and then try_prune_one_dentry() walks back up the parent chain finding
+ * dentries that are also on the dispose list.
  */
 static void dentry_lru_prune(struct dentry *dentry)
 {
 	if (!list_empty(&dentry->d_lru)) {
+
 		if (dentry->d_flags & DCACHE_OP_PRUNE)
 			dentry->d_op->d_prune(dentry);
 
-		spin_lock(&dentry->d_sb->s_dentry_lru_lock);
-		__dentry_lru_del(dentry);
-		spin_unlock(&dentry->d_sb->s_dentry_lru_lock);
+		if ((dentry->d_flags & DCACHE_SHRINK_LIST))
+			list_del_init(&dentry->d_lru);
+		else {
+			spin_lock(&dentry->d_sb->s_dentry_lru_lock);
+			__dentry_lru_del(dentry);
+			spin_unlock(&dentry->d_sb->s_dentry_lru_lock);
+		}
+		dentry->d_flags &= ~DCACHE_SHRINK_LIST;
 	}
 }
 
 static void dentry_lru_move_list(struct dentry *dentry, struct list_head *list)
 {
+	BUG_ON(dentry->d_flags & DCACHE_SHRINK_LIST);
+
 	spin_lock(&dentry->d_sb->s_dentry_lru_lock);
 	if (list_empty(&dentry->d_lru)) {
 		list_add_tail(&dentry->d_lru, list);
-		dentry->d_sb->s_nr_dentry_unused++;
-		this_cpu_inc(nr_dentry_unused);
 	} else {
 		list_move_tail(&dentry->d_lru, list);
+		dentry->d_sb->s_nr_dentry_unused--;
+		this_cpu_dec(nr_dentry_unused);
 	}
 	spin_unlock(&dentry->d_sb->s_dentry_lru_lock);
 }
@@ -840,12 +855,18 @@ static void shrink_dentry_list(struct list_head *list)
 		}
 
 		/*
+		 * The dispose list is isolated and dentries are not accounted
+		 * to the LRU here, so we can simply remove it from the list
+		 * here regardless of whether it is referenced or not.
+		 */
+		list_del_init(&dentry->d_lru);
+
+		/*
 		 * We found an inuse dentry which was not removed from
-		 * the LRU because of laziness during lookup.  Do not free
-		 * it - just keep it off the LRU list.
+		 * the LRU because of laziness during lookup. Do not free it.
 		 */
 		if (dentry->d_count) {
-			dentry_lru_del(dentry);
+			dentry->d_flags &= ~DCACHE_SHRINK_LIST;
 			spin_unlock(&dentry->d_lock);
 			continue;
 		}
@@ -897,6 +918,8 @@ relock:
 		} else {
 			list_move_tail(&dentry->d_lru, &tmp);
 			dentry->d_flags |= DCACHE_SHRINK_LIST;
+			this_cpu_dec(nr_dentry_unused);
+			sb->s_nr_dentry_unused--;
 			spin_unlock(&dentry->d_lock);
 			if (!--count)
 				break;
@@ -910,6 +933,28 @@ relock:
 	shrink_dentry_list(&tmp);
 }
 
+/*
+ * Mark all the dentries as being on the dispose list so we don't think they are
+ * still on the LRU if we try to kill them from ascending the parent chain in
+ * try_prune_one_dentry() rather than directly from the dispose list.
+ */
+static void
+shrink_dcache_list(
+	struct list_head *dispose)
+{
+	struct dentry *dentry;
+
+	rcu_read_lock();
+	list_for_each_entry_rcu(dentry, dispose, d_lru) {
+		spin_lock(&dentry->d_lock);
+		dentry->d_flags |= DCACHE_SHRINK_LIST;
+		this_cpu_dec(nr_dentry_unused);
+		spin_unlock(&dentry->d_lock);
+	}
+	rcu_read_unlock();
+	shrink_dentry_list(dispose);
+}
+
 /**
  * shrink_dcache_sb - shrink dcache for a superblock
  * @sb: superblock
@@ -924,8 +969,16 @@ void shrink_dcache_sb(struct super_block *sb)
 	spin_lock(&sb->s_dentry_lru_lock);
 	while (!list_empty(&sb->s_dentry_lru)) {
 		list_splice_init(&sb->s_dentry_lru, &tmp);
+
+		/*
+		 * account for removal here so we don't need to handle it later
+		 * even though the dentry is no longer on the lru list.
+		 */
+		this_cpu_sub(nr_dentry_unused, sb->s_nr_dentry_unused);
+		sb->s_nr_dentry_unused = 0;
+
 		spin_unlock(&sb->s_dentry_lru_lock);
-		shrink_dentry_list(&tmp);
+		shrink_dcache_list(&tmp);
 		spin_lock(&sb->s_dentry_lru_lock);
 	}
 	spin_unlock(&sb->s_dentry_lru_lock);
-- 
1.7.10


* [PATCH 04/19] mm: new shrinker API
  2012-11-27 23:14 [RFC, PATCH 00/19] Numa aware LRU lists and shrinkers Dave Chinner
                   ` (2 preceding siblings ...)
  2012-11-27 23:14 ` [PATCH 03/19] dcache: remove dentries from LRU before putting on dispose list Dave Chinner
@ 2012-11-27 23:14 ` Dave Chinner
  2012-11-27 23:14 ` [PATCH 05/19] shrinker: convert superblock shrinkers to new API Dave Chinner
                   ` (17 subsequent siblings)
  21 siblings, 0 replies; 57+ messages in thread
From: Dave Chinner @ 2012-11-27 23:14 UTC (permalink / raw)
  To: glommer; +Cc: linux-kernel, linux-fsdevel, linux-mm, xfs

From: Dave Chinner <dchinner@redhat.com>

The current shrinker callout API uses a single shrinker call for
multiple functions. To determine the function, a special magical
value is passed in a parameter to change the behaviour. This
complicates the implementation and return value specification for
the different behaviours.

Separate the two different behaviours into separate operations, one
to return a count of freeable objects in the cache, and another to
scan a certain number of objects in the cache for freeing. In
defining these new operations, ensure the return values and
resultant behaviours are clearly defined and documented.

Modify shrink_slab() to use the new API and implement the callouts
for all the existing shrinkers.

Signed-off-by: Dave Chinner <dchinner@redhat.com>
---
 include/linux/shrinker.h |   37 ++++++++++++++++++++++++----------
 mm/vmscan.c              |   50 +++++++++++++++++++++++++++++-----------------
 2 files changed, 59 insertions(+), 28 deletions(-)

diff --git a/include/linux/shrinker.h b/include/linux/shrinker.h
index ac6b8ee..4f59615 100644
--- a/include/linux/shrinker.h
+++ b/include/linux/shrinker.h
@@ -4,31 +4,47 @@
 /*
  * This struct is used to pass information from page reclaim to the shrinkers.
  * We consolidate the values for easier extention later.
+ *
+ * The 'gfpmask' refers to the allocation we are currently trying to
+ * fulfil.
+ *
+ * Note that 'shrink' will be passed nr_to_scan == 0 when the VM is
+ * querying the cache size, so a fastpath for that case is appropriate.
  */
 struct shrink_control {
 	gfp_t gfp_mask;
 
 	/* How many slab objects shrinker() should scan and try to reclaim */
-	unsigned long nr_to_scan;
+	long nr_to_scan;
 };
 
 /*
  * A callback you can register to apply pressure to ageable caches.
  *
- * 'sc' is passed shrink_control which includes a count 'nr_to_scan'
- * and a 'gfpmask'.  It should look through the least-recently-used
- * 'nr_to_scan' entries and attempt to free them up.  It should return
- * the number of objects which remain in the cache.  If it returns -1, it means
- * it cannot do any scanning at this time (eg. there is a risk of deadlock).
+ * @shrink() should look through the least-recently-used 'nr_to_scan' entries
+ * and attempt to free them up.  It should return the number of objects which
+ * remain in the cache.  If it returns -1, it means it cannot do any scanning at
+ * this time (eg. there is a risk of deadlock).
  *
- * The 'gfpmask' refers to the allocation we are currently trying to
- * fulfil.
+ * @count_objects should return the number of freeable items in the cache. If
+ * there are no objects to free or the number of freeable items cannot be
+ * determined, it should return 0. No deadlock checks should be done during the
+ * count callback - the shrinker relies on aggregating scan counts that couldn't
+ * be executed due to potential deadlocks to be run at a later call when the
+ * deadlock condition is no longer pending.
  *
- * Note that 'shrink' will be passed nr_to_scan == 0 when the VM is
- * querying the cache size, so a fastpath for that case is appropriate.
+ * @scan_objects will only be called if @count_objects returned a positive
+ * value for the number of freeable objects. The callout should scan the cache
+ * and attempt to free items from the cache. It should then return the number of
+ * objects freed during the scan, or -1 if progress cannot be made due to
+ * potential deadlocks. If -1 is returned, then no further attempts to call the
+ * @scan_objects will be made from the current reclaim context.
  */
 struct shrinker {
 	int (*shrink)(struct shrinker *, struct shrink_control *sc);
+	long (*count_objects)(struct shrinker *, struct shrink_control *sc);
+	long (*scan_objects)(struct shrinker *, struct shrink_control *sc);
+
 	int seeks;	/* seeks to recreate an obj */
 	long batch;	/* reclaim batch size, 0 = default */
 
@@ -36,6 +52,7 @@ struct shrinker {
 	struct list_head list;
 	atomic_long_t nr_in_batch; /* objs pending delete */
 };
+
 #define DEFAULT_SEEKS 2 /* A good number if you don't know better. */
 extern void register_shrinker(struct shrinker *);
 extern void unregister_shrinker(struct shrinker *);
diff --git a/mm/vmscan.c b/mm/vmscan.c
index 48550c6..55c4fc9 100644
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -204,19 +204,19 @@ static inline int do_shrinker_shrink(struct shrinker *shrinker,
  *
  * Returns the number of slab objects which we shrunk.
  */
-unsigned long shrink_slab(struct shrink_control *shrink,
+unsigned long shrink_slab(struct shrink_control *sc,
 			  unsigned long nr_pages_scanned,
 			  unsigned long lru_pages)
 {
 	struct shrinker *shrinker;
-	unsigned long ret = 0;
+	unsigned long freed = 0;
 
 	if (nr_pages_scanned == 0)
 		nr_pages_scanned = SWAP_CLUSTER_MAX;
 
 	if (!down_read_trylock(&shrinker_rwsem)) {
 		/* Assume we'll be able to shrink next time */
-		ret = 1;
+		freed = 1;
 		goto out;
 	}
 
@@ -224,13 +224,16 @@ unsigned long shrink_slab(struct shrink_control *shrink,
 		unsigned long long delta;
 		long total_scan;
 		long max_pass;
-		int shrink_ret = 0;
 		long nr;
 		long new_nr;
 		long batch_size = shrinker->batch ? shrinker->batch
 						  : SHRINK_BATCH;
 
-		max_pass = do_shrinker_shrink(shrinker, shrink, 0);
+		if (shrinker->scan_objects) {
+			max_pass = shrinker->count_objects(shrinker, sc);
+			WARN_ON(max_pass < 0);
+		} else
+			max_pass = do_shrinker_shrink(shrinker, sc, 0);
 		if (max_pass <= 0)
 			continue;
 
@@ -247,8 +250,8 @@ unsigned long shrink_slab(struct shrink_control *shrink,
 		do_div(delta, lru_pages + 1);
 		total_scan += delta;
 		if (total_scan < 0) {
-			printk(KERN_ERR "shrink_slab: %pF negative objects to "
-			       "delete nr=%ld\n",
+			printk(KERN_ERR
+			"shrink_slab: %pF negative objects to delete nr=%ld\n",
 			       shrinker->shrink, total_scan);
 			total_scan = max_pass;
 		}
@@ -276,20 +279,31 @@ unsigned long shrink_slab(struct shrink_control *shrink,
 		if (total_scan > max_pass * 2)
 			total_scan = max_pass * 2;
 
-		trace_mm_shrink_slab_start(shrinker, shrink, nr,
+		trace_mm_shrink_slab_start(shrinker, sc, nr,
 					nr_pages_scanned, lru_pages,
 					max_pass, delta, total_scan);
 
 		while (total_scan >= batch_size) {
-			int nr_before;
+			long ret;
 
-			nr_before = do_shrinker_shrink(shrinker, shrink, 0);
-			shrink_ret = do_shrinker_shrink(shrinker, shrink,
-							batch_size);
-			if (shrink_ret == -1)
-				break;
-			if (shrink_ret < nr_before)
-				ret += nr_before - shrink_ret;
+			if (shrinker->scan_objects) {
+				sc->nr_to_scan = batch_size;
+				ret = shrinker->scan_objects(shrinker, sc);
+
+				if (ret == -1)
+					break;
+				freed += ret;
+			} else {
+				int nr_before;
+
+				nr_before = do_shrinker_shrink(shrinker, sc, 0);
+				ret = do_shrinker_shrink(shrinker, sc,
+								batch_size);
+				if (ret == -1)
+					break;
+				if (ret < nr_before)
+					freed += nr_before - ret;
+			}
 			count_vm_events(SLABS_SCANNED, batch_size);
 			total_scan -= batch_size;
 
@@ -307,12 +321,12 @@ unsigned long shrink_slab(struct shrink_control *shrink,
 		else
 			new_nr = atomic_long_read(&shrinker->nr_in_batch);
 
-		trace_mm_shrink_slab_end(shrinker, shrink_ret, nr, new_nr);
+		trace_mm_shrink_slab_end(shrinker, freed, nr, new_nr);
 	}
 	up_read(&shrinker_rwsem);
 out:
 	cond_resched();
-	return ret;
+	return freed;
 }
 
 static inline int is_page_cache_freeable(struct page *page)
-- 
1.7.10


* [PATCH 05/19] shrinker: convert superblock shrinkers to new API
  2012-11-27 23:14 [RFC, PATCH 00/19] Numa aware LRU lists and shrinkers Dave Chinner
                   ` (3 preceding siblings ...)
  2012-11-27 23:14 ` [PATCH 04/19] mm: new shrinker API Dave Chinner
@ 2012-11-27 23:14 ` Dave Chinner
  2012-12-20 11:06   ` Glauber Costa
  2012-11-27 23:14 ` [PATCH 06/19] list: add a new LRU list type Dave Chinner
                   ` (16 subsequent siblings)
  21 siblings, 1 reply; 57+ messages in thread
From: Dave Chinner @ 2012-11-27 23:14 UTC (permalink / raw)
  To: glommer; +Cc: linux-kernel, linux-fsdevel, linux-mm, xfs

From: Dave Chinner <dchinner@redhat.com>

Convert superblock shrinker to use the new count/scan API, and
propagate the API changes through to the filesystem callouts. The
filesystem callouts already use a count/scan API, so it's just
changing counters to longs to match the VM API.

This requires the dentry and inode shrinker callouts to be converted
to the count/scan API. This is mainly a mechanical change.

Signed-off-by: Dave Chinner <dchinner@redhat.com>
---
 fs/dcache.c         |   10 +++++---
 fs/inode.c          |    7 +++--
 fs/internal.h       |    3 +++
 fs/super.c          |   71 ++++++++++++++++++++++++++++++---------------------
 fs/xfs/xfs_icache.c |   17 +++++++-----
 fs/xfs/xfs_icache.h |    4 +--
 fs/xfs/xfs_super.c  |    8 +++---
 include/linux/fs.h  |    8 ++----
 8 files changed, 74 insertions(+), 54 deletions(-)

diff --git a/fs/dcache.c b/fs/dcache.c
index 0124a84..ca647b8 100644
--- a/fs/dcache.c
+++ b/fs/dcache.c
@@ -892,11 +892,12 @@ static void shrink_dentry_list(struct list_head *list)
  * This function may fail to free any resources if all the dentries are in
  * use.
  */
-void prune_dcache_sb(struct super_block *sb, int count)
+long prune_dcache_sb(struct super_block *sb, long nr_to_scan)
 {
 	struct dentry *dentry;
 	LIST_HEAD(referenced);
 	LIST_HEAD(tmp);
+	long freed = 0;
 
 relock:
 	spin_lock(&sb->s_dentry_lru_lock);
@@ -921,7 +922,8 @@ relock:
 			this_cpu_dec(nr_dentry_unused);
 			sb->s_nr_dentry_unused--;
 			spin_unlock(&dentry->d_lock);
-			if (!--count)
+			freed++;
+			if (!--nr_to_scan)
 				break;
 		}
 		cond_resched_lock(&sb->s_dentry_lru_lock);
@@ -931,6 +933,7 @@ relock:
 	spin_unlock(&sb->s_dentry_lru_lock);
 
 	shrink_dentry_list(&tmp);
+	return freed;
 }
 
 /*
@@ -1317,9 +1320,8 @@ rename_retry:
 void shrink_dcache_parent(struct dentry * parent)
 {
 	LIST_HEAD(dispose);
-	int found;
 
-	while ((found = select_parent(parent, &dispose)) != 0)
+	while (select_parent(parent, &dispose))
 		shrink_dentry_list(&dispose);
 }
 EXPORT_SYMBOL(shrink_dcache_parent);
diff --git a/fs/inode.c b/fs/inode.c
index c9fb382..3624ae0 100644
--- a/fs/inode.c
+++ b/fs/inode.c
@@ -691,10 +691,11 @@ static int can_unuse(struct inode *inode)
  * LRU does not have strict ordering. Hence we don't want to reclaim inodes
  * with this flag set because they are the inodes that are out of order.
  */
-void prune_icache_sb(struct super_block *sb, int nr_to_scan)
+long prune_icache_sb(struct super_block *sb, long nr_to_scan)
 {
 	LIST_HEAD(freeable);
-	int nr_scanned;
+	long nr_scanned;
+	long freed = 0;
 	unsigned long reap = 0;
 
 	spin_lock(&sb->s_inode_lru_lock);
@@ -764,6 +765,7 @@ void prune_icache_sb(struct super_block *sb, int nr_to_scan)
 		list_move(&inode->i_lru, &freeable);
 		sb->s_nr_inodes_unused--;
 		this_cpu_dec(nr_unused);
+		freed++;
 	}
 	if (current_is_kswapd())
 		__count_vm_events(KSWAPD_INODESTEAL, reap);
@@ -774,6 +776,7 @@ void prune_icache_sb(struct super_block *sb, int nr_to_scan)
 		current->reclaim_state->reclaimed_slab += reap;
 
 	dispose_list(&freeable);
+	return freed;
 }
 
 static void __wait_on_freeing_inode(struct inode *inode);
diff --git a/fs/internal.h b/fs/internal.h
index 916b7cb..7d7908b 100644
--- a/fs/internal.h
+++ b/fs/internal.h
@@ -110,6 +110,8 @@ extern int open_check_o_direct(struct file *f);
  * inode.c
  */
 extern spinlock_t inode_sb_list_lock;
+extern long prune_icache_sb(struct super_block *sb, long nr_to_scan);
+
 
 /*
  * fs-writeback.c
@@ -124,3 +126,4 @@ extern int invalidate_inodes(struct super_block *, bool);
  * dcache.c
  */
 extern struct dentry *__d_alloc(struct super_block *, const struct qstr *);
+extern long prune_dcache_sb(struct super_block *sb, long nr_to_scan);
diff --git a/fs/super.c b/fs/super.c
index 21abf02..fda6f12 100644
--- a/fs/super.c
+++ b/fs/super.c
@@ -53,11 +53,14 @@ static char *sb_writers_name[SB_FREEZE_LEVELS] = {
  * shrinker path and that leads to deadlock on the shrinker_rwsem. Hence we
  * take a passive reference to the superblock to avoid this from occurring.
  */
-static int prune_super(struct shrinker *shrink, struct shrink_control *sc)
+static long super_cache_scan(struct shrinker *shrink, struct shrink_control *sc)
 {
 	struct super_block *sb;
-	int	fs_objects = 0;
-	int	total_objects;
+	long	fs_objects = 0;
+	long	total_objects;
+	long	freed = 0;
+	long	dentries;
+	long	inodes;
 
 	sb = container_of(shrink, struct super_block, s_shrink);
 
@@ -65,7 +68,7 @@ static int prune_super(struct shrinker *shrink, struct shrink_control *sc)
 	 * Deadlock avoidance.  We may hold various FS locks, and we don't want
 	 * to recurse into the FS that called us in clear_inode() and friends..
 	 */
-	if (sc->nr_to_scan && !(sc->gfp_mask & __GFP_FS))
+	if (!(sc->gfp_mask & __GFP_FS))
 		return -1;
 
 	if (!grab_super_passive(sb))
@@ -77,33 +80,42 @@ static int prune_super(struct shrinker *shrink, struct shrink_control *sc)
 	total_objects = sb->s_nr_dentry_unused +
 			sb->s_nr_inodes_unused + fs_objects + 1;
 
-	if (sc->nr_to_scan) {
-		int	dentries;
-		int	inodes;
-
-		/* proportion the scan between the caches */
-		dentries = (sc->nr_to_scan * sb->s_nr_dentry_unused) /
-							total_objects;
-		inodes = (sc->nr_to_scan * sb->s_nr_inodes_unused) /
-							total_objects;
-		if (fs_objects)
-			fs_objects = (sc->nr_to_scan * fs_objects) /
-							total_objects;
-		/*
-		 * prune the dcache first as the icache is pinned by it, then
-		 * prune the icache, followed by the filesystem specific caches
-		 */
-		prune_dcache_sb(sb, dentries);
-		prune_icache_sb(sb, inodes);
+	/* proportion the scan between the caches */
+	dentries = (sc->nr_to_scan * sb->s_nr_dentry_unused) / total_objects;
+	inodes = (sc->nr_to_scan * sb->s_nr_inodes_unused) / total_objects;
 
-		if (fs_objects && sb->s_op->free_cached_objects) {
-			sb->s_op->free_cached_objects(sb, fs_objects);
-			fs_objects = sb->s_op->nr_cached_objects(sb);
-		}
-		total_objects = sb->s_nr_dentry_unused +
-				sb->s_nr_inodes_unused + fs_objects;
+	/*
+	 * prune the dcache first as the icache is pinned by it, then
+	 * prune the icache, followed by the filesystem specific caches
+	 */
+	freed = prune_dcache_sb(sb, dentries);
+	freed += prune_icache_sb(sb, inodes);
+
+	if (fs_objects) {
+		fs_objects = (sc->nr_to_scan * fs_objects) / total_objects;
+		freed += sb->s_op->free_cached_objects(sb, fs_objects);
 	}
 
+	drop_super(sb);
+	return freed;
+}
+
+static long super_cache_count(struct shrinker *shrink, struct shrink_control *sc)
+{
+	struct super_block *sb;
+	long	total_objects = 0;
+
+	sb = container_of(shrink, struct super_block, s_shrink);
+
+	if (!grab_super_passive(sb))
+		return -1;
+
+	if (sb->s_op && sb->s_op->nr_cached_objects)
+		total_objects = sb->s_op->nr_cached_objects(sb);
+
+	total_objects += sb->s_nr_dentry_unused;
+	total_objects += sb->s_nr_inodes_unused;
+
 	total_objects = (total_objects / 100) * sysctl_vfs_cache_pressure;
 	drop_super(sb);
 	return total_objects;
@@ -217,7 +229,8 @@ static struct super_block *alloc_super(struct file_system_type *type, int flags)
 		s->cleancache_poolid = -1;
 
 		s->s_shrink.seeks = DEFAULT_SEEKS;
-		s->s_shrink.shrink = prune_super;
+		s->s_shrink.scan_objects = super_cache_scan;
+		s->s_shrink.count_objects = super_cache_count;
 		s->s_shrink.batch = 1024;
 	}
 out:
diff --git a/fs/xfs/xfs_icache.c b/fs/xfs/xfs_icache.c
index 96e344e..2f91e2b 100644
--- a/fs/xfs/xfs_icache.c
+++ b/fs/xfs/xfs_icache.c
@@ -1029,7 +1029,7 @@ STATIC int
 xfs_reclaim_inodes_ag(
 	struct xfs_mount	*mp,
 	int			flags,
-	int			*nr_to_scan)
+	long			*nr_to_scan)
 {
 	struct xfs_perag	*pag;
 	int			error = 0;
@@ -1150,7 +1150,7 @@ xfs_reclaim_inodes(
 	xfs_mount_t	*mp,
 	int		mode)
 {
-	int		nr_to_scan = INT_MAX;
+	long		nr_to_scan = LONG_MAX;
 
 	return xfs_reclaim_inodes_ag(mp, mode, &nr_to_scan);
 }
@@ -1164,29 +1164,32 @@ xfs_reclaim_inodes(
  * them to be cleaned, which we hope will not be very long due to the
  * background walker having already kicked the IO off on those dirty inodes.
  */
-void
+long
 xfs_reclaim_inodes_nr(
 	struct xfs_mount	*mp,
-	int			nr_to_scan)
+	long			nr_to_scan)
 {
+	long nr = nr_to_scan;
+
 	/* kick background reclaimer and push the AIL */
 	xfs_reclaim_work_queue(mp);
 	xfs_ail_push_all(mp->m_ail);
 
-	xfs_reclaim_inodes_ag(mp, SYNC_TRYLOCK | SYNC_WAIT, &nr_to_scan);
+	xfs_reclaim_inodes_ag(mp, SYNC_TRYLOCK | SYNC_WAIT, &nr);
+	return nr_to_scan - nr;
 }
 
 /*
  * Return the number of reclaimable inodes in the filesystem for
  * the shrinker to determine how much to reclaim.
  */
-int
+long
 xfs_reclaim_inodes_count(
 	struct xfs_mount	*mp)
 {
 	struct xfs_perag	*pag;
 	xfs_agnumber_t		ag = 0;
-	int			reclaimable = 0;
+	long			reclaimable = 0;
 
 	while ((pag = xfs_perag_get_tag(mp, ag, XFS_ICI_RECLAIM_TAG))) {
 		ag = pag->pag_agno + 1;
diff --git a/fs/xfs/xfs_icache.h b/fs/xfs/xfs_icache.h
index e0f138c..c860d07 100644
--- a/fs/xfs/xfs_icache.h
+++ b/fs/xfs/xfs_icache.h
@@ -30,8 +30,8 @@ int xfs_iget(struct xfs_mount *mp, struct xfs_trans *tp, xfs_ino_t ino,
 void xfs_reclaim_worker(struct work_struct *work);
 
 int xfs_reclaim_inodes(struct xfs_mount *mp, int mode);
-int xfs_reclaim_inodes_count(struct xfs_mount *mp);
-void xfs_reclaim_inodes_nr(struct xfs_mount *mp, int nr_to_scan);
+long xfs_reclaim_inodes_count(struct xfs_mount *mp);
+long xfs_reclaim_inodes_nr(struct xfs_mount *mp, long nr_to_scan);
 
 void xfs_inode_set_reclaim_tag(struct xfs_inode *ip);
 
diff --git a/fs/xfs/xfs_super.c b/fs/xfs/xfs_super.c
index ab8839b..00aa61d 100644
--- a/fs/xfs/xfs_super.c
+++ b/fs/xfs/xfs_super.c
@@ -1514,19 +1514,19 @@ xfs_fs_mount(
 	return mount_bdev(fs_type, flags, dev_name, data, xfs_fs_fill_super);
 }
 
-static int
+static long
 xfs_fs_nr_cached_objects(
 	struct super_block	*sb)
 {
 	return xfs_reclaim_inodes_count(XFS_M(sb));
 }
 
-static void
+static long
 xfs_fs_free_cached_objects(
 	struct super_block	*sb,
-	int			nr_to_scan)
+	long			nr_to_scan)
 {
-	xfs_reclaim_inodes_nr(XFS_M(sb), nr_to_scan);
+	return xfs_reclaim_inodes_nr(XFS_M(sb), nr_to_scan);
 }
 
 static const struct super_operations xfs_super_operations = {
diff --git a/include/linux/fs.h b/include/linux/fs.h
index 0845283..13833e3 100644
--- a/include/linux/fs.h
+++ b/include/linux/fs.h
@@ -1325,10 +1325,6 @@ struct super_block {
 	int s_readonly_remount;
 };
 
-/* superblock cache pruning functions */
-extern void prune_icache_sb(struct super_block *sb, int nr_to_scan);
-extern void prune_dcache_sb(struct super_block *sb, int nr_to_scan);
-
 extern struct timespec current_fs_time(struct super_block *sb);
 
 /*
@@ -1621,8 +1617,8 @@ struct super_operations {
 	ssize_t (*quota_write)(struct super_block *, int, const char *, size_t, loff_t);
 #endif
 	int (*bdev_try_to_free_page)(struct super_block*, struct page*, gfp_t);
-	int (*nr_cached_objects)(struct super_block *);
-	void (*free_cached_objects)(struct super_block *, int);
+	long (*nr_cached_objects)(struct super_block *);
+	long (*free_cached_objects)(struct super_block *, long);
 };
 
 /*
-- 
1.7.10


* [PATCH 06/19] list: add a new LRU list type
  2012-11-27 23:14 [RFC, PATCH 00/19] Numa aware LRU lists and shrinkers Dave Chinner
                   ` (4 preceding siblings ...)
  2012-11-27 23:14 ` [PATCH 05/19] shrinker: convert superblock shrinkers to new API Dave Chinner
@ 2012-11-27 23:14 ` Dave Chinner
  2012-11-28 16:10   ` Christoph Hellwig
  2012-11-27 23:14 ` [PATCH 07/19] inode: convert inode lru list to generic lru list code Dave Chinner
                   ` (15 subsequent siblings)
  21 siblings, 1 reply; 57+ messages in thread
From: Dave Chinner @ 2012-11-27 23:14 UTC (permalink / raw)
  To: glommer; +Cc: linux-kernel, linux-fsdevel, linux-mm, xfs

From: Dave Chinner <dchinner@redhat.com>

Several subsystems use the same construct for LRU lists - a list
head, a spin lock and an item count. They also use exactly the same
code for adding and removing items from the LRU. Create a generic
type for these LRU lists.

This is the beginning of generic, node aware LRUs for shrinkers to
work with.

Signed-off-by: Dave Chinner <dchinner@redhat.com>
---
 include/linux/list_lru.h |   36 ++++++++++++++
 lib/Makefile             |    2 +-
 lib/list_lru.c           |  117 ++++++++++++++++++++++++++++++++++++++++++++++
 3 files changed, 154 insertions(+), 1 deletion(-)
 create mode 100644 include/linux/list_lru.h
 create mode 100644 lib/list_lru.c

diff --git a/include/linux/list_lru.h b/include/linux/list_lru.h
new file mode 100644
index 0000000..3423949
--- /dev/null
+++ b/include/linux/list_lru.h
@@ -0,0 +1,36 @@
+/*
+ * Copyright (c) 2010-2012 Red Hat, Inc. All rights reserved.
+ * Author: David Chinner
+ *
+ * Generic LRU infrastructure
+ */
+#ifndef _LRU_LIST_H
+#define _LRU_LIST_H 0
+
+#include <linux/list.h>
+
+struct list_lru {
+	spinlock_t		lock;
+	struct list_head	list;
+	long			nr_items;
+};
+
+int list_lru_init(struct list_lru *lru);
+int list_lru_add(struct list_lru *lru, struct list_head *item);
+int list_lru_del(struct list_lru *lru, struct list_head *item);
+
+static inline long list_lru_count(struct list_lru *lru)
+{
+	return lru->nr_items;
+}
+
+typedef int (*list_lru_walk_cb)(struct list_head *item, spinlock_t *lock,
+				void *cb_arg);
+typedef void (*list_lru_dispose_cb)(struct list_head *dispose_list);
+
+long list_lru_walk(struct list_lru *lru, list_lru_walk_cb isolate,
+		   void *cb_arg, long nr_to_walk);
+
+long list_lru_dispose_all(struct list_lru *lru, list_lru_dispose_cb dispose);
+
+#endif /* _LRU_LIST_H */
diff --git a/lib/Makefile b/lib/Makefile
index 821a162..a0849d7 100644
--- a/lib/Makefile
+++ b/lib/Makefile
@@ -12,7 +12,7 @@ lib-y := ctype.o string.o vsprintf.o cmdline.o \
 	 idr.o int_sqrt.o extable.o \
 	 sha1.o md5.o irq_regs.o reciprocal_div.o argv_split.o \
 	 proportions.o flex_proportions.o prio_heap.o ratelimit.o show_mem.o \
-	 is_single_threaded.o plist.o decompress.o
+	 is_single_threaded.o plist.o decompress.o list_lru.o
 
 lib-$(CONFIG_MMU) += ioremap.o
 lib-$(CONFIG_SMP) += cpumask.o
diff --git a/lib/list_lru.c b/lib/list_lru.c
new file mode 100644
index 0000000..475d0e9
--- /dev/null
+++ b/lib/list_lru.c
@@ -0,0 +1,117 @@
+/*
+ * Copyright (c) 2010-2012 Red Hat, Inc. All rights reserved.
+ * Author: David Chinner
+ *
+ * Generic LRU infrastructure
+ */
+#include <linux/kernel.h>
+#include <linux/module.h>
+#include <linux/list_lru.h>
+
+int
+list_lru_add(
+	struct list_lru	*lru,
+	struct list_head *item)
+{
+	spin_lock(&lru->lock);
+	if (list_empty(item)) {
+		list_add_tail(item, &lru->list);
+		lru->nr_items++;
+		spin_unlock(&lru->lock);
+		return 1;
+	}
+	spin_unlock(&lru->lock);
+	return 0;
+}
+EXPORT_SYMBOL_GPL(list_lru_add);
+
+int
+list_lru_del(
+	struct list_lru	*lru,
+	struct list_head *item)
+{
+	spin_lock(&lru->lock);
+	if (!list_empty(item)) {
+		list_del_init(item);
+		lru->nr_items--;
+		spin_unlock(&lru->lock);
+		return 1;
+	}
+	spin_unlock(&lru->lock);
+	return 0;
+}
+EXPORT_SYMBOL_GPL(list_lru_del);
+
+long
+list_lru_walk(
+	struct list_lru *lru,
+	list_lru_walk_cb isolate,
+	void		*cb_arg,
+	long		nr_to_walk)
+{
+	struct list_head *item, *n;
+	long removed = 0;
+restart:
+	spin_lock(&lru->lock);
+	list_for_each_safe(item, n, &lru->list) {
+		int ret;
+
+		if (nr_to_walk-- < 0)
+			break;
+
+		ret = isolate(item, &lru->lock, cb_arg);
+		switch (ret) {
+		case 0:	/* item removed from list */
+			lru->nr_items--;
+			removed++;
+			break;
+		case 1: /* item referenced, give another pass */
+			list_move_tail(item, &lru->list);
+			break;
+		case 2: /* item cannot be locked, skip */
+			break;
+		case 3: /* item not freeable, lock dropped */
+			goto restart;
+		default:
+			BUG();
+		}
+	}
+	spin_unlock(&lru->lock);
+	return removed;
+}
+EXPORT_SYMBOL_GPL(list_lru_walk);
+
+long
+list_lru_dispose_all(
+	struct list_lru *lru,
+	list_lru_dispose_cb dispose)
+{
+	long disposed = 0;
+	LIST_HEAD(dispose_list);
+
+	spin_lock(&lru->lock);
+	while (!list_empty(&lru->list)) {
+		list_splice_init(&lru->list, &dispose_list);
+		disposed += lru->nr_items;
+		lru->nr_items = 0;
+		spin_unlock(&lru->lock);
+
+		dispose(&dispose_list);
+
+		spin_lock(&lru->lock);
+	}
+	spin_unlock(&lru->lock);
+	return disposed;
+}
+
+int
+list_lru_init(
+	struct list_lru	*lru)
+{
+	spin_lock_init(&lru->lock);
+	INIT_LIST_HEAD(&lru->list);
+	lru->nr_items = 0;
+
+	return 0;
+}
+EXPORT_SYMBOL_GPL(list_lru_init);
-- 
1.7.10


* [PATCH 07/19] inode: convert inode lru list to generic lru list code.
  2012-11-27 23:14 [RFC, PATCH 00/19] Numa aware LRU lists and shrinkers Dave Chinner
                   ` (5 preceding siblings ...)
  2012-11-27 23:14 ` [PATCH 06/19] list: add a new LRU list type Dave Chinner
@ 2012-11-27 23:14 ` Dave Chinner
  2012-11-27 23:14 ` [PATCH 08/19] dcache: convert to use new lru list infrastructure Dave Chinner
                   ` (14 subsequent siblings)
  21 siblings, 0 replies; 57+ messages in thread
From: Dave Chinner @ 2012-11-27 23:14 UTC (permalink / raw)
  To: glommer; +Cc: linux-kernel, linux-fsdevel, linux-mm, xfs

From: Dave Chinner <dchinner@redhat.com>

Signed-off-by: Dave Chinner <dchinner@redhat.com>
---
 fs/inode.c         |  173 +++++++++++++++++++++-------------------------------
 fs/super.c         |   11 ++--
 include/linux/fs.h |    6 +-
 3 files changed, 75 insertions(+), 115 deletions(-)

diff --git a/fs/inode.c b/fs/inode.c
index 3624ae0..2662305 100644
--- a/fs/inode.c
+++ b/fs/inode.c
@@ -17,6 +17,7 @@
 #include <linux/prefetch.h>
 #include <linux/buffer_head.h> /* for inode_has_buffers */
 #include <linux/ratelimit.h>
+#include <linux/list_lru.h>
 #include "internal.h"
 
 /*
@@ -24,7 +25,7 @@
  *
  * inode->i_lock protects:
  *   inode->i_state, inode->i_hash, __iget()
- * inode->i_sb->s_inode_lru_lock protects:
+ * Inode LRU list locks protect:
  *   inode->i_sb->s_inode_lru, inode->i_lru
  * inode_sb_list_lock protects:
  *   sb->s_inodes, inode->i_sb_list
@@ -37,7 +38,7 @@
  *
  * inode_sb_list_lock
  *   inode->i_lock
- *     inode->i_sb->s_inode_lru_lock
+ *     Inode LRU list locks
  *
  * bdi->wb.list_lock
  *   inode->i_lock
@@ -399,24 +400,14 @@ EXPORT_SYMBOL(ihold);
 
 static void inode_lru_list_add(struct inode *inode)
 {
-	spin_lock(&inode->i_sb->s_inode_lru_lock);
-	if (list_empty(&inode->i_lru)) {
-		list_add(&inode->i_lru, &inode->i_sb->s_inode_lru);
-		inode->i_sb->s_nr_inodes_unused++;
+	if (list_lru_add(&inode->i_sb->s_inode_lru, &inode->i_lru))
 		this_cpu_inc(nr_unused);
-	}
-	spin_unlock(&inode->i_sb->s_inode_lru_lock);
 }
 
 static void inode_lru_list_del(struct inode *inode)
 {
-	spin_lock(&inode->i_sb->s_inode_lru_lock);
-	if (!list_empty(&inode->i_lru)) {
-		list_del_init(&inode->i_lru);
-		inode->i_sb->s_nr_inodes_unused--;
+	if (list_lru_del(&inode->i_sb->s_inode_lru, &inode->i_lru))
 		this_cpu_dec(nr_unused);
-	}
-	spin_unlock(&inode->i_sb->s_inode_lru_lock);
 }
 
 /**
@@ -660,24 +651,8 @@ int invalidate_inodes(struct super_block *sb, bool kill_dirty)
 	return busy;
 }
 
-static int can_unuse(struct inode *inode)
-{
-	if (inode->i_state & ~I_REFERENCED)
-		return 0;
-	if (inode_has_buffers(inode))
-		return 0;
-	if (atomic_read(&inode->i_count))
-		return 0;
-	if (inode->i_data.nrpages)
-		return 0;
-	return 1;
-}
-
 /*
- * Walk the superblock inode LRU for freeable inodes and attempt to free them.
- * This is called from the superblock shrinker function with a number of inodes
- * to trim from the LRU. Inodes to be freed are moved to a temporary list and
- * then are freed outside inode_lock by dispose_list().
+ * Isolate the inode from the LRU in preparation for freeing it.
  *
  * Any inodes which are pinned purely because of attached pagecache have their
  * pagecache removed.  If the inode has metadata buffers attached to
@@ -691,90 +666,78 @@ static int can_unuse(struct inode *inode)
  * LRU does not have strict ordering. Hence we don't want to reclaim inodes
  * with this flag set because they are the inodes that are out of order.
  */
-long prune_icache_sb(struct super_block *sb, long nr_to_scan)
+static int inode_lru_isolate(struct list_head *item, spinlock_t *lru_lock,
+				void *arg)
 {
-	LIST_HEAD(freeable);
-	long nr_scanned;
-	long freed = 0;
-	unsigned long reap = 0;
+	struct list_head *freeable = arg;
+	struct inode	*inode = container_of(item, struct inode, i_lru);
 
-	spin_lock(&sb->s_inode_lru_lock);
-	for (nr_scanned = nr_to_scan; nr_scanned >= 0; nr_scanned--) {
-		struct inode *inode;
+	/*
+	 * we are inverting the lru lock/inode->i_lock here, so use a trylock.
+	 * If we fail to get the lock, just skip it.
+	 */
+	if (!spin_trylock(&inode->i_lock))
+		return 2;
 
-		if (list_empty(&sb->s_inode_lru))
-			break;
+	/*
+	 * Referenced or dirty inodes are still in use. Give them another pass
+	 * through the LRU as we canot reclaim them now.
+	 * through the LRU as we cannot reclaim them now.
+	if (atomic_read(&inode->i_count) ||
+	    (inode->i_state & ~I_REFERENCED)) {
+		list_del_init(&inode->i_lru);
+		spin_unlock(&inode->i_lock);
+		this_cpu_dec(nr_unused);
+		return 0;
+	}
 
-		inode = list_entry(sb->s_inode_lru.prev, struct inode, i_lru);
+	/* recently referenced inodes get one more pass */
+	if (inode->i_state & I_REFERENCED) {
+		inode->i_state &= ~I_REFERENCED;
+		spin_unlock(&inode->i_lock);
+		return 1;
+	}
 
-		/*
-		 * we are inverting the sb->s_inode_lru_lock/inode->i_lock here,
-		 * so use a trylock. If we fail to get the lock, just move the
-		 * inode to the back of the list so we don't spin on it.
-		 */
-		if (!spin_trylock(&inode->i_lock)) {
-			list_move_tail(&inode->i_lru, &sb->s_inode_lru);
-			continue;
+	if (inode_has_buffers(inode) || inode->i_data.nrpages) {
+		__iget(inode);
+		spin_unlock(&inode->i_lock);
+		spin_unlock(lru_lock);
+		if (remove_inode_buffers(inode)) {
+			unsigned long reap;
+			reap = invalidate_mapping_pages(&inode->i_data, 0, -1);
+			if (current_is_kswapd())
+				__count_vm_events(KSWAPD_INODESTEAL, reap);
+			else
+				__count_vm_events(PGINODESTEAL, reap);
+			if (current->reclaim_state)
+				current->reclaim_state->reclaimed_slab += reap;
 		}
+		iput(inode);
+		return 3;
+	}
 
-		/*
-		 * Referenced or dirty inodes are still in use. Give them
-		 * another pass through the LRU as we canot reclaim them now.
-		 */
-		if (atomic_read(&inode->i_count) ||
-		    (inode->i_state & ~I_REFERENCED)) {
-			list_del_init(&inode->i_lru);
-			spin_unlock(&inode->i_lock);
-			sb->s_nr_inodes_unused--;
-			this_cpu_dec(nr_unused);
-			continue;
-		}
+	WARN_ON(inode->i_state & I_NEW);
+	inode->i_state |= I_FREEING;
+	spin_unlock(&inode->i_lock);
 
-		/* recently referenced inodes get one more pass */
-		if (inode->i_state & I_REFERENCED) {
-			inode->i_state &= ~I_REFERENCED;
-			list_move(&inode->i_lru, &sb->s_inode_lru);
-			spin_unlock(&inode->i_lock);
-			continue;
-		}
-		if (inode_has_buffers(inode) || inode->i_data.nrpages) {
-			__iget(inode);
-			spin_unlock(&inode->i_lock);
-			spin_unlock(&sb->s_inode_lru_lock);
-			if (remove_inode_buffers(inode))
-				reap += invalidate_mapping_pages(&inode->i_data,
-								0, -1);
-			iput(inode);
-			spin_lock(&sb->s_inode_lru_lock);
-
-			if (inode != list_entry(sb->s_inode_lru.next,
-						struct inode, i_lru))
-				continue;	/* wrong inode or list_empty */
-			/* avoid lock inversions with trylock */
-			if (!spin_trylock(&inode->i_lock))
-				continue;
-			if (!can_unuse(inode)) {
-				spin_unlock(&inode->i_lock);
-				continue;
-			}
-		}
-		WARN_ON(inode->i_state & I_NEW);
-		inode->i_state |= I_FREEING;
-		spin_unlock(&inode->i_lock);
+	list_move(&inode->i_lru, freeable);
+	this_cpu_dec(nr_unused);
+	return 0;
+}
 
-		list_move(&inode->i_lru, &freeable);
-		sb->s_nr_inodes_unused--;
-		this_cpu_dec(nr_unused);
-		freed++;
-	}
-	if (current_is_kswapd())
-		__count_vm_events(KSWAPD_INODESTEAL, reap);
-	else
-		__count_vm_events(PGINODESTEAL, reap);
-	spin_unlock(&sb->s_inode_lru_lock);
-	if (current->reclaim_state)
-		current->reclaim_state->reclaimed_slab += reap;
+/*
+ * Walk the superblock inode LRU for freeable inodes and attempt to free them.
+ * This is called from the superblock shrinker function with a number of inodes
+ * to trim from the LRU. Inodes to be freed are moved to a temporary list and
+ * then are freed outside inode_lock by dispose_list().
+ */
+long prune_icache_sb(struct super_block *sb, long nr_to_scan)
+{
+	LIST_HEAD(freeable);
+	long freed;
 
+	freed = list_lru_walk(&sb->s_inode_lru, inode_lru_isolate,
+						&freeable, nr_to_scan);
 	dispose_list(&freeable);
 	return freed;
 }
diff --git a/fs/super.c b/fs/super.c
index fda6f12..a2f09c8 100644
--- a/fs/super.c
+++ b/fs/super.c
@@ -77,12 +77,12 @@ static long super_cache_scan(struct shrinker *shrink, struct shrink_control *sc)
 	if (sb->s_op && sb->s_op->nr_cached_objects)
 		fs_objects = sb->s_op->nr_cached_objects(sb);
 
-	total_objects = sb->s_nr_dentry_unused +
-			sb->s_nr_inodes_unused + fs_objects + 1;
+	inodes = list_lru_count(&sb->s_inode_lru);
+	total_objects = sb->s_nr_dentry_unused + inodes + fs_objects + 1;
 
 	/* proportion the scan between the caches */
 	dentries = (sc->nr_to_scan * sb->s_nr_dentry_unused) / total_objects;
-	inodes = (sc->nr_to_scan * sb->s_nr_inodes_unused) / total_objects;
+	inodes = (sc->nr_to_scan * inodes) / total_objects;
 
 	/*
 	 * prune the dcache first as the icache is pinned by it, then
@@ -114,7 +114,7 @@ static long super_cache_count(struct shrinker *shrink, struct shrink_control *sc
 		total_objects = sb->s_op->nr_cached_objects(sb);
 
 	total_objects += sb->s_nr_dentry_unused;
-	total_objects += sb->s_nr_inodes_unused;
+	total_objects += list_lru_count(&sb->s_inode_lru);
 
 	total_objects = (total_objects / 100) * sysctl_vfs_cache_pressure;
 	drop_super(sb);
@@ -195,8 +195,7 @@ static struct super_block *alloc_super(struct file_system_type *type, int flags)
 		INIT_LIST_HEAD(&s->s_inodes);
 		INIT_LIST_HEAD(&s->s_dentry_lru);
 		spin_lock_init(&s->s_dentry_lru_lock);
-		INIT_LIST_HEAD(&s->s_inode_lru);
-		spin_lock_init(&s->s_inode_lru_lock);
+		list_lru_init(&s->s_inode_lru);
 		INIT_LIST_HEAD(&s->s_mounts);
 		init_rwsem(&s->s_umount);
 		lockdep_set_class(&s->s_umount, &type->s_umount_key);
diff --git a/include/linux/fs.h b/include/linux/fs.h
index 13833e3..36b7db5 100644
--- a/include/linux/fs.h
+++ b/include/linux/fs.h
@@ -10,6 +10,7 @@
 #include <linux/stat.h>
 #include <linux/cache.h>
 #include <linux/list.h>
+#include <linux/list_lru.h>
 #include <linux/radix-tree.h>
 #include <linux/rbtree.h>
 #include <linux/init.h>
@@ -1268,10 +1269,7 @@ struct super_block {
 	struct list_head	s_dentry_lru;	/* unused dentry lru */
 	int			s_nr_dentry_unused;	/* # of dentry on lru */
 
-	/* s_inode_lru_lock protects s_inode_lru and s_nr_inodes_unused */
-	spinlock_t		s_inode_lru_lock ____cacheline_aligned_in_smp;
-	struct list_head	s_inode_lru;		/* unused inode lru */
-	int			s_nr_inodes_unused;	/* # of inodes on lru */
+	struct list_lru		s_inode_lru ____cacheline_aligned_in_smp;
 
 	struct block_device	*s_bdev;
 	struct backing_dev_info *s_bdi;
-- 
1.7.10


^ permalink raw reply related	[flat|nested] 57+ messages in thread

* [PATCH 08/19] dcache: convert to use new lru list infrastructure
  2012-11-27 23:14 [RFC, PATCH 00/19] Numa aware LRU lists and shrinkers Dave Chinner
                   ` (6 preceding siblings ...)
  2012-11-27 23:14 ` [PATCH 07/19] inode: convert inode lru list to generic lru list code Dave Chinner
@ 2012-11-27 23:14 ` Dave Chinner
  2012-11-27 23:14 ` [PATCH 09/19] list_lru: per-node " Dave Chinner
                   ` (13 subsequent siblings)
  21 siblings, 0 replies; 57+ messages in thread
From: Dave Chinner @ 2012-11-27 23:14 UTC (permalink / raw)
  To: glommer; +Cc: linux-kernel, linux-fsdevel, linux-mm, xfs

From: Dave Chinner <dchinner@redhat.com>

Signed-off-by: Dave Chinner <dchinner@redhat.com>
---
 fs/dcache.c        |  171 +++++++++++++++++++++-------------------------------
 fs/super.c         |   10 +--
 include/linux/fs.h |   15 +++--
 3 files changed, 82 insertions(+), 114 deletions(-)

diff --git a/fs/dcache.c b/fs/dcache.c
index ca647b8..d72e388 100644
--- a/fs/dcache.c
+++ b/fs/dcache.c
@@ -37,6 +37,7 @@
 #include <linux/rculist_bl.h>
 #include <linux/prefetch.h>
 #include <linux/ratelimit.h>
+#include <linux/list_lru.h>
 #include "internal.h"
 #include "mount.h"
 
@@ -318,20 +319,8 @@ static void dentry_unlink_inode(struct dentry * dentry)
  */
 static void dentry_lru_add(struct dentry *dentry)
 {
-	if (list_empty(&dentry->d_lru)) {
-		spin_lock(&dentry->d_sb->s_dentry_lru_lock);
-		list_add(&dentry->d_lru, &dentry->d_sb->s_dentry_lru);
-		dentry->d_sb->s_nr_dentry_unused++;
+	if (list_lru_add(&dentry->d_sb->s_dentry_lru, &dentry->d_lru))
 		this_cpu_inc(nr_dentry_unused);
-		spin_unlock(&dentry->d_sb->s_dentry_lru_lock);
-	}
-}
-
-static void __dentry_lru_del(struct dentry *dentry)
-{
-	list_del_init(&dentry->d_lru);
-	dentry->d_sb->s_nr_dentry_unused--;
-	this_cpu_dec(nr_dentry_unused);
 }
 
 /*
@@ -341,11 +330,8 @@ static void dentry_lru_del(struct dentry *dentry)
 {
 	BUG_ON(dentry->d_flags & DCACHE_SHRINK_LIST);
 
-	if (!list_empty(&dentry->d_lru)) {
-		spin_lock(&dentry->d_sb->s_dentry_lru_lock);
-		__dentry_lru_del(dentry);
-		spin_unlock(&dentry->d_sb->s_dentry_lru_lock);
-	}
+	if (list_lru_del(&dentry->d_sb->s_dentry_lru, &dentry->d_lru))
+		this_cpu_dec(nr_dentry_unused);
 }
 
 /*
@@ -361,35 +347,19 @@ static void dentry_lru_del(struct dentry *dentry)
  */
 static void dentry_lru_prune(struct dentry *dentry)
 {
-	if (!list_empty(&dentry->d_lru)) {
+	int prune = dentry->d_flags & DCACHE_OP_PRUNE;
 
-		if (dentry->d_flags & DCACHE_OP_PRUNE)
-			dentry->d_op->d_prune(dentry);
-
-		if ((dentry->d_flags & DCACHE_SHRINK_LIST))
-			list_del_init(&dentry->d_lru);
-		else {
-			spin_lock(&dentry->d_sb->s_dentry_lru_lock);
-			__dentry_lru_del(dentry);
-			spin_unlock(&dentry->d_sb->s_dentry_lru_lock);
-		}
-		dentry->d_flags &= ~DCACHE_SHRINK_LIST;
-	}
-}
-
-static void dentry_lru_move_list(struct dentry *dentry, struct list_head *list)
-{
-	BUG_ON(dentry->d_flags & DCACHE_SHRINK_LIST);
-
-	spin_lock(&dentry->d_sb->s_dentry_lru_lock);
-	if (list_empty(&dentry->d_lru)) {
-		list_add_tail(&dentry->d_lru, list);
-	} else {
-		list_move_tail(&dentry->d_lru, list);
-		dentry->d_sb->s_nr_dentry_unused--;
+	if (!list_empty(&dentry->d_lru) &&
+	    (dentry->d_flags & DCACHE_SHRINK_LIST))
+		list_del_init(&dentry->d_lru);
+	else if (list_lru_del(&dentry->d_sb->s_dentry_lru, &dentry->d_lru))
 		this_cpu_dec(nr_dentry_unused);
-	}
-	spin_unlock(&dentry->d_sb->s_dentry_lru_lock);
+	else
+		prune = 0;
+
+	dentry->d_flags &= ~DCACHE_SHRINK_LIST;
+	if (prune)
+		dentry->d_op->d_prune(dentry);
 }
 
 /**
@@ -880,6 +850,51 @@ static void shrink_dentry_list(struct list_head *list)
 	rcu_read_unlock();
 }
 
+static int dentry_lru_isolate(struct list_head *item, spinlock_t *lru_lock,
+				void *arg)
+{
+	struct list_head *freeable = arg;
+	struct dentry	*dentry = container_of(item, struct dentry, d_lru);
+
+
+	/*
+	 * we are inverting the lru lock/dentry->d_lock here,
+	 * so use a trylock. If we fail to get the lock, just skip
+	 * it
+	 */
+	if (!spin_trylock(&dentry->d_lock))
+		return 2;
+
+	/*
+	 * Referenced dentries are still in use. If they have active
+	 * counts, just remove them from the LRU. Otherwise give them
+	 * another pass through the LRU.
+	 */
+	if (dentry->d_count) {
+		list_del_init(&dentry->d_lru);
+		spin_unlock(&dentry->d_lock);
+		return 0;
+	}
+
+	if (dentry->d_flags & DCACHE_REFERENCED) {
+		dentry->d_flags &= ~DCACHE_REFERENCED;
+		spin_unlock(&dentry->d_lock);
+
+		/*
+		 * XXX: this list move should be done under d_lock. Need to
+		 * determine if it is safe just to do it under the lru lock.
+		 */
+		return 1;
+	}
+
+	dentry->d_flags |= DCACHE_SHRINK_LIST;
+	list_move_tail(&dentry->d_lru, freeable);
+	this_cpu_dec(nr_dentry_unused);
+	spin_unlock(&dentry->d_lock);
+
+	return 0;
+}
+
 /**
  * prune_dcache_sb - shrink the dcache
  * @sb: superblock
@@ -894,45 +909,12 @@ static void shrink_dentry_list(struct list_head *list)
  */
 long prune_dcache_sb(struct super_block *sb, long nr_to_scan)
 {
-	struct dentry *dentry;
-	LIST_HEAD(referenced);
-	LIST_HEAD(tmp);
-	long freed = 0;
-
-relock:
-	spin_lock(&sb->s_dentry_lru_lock);
-	while (!list_empty(&sb->s_dentry_lru)) {
-		dentry = list_entry(sb->s_dentry_lru.prev,
-				struct dentry, d_lru);
-		BUG_ON(dentry->d_sb != sb);
-
-		if (!spin_trylock(&dentry->d_lock)) {
-			spin_unlock(&sb->s_dentry_lru_lock);
-			cpu_relax();
-			goto relock;
-		}
-
-		if (dentry->d_flags & DCACHE_REFERENCED) {
-			dentry->d_flags &= ~DCACHE_REFERENCED;
-			list_move(&dentry->d_lru, &referenced);
-			spin_unlock(&dentry->d_lock);
-		} else {
-			list_move_tail(&dentry->d_lru, &tmp);
-			dentry->d_flags |= DCACHE_SHRINK_LIST;
-			this_cpu_dec(nr_dentry_unused);
-			sb->s_nr_dentry_unused--;
-			spin_unlock(&dentry->d_lock);
-			freed++;
-			if (!--nr_to_scan)
-				break;
-		}
-		cond_resched_lock(&sb->s_dentry_lru_lock);
-	}
-	if (!list_empty(&referenced))
-		list_splice(&referenced, &sb->s_dentry_lru);
-	spin_unlock(&sb->s_dentry_lru_lock);
+	LIST_HEAD(dispose);
+	long freed;
 
-	shrink_dentry_list(&tmp);
+	freed = list_lru_walk(&sb->s_dentry_lru, dentry_lru_isolate,
+			      &dispose, nr_to_scan);
+	shrink_dentry_list(&dispose);
 	return freed;
 }
 
@@ -967,24 +949,10 @@ shrink_dcache_list(
  */
 void shrink_dcache_sb(struct super_block *sb)
 {
-	LIST_HEAD(tmp);
-
-	spin_lock(&sb->s_dentry_lru_lock);
-	while (!list_empty(&sb->s_dentry_lru)) {
-		list_splice_init(&sb->s_dentry_lru, &tmp);
-
-		/*
-		 * account for removal here so we don't need to handle it later
-		 * even though the dentry is no longer on the lru list.
-		 */
-		this_cpu_sub(nr_dentry_unused, sb->s_nr_dentry_unused);
-		sb->s_nr_dentry_unused = 0;
+	long freed;
 
-		spin_unlock(&sb->s_dentry_lru_lock);
-		shrink_dcache_list(&tmp);
-		spin_lock(&sb->s_dentry_lru_lock);
-	}
-	spin_unlock(&sb->s_dentry_lru_lock);
+	freed = list_lru_dispose_all(&sb->s_dentry_lru, shrink_dcache_list);
+	this_cpu_sub(nr_dentry_unused, freed);
 }
 EXPORT_SYMBOL(shrink_dcache_sb);
 
@@ -1255,7 +1223,8 @@ resume:
 		if (dentry->d_count) {
 			dentry_lru_del(dentry);
 		} else if (!(dentry->d_flags & DCACHE_SHRINK_LIST)) {
-			dentry_lru_move_list(dentry, dispose);
+			dentry_lru_del(dentry);
+			list_add_tail(&dentry->d_lru, dispose);
 			dentry->d_flags |= DCACHE_SHRINK_LIST;
 			found++;
 		}
diff --git a/fs/super.c b/fs/super.c
index a2f09c8..b1d24ef 100644
--- a/fs/super.c
+++ b/fs/super.c
@@ -78,10 +78,11 @@ static long super_cache_scan(struct shrinker *shrink, struct shrink_control *sc)
 		fs_objects = sb->s_op->nr_cached_objects(sb);
 
 	inodes = list_lru_count(&sb->s_inode_lru);
-	total_objects = sb->s_nr_dentry_unused + inodes + fs_objects + 1;
+	dentries = list_lru_count(&sb->s_dentry_lru);
+	total_objects = dentries + inodes + fs_objects + 1;
 
 	/* proportion the scan between the caches */
-	dentries = (sc->nr_to_scan * sb->s_nr_dentry_unused) / total_objects;
+	dentries = (sc->nr_to_scan * dentries) / total_objects;
 	inodes = (sc->nr_to_scan * inodes) / total_objects;
 
 	/*
@@ -113,7 +114,7 @@ static long super_cache_count(struct shrinker *shrink, struct shrink_control *sc
 	if (sb->s_op && sb->s_op->nr_cached_objects)
 		total_objects = sb->s_op->nr_cached_objects(sb);
 
-	total_objects += sb->s_nr_dentry_unused;
+	total_objects += list_lru_count(&sb->s_dentry_lru);
 	total_objects += list_lru_count(&sb->s_inode_lru);
 
 	total_objects = (total_objects / 100) * sysctl_vfs_cache_pressure;
@@ -193,8 +194,7 @@ static struct super_block *alloc_super(struct file_system_type *type, int flags)
 		INIT_HLIST_NODE(&s->s_instances);
 		INIT_HLIST_BL_HEAD(&s->s_anon);
 		INIT_LIST_HEAD(&s->s_inodes);
-		INIT_LIST_HEAD(&s->s_dentry_lru);
-		spin_lock_init(&s->s_dentry_lru_lock);
+		list_lru_init(&s->s_dentry_lru);
 		list_lru_init(&s->s_inode_lru);
 		INIT_LIST_HEAD(&s->s_mounts);
 		init_rwsem(&s->s_umount);
diff --git a/include/linux/fs.h b/include/linux/fs.h
index 36b7db5..befa46f 100644
--- a/include/linux/fs.h
+++ b/include/linux/fs.h
@@ -1263,14 +1263,6 @@ struct super_block {
 	struct list_head	s_files;
 #endif
 	struct list_head	s_mounts;	/* list of mounts; _not_ for fs use */
-
-	/* s_dentry_lru_lock protects s_dentry_lru and s_nr_dentry_unused */
-	spinlock_t		s_dentry_lru_lock ____cacheline_aligned_in_smp;
-	struct list_head	s_dentry_lru;	/* unused dentry lru */
-	int			s_nr_dentry_unused;	/* # of dentry on lru */
-
-	struct list_lru		s_inode_lru ____cacheline_aligned_in_smp;
-
 	struct block_device	*s_bdev;
 	struct backing_dev_info *s_bdi;
 	struct mtd_info		*s_mtd;
@@ -1321,6 +1313,13 @@ struct super_block {
 
 	/* Being remounted read-only */
 	int s_readonly_remount;
+
+	/*
+	 * Keep the lru lists last in the structure so they always sit on their
+	 * own individual cachelines.
+	 */
+	struct list_lru		s_dentry_lru ____cacheline_aligned_in_smp;
+	struct list_lru		s_inode_lru ____cacheline_aligned_in_smp;
 };
 
 extern struct timespec current_fs_time(struct super_block *sb);
-- 
1.7.10


^ permalink raw reply related	[flat|nested] 57+ messages in thread

* [PATCH 09/19] list_lru: per-node list infrastructure
  2012-11-27 23:14 [RFC, PATCH 00/19] Numa aware LRU lists and shrinkers Dave Chinner
                   ` (7 preceding siblings ...)
  2012-11-27 23:14 ` [PATCH 08/19] dcache: convert to use new lru list infrastructure Dave Chinner
@ 2012-11-27 23:14 ` Dave Chinner
  2012-12-20 11:21   ` Glauber Costa
  2013-01-16 19:21   ` Glauber Costa
  2012-11-27 23:14 ` [PATCH 10/19] shrinker: add node awareness Dave Chinner
                   ` (12 subsequent siblings)
  21 siblings, 2 replies; 57+ messages in thread
From: Dave Chinner @ 2012-11-27 23:14 UTC (permalink / raw)
  To: glommer; +Cc: linux-kernel, linux-fsdevel, linux-mm, xfs

From: Dave Chinner <dchinner@redhat.com>

Now that we have an LRU list API, we can start to enhance the
implementation.  This splits the single LRU list into per-node lists
and locks to enhance scalability. Items are placed on lists
according to the node their memory belongs to. To make scanning the
lists efficient, we also track which per-node lists currently have
entries in an active_nodes nodemask.
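
For reference, callers of the list API are not changed by this split;
items transparently land on the list for the node their memory was
allocated from via page_to_nid(virt_to_page(item)). A minimal sketch
of the usage pattern the dcache and inode cache conversions follow
("struct foo" and the foo_* names are made up for illustration, they
are not part of this series):

	#include <linux/kernel.h>
	#include <linux/list_lru.h>
	#include <linux/spinlock.h>

	/* hypothetical cache object, illustration only */
	struct foo {
		spinlock_t		lock;
		int			count;	/* active references */
		struct list_head	f_lru;	/* linkage onto the list_lru */
	};

	static struct list_lru foo_lru;		/* list_lru_init() once at setup */

	/*
	 * Isolate callback: return 0 when the item has been taken off the
	 * LRU (the walker adjusts the item count), 1 to give it another
	 * pass through the LRU, 2 to skip an item we could not lock
	 * without inverting the lock order.
	 */
	static int foo_lru_isolate(struct list_head *item, spinlock_t *lru_lock,
				   void *arg)
	{
		struct list_head *dispose = arg;
		struct foo *f = container_of(item, struct foo, f_lru);

		if (!spin_trylock(&f->lock))
			return 2;
		if (f->count) {
			spin_unlock(&f->lock);
			return 1;
		}
		list_move(item, dispose);
		spin_unlock(&f->lock);
		return 0;
	}

	static long foo_prune(long nr_to_scan)
	{
		LIST_HEAD(dispose);
		long freed;

		freed = list_lru_walk(&foo_lru, foo_lru_isolate, &dispose,
				      nr_to_scan);
		/* tear down the isolated objects outside the lru lock */
		return freed;
	}

The isolate callback is responsible for taking the object off the LRU
itself (typically by moving it to a private dispose list); the walker
only adjusts the per-node item count.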

Signed-off-by: Dave Chinner <dchinner@redhat.com>
---
 include/linux/list_lru.h |   14 ++--
 lib/list_lru.c           |  160 +++++++++++++++++++++++++++++++++++-----------
 2 files changed, 129 insertions(+), 45 deletions(-)

diff --git a/include/linux/list_lru.h b/include/linux/list_lru.h
index 3423949..b0e3ba2 100644
--- a/include/linux/list_lru.h
+++ b/include/linux/list_lru.h
@@ -8,21 +8,23 @@
 #define _LRU_LIST_H 0
 
 #include <linux/list.h>
+#include <linux/nodemask.h>
 
-struct list_lru {
+struct list_lru_node {
 	spinlock_t		lock;
 	struct list_head	list;
 	long			nr_items;
+} ____cacheline_aligned_in_smp;
+
+struct list_lru {
+	struct list_lru_node	node[MAX_NUMNODES];
+	nodemask_t		active_nodes;
 };
 
 int list_lru_init(struct list_lru *lru);
 int list_lru_add(struct list_lru *lru, struct list_head *item);
 int list_lru_del(struct list_lru *lru, struct list_head *item);
-
-static inline long list_lru_count(struct list_lru *lru)
-{
-	return lru->nr_items;
-}
+long list_lru_count(struct list_lru *lru);
 
 typedef int (*list_lru_walk_cb)(struct list_head *item, spinlock_t *lock,
 				void *cb_arg);
diff --git a/lib/list_lru.c b/lib/list_lru.c
index 475d0e9..881e342 100644
--- a/lib/list_lru.c
+++ b/lib/list_lru.c
@@ -6,6 +6,7 @@
  */
 #include <linux/kernel.h>
 #include <linux/module.h>
+#include <linux/mm.h>
 #include <linux/list_lru.h>
 
 int
@@ -13,14 +14,19 @@ list_lru_add(
 	struct list_lru	*lru,
 	struct list_head *item)
 {
-	spin_lock(&lru->lock);
+	int nid = page_to_nid(virt_to_page(item));
+	struct list_lru_node *nlru = &lru->node[nid];
+
+	spin_lock(&nlru->lock);
+	BUG_ON(nlru->nr_items < 0);
 	if (list_empty(item)) {
-		list_add_tail(item, &lru->list);
-		lru->nr_items++;
-		spin_unlock(&lru->lock);
+		list_add_tail(item, &nlru->list);
+		if (nlru->nr_items++ == 0)
+			node_set(nid, lru->active_nodes);
+		spin_unlock(&nlru->lock);
 		return 1;
 	}
-	spin_unlock(&lru->lock);
+	spin_unlock(&nlru->lock);
 	return 0;
 }
 EXPORT_SYMBOL_GPL(list_lru_add);
@@ -30,43 +36,72 @@ list_lru_del(
 	struct list_lru	*lru,
 	struct list_head *item)
 {
-	spin_lock(&lru->lock);
+	int nid = page_to_nid(virt_to_page(item));
+	struct list_lru_node *nlru = &lru->node[nid];
+
+	spin_lock(&nlru->lock);
 	if (!list_empty(item)) {
 		list_del_init(item);
-		lru->nr_items--;
-		spin_unlock(&lru->lock);
+		if (--nlru->nr_items == 0)
+			node_clear(nid, lru->active_nodes);
+		BUG_ON(nlru->nr_items < 0);
+		spin_unlock(&nlru->lock);
 		return 1;
 	}
-	spin_unlock(&lru->lock);
+	spin_unlock(&nlru->lock);
 	return 0;
 }
 EXPORT_SYMBOL_GPL(list_lru_del);
 
 long
-list_lru_walk(
-	struct list_lru *lru,
-	list_lru_walk_cb isolate,
-	void		*cb_arg,
-	long		nr_to_walk)
+list_lru_count(
+	struct list_lru *lru)
 {
+	long count = 0;
+	int nid;
+
+	for_each_node_mask(nid, lru->active_nodes) {
+		struct list_lru_node *nlru = &lru->node[nid];
+
+		spin_lock(&nlru->lock);
+		BUG_ON(nlru->nr_items < 0);
+		count += nlru->nr_items;
+		spin_unlock(&nlru->lock);
+	}
+
+	return count;
+}
+EXPORT_SYMBOL_GPL(list_lru_count);
+
+static long
+list_lru_walk_node(
+	struct list_lru		*lru,
+	int			nid,
+	list_lru_walk_cb	isolate,
+	void			*cb_arg,
+	long			*nr_to_walk)
+{
+	struct list_lru_node	*nlru = &lru->node[nid];
 	struct list_head *item, *n;
-	long removed = 0;
+	long isolated = 0;
 restart:
-	spin_lock(&lru->lock);
-	list_for_each_safe(item, n, &lru->list) {
+	spin_lock(&nlru->lock);
+	list_for_each_safe(item, n, &nlru->list) {
 		int ret;
 
-		if (nr_to_walk-- < 0)
+		if ((*nr_to_walk)-- < 0)
 			break;
 
-		ret = isolate(item, &lru->lock, cb_arg);
+		ret = isolate(item, &nlru->lock, cb_arg);
 		switch (ret) {
 		case 0:	/* item removed from list */
-			lru->nr_items--;
-			removed++;
+			if (--nlru->nr_items == 0)
+				node_clear(nid, lru->active_nodes);
+			BUG_ON(nlru->nr_items < 0);
+			isolated++;
 			break;
 		case 1: /* item referenced, give another pass */
-			list_move_tail(item, &lru->list);
+			list_move_tail(item, &nlru->list);
 			break;
 		case 2: /* item cannot be locked, skip */
 			break;
@@ -76,42 +111,89 @@ restart:
 			BUG();
 		}
 	}
-	spin_unlock(&lru->lock);
-	return removed;
+	spin_unlock(&nlru->lock);
+	return isolated;
+}
+
+long
+list_lru_walk(
+	struct list_lru	*lru,
+	list_lru_walk_cb isolate,
+	void		*cb_arg,
+	long		nr_to_walk)
+{
+	long isolated = 0;
+	int nid;
+
+	for_each_node_mask(nid, lru->active_nodes) {
+		isolated += list_lru_walk_node(lru, nid, isolate,
+					       cb_arg, &nr_to_walk);
+		if (nr_to_walk <= 0)
+			break;
+	}
+	return isolated;
 }
 EXPORT_SYMBOL_GPL(list_lru_walk);
 
 long
-list_lru_dispose_all(
-	struct list_lru *lru,
-	list_lru_dispose_cb dispose)
+list_lru_dispose_all_node(
+	struct list_lru		*lru,
+	int			nid,
+	list_lru_dispose_cb	dispose)
 {
-	long disposed = 0;
+	struct list_lru_node	*nlru = &lru->node[nid];
 	LIST_HEAD(dispose_list);
+	long disposed = 0;
 
-	spin_lock(&lru->lock);
-	while (!list_empty(&lru->list)) {
-		list_splice_init(&lru->list, &dispose_list);
-		disposed += lru->nr_items;
-		lru->nr_items = 0;
-		spin_unlock(&lru->lock);
+	spin_lock(&nlru->lock);
+	while (!list_empty(&nlru->list)) {
+		list_splice_init(&nlru->list, &dispose_list);
+		disposed += nlru->nr_items;
+		nlru->nr_items = 0;
+		node_clear(nid, lru->active_nodes);
+		spin_unlock(&nlru->lock);
 
 		dispose(&dispose_list);
 
-		spin_lock(&lru->lock);
+		spin_lock(&nlru->lock);
 	}
-	spin_unlock(&lru->lock);
+	spin_unlock(&nlru->lock);
 	return disposed;
 }
 
+long
+list_lru_dispose_all(
+	struct list_lru		*lru,
+	list_lru_dispose_cb	dispose)
+{
+	long disposed;
+	long total = 0;
+	int nid;
+
+	do {
+		disposed = 0;
+		for_each_node_mask(nid, lru->active_nodes) {
+			disposed += list_lru_dispose_all_node(lru, nid,
+							      dispose);
+		}
+		total += disposed;
+	} while (disposed != 0);
+
+	return total;
+}
+
 int
 list_lru_init(
 	struct list_lru	*lru)
 {
-	spin_lock_init(&lru->lock);
-	INIT_LIST_HEAD(&lru->list);
-	lru->nr_items = 0;
+	int i;
 
+	nodes_clear(lru->active_nodes);
+	for (i = 0; i < MAX_NUMNODES; i++) {
+		spin_lock_init(&lru->node[i].lock);
+		INIT_LIST_HEAD(&lru->node[i].list);
+		lru->node[i].nr_items = 0;
+	}
 	return 0;
 }
 EXPORT_SYMBOL_GPL(list_lru_init);
-- 
1.7.10


^ permalink raw reply related	[flat|nested] 57+ messages in thread

* [PATCH 10/19] shrinker: add node awareness
  2012-11-27 23:14 [RFC, PATCH 00/19] Numa aware LRU lists and shrinkers Dave Chinner
                   ` (8 preceding siblings ...)
  2012-11-27 23:14 ` [PATCH 09/19] list_lru: per-node " Dave Chinner
@ 2012-11-27 23:14 ` Dave Chinner
  2012-11-27 23:14 ` [PATCH 11/19] fs: convert inode and dentry shrinking to be node aware Dave Chinner
                   ` (11 subsequent siblings)
  21 siblings, 0 replies; 57+ messages in thread
From: Dave Chinner @ 2012-11-27 23:14 UTC (permalink / raw)
  To: glommer; +Cc: linux-kernel, linux-fsdevel, linux-mm, xfs

From: Dave Chinner <dchinner@redhat.com>

Pass the node of the current zone being reclaimed to shrink_slab(),
allowing the shrinker control nodemask to be set appropriately for
node aware shrinkers.
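
The hunks below apply the same few lines in drop_caches, memory-failure
and vmscan. Condensed into a single helper (a sketch only; the helper
name is made up and not added by this patch), the caller side is:

	#include <linux/mm.h>
	#include <linux/nodemask.h>
	#include <linux/shrinker.h>

	/* illustrative helper: push slab reclaim at a single node's caches */
	static unsigned long shrink_slab_on_node(int nid,
						 unsigned long nr_scanned,
						 unsigned long lru_pages)
	{
		struct shrink_control shrink = {
			.gfp_mask = GFP_KERNEL,
		};

		nodes_clear(shrink.nodes_to_scan);
		node_set(nid, shrink.nodes_to_scan);

		return shrink_slab(&shrink, nr_scanned, lru_pages);
	}

Shrinkers that don't look at sc->nodes_to_scan keep their existing
global behaviour.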

Signed-off-by: Dave Chinner <dchinner@redhat.com>
---
 fs/drop_caches.c         |    1 +
 include/linux/shrinker.h |    3 +++
 mm/memory-failure.c      |    2 ++
 mm/vmscan.c              |   12 +++++++++---
 4 files changed, 15 insertions(+), 3 deletions(-)

diff --git a/fs/drop_caches.c b/fs/drop_caches.c
index c00e055..9fd702f 100644
--- a/fs/drop_caches.c
+++ b/fs/drop_caches.c
@@ -44,6 +44,7 @@ static void drop_slab(void)
 		.gfp_mask = GFP_KERNEL,
 	};
 
+	nodes_setall(shrink.nodes_to_scan);
 	do {
 		nr_objects = shrink_slab(&shrink, 1000, 1000);
 	} while (nr_objects > 10);
diff --git a/include/linux/shrinker.h b/include/linux/shrinker.h
index 4f59615..e71286f 100644
--- a/include/linux/shrinker.h
+++ b/include/linux/shrinker.h
@@ -16,6 +16,9 @@ struct shrink_control {
 
 	/* How many slab objects shrinker() should scan and try to reclaim */
 	long nr_to_scan;
+
+	/* shrink from these nodes */
+	nodemask_t nodes_to_scan;
 };
 
 /*
diff --git a/mm/memory-failure.c b/mm/memory-failure.c
index 6c5899b..7bcbde4 100644
--- a/mm/memory-failure.c
+++ b/mm/memory-failure.c
@@ -248,10 +248,12 @@ void shake_page(struct page *p, int access)
 	 */
 	if (access) {
 		int nr;
+		int nid = page_to_nid(p);
 		do {
 			struct shrink_control shrink = {
 				.gfp_mask = GFP_KERNEL,
 			};
+			node_set(nid, shrink.nodes_to_scan);
 
 			nr = shrink_slab(&shrink, 1000, 1000);
 			if (page_count(p) == 1)
diff --git a/mm/vmscan.c b/mm/vmscan.c
index 55c4fc9..4a602ec 100644
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -2122,15 +2122,20 @@ static unsigned long do_try_to_free_pages(struct zonelist *zonelist,
 		 */
 		if (global_reclaim(sc)) {
 			unsigned long lru_pages = 0;
+
+			nodes_clear(shrink->nodes_to_scan);
 			for_each_zone_zonelist(zone, z, zonelist,
 					gfp_zone(sc->gfp_mask)) {
 				if (!cpuset_zone_allowed_hardwall(zone, GFP_KERNEL))
 					continue;
 
 				lru_pages += zone_reclaimable_pages(zone);
+				node_set(zone_to_nid(zone),
+					 shrink->nodes_to_scan);
 			}
 
 			shrink_slab(shrink, sc->nr_scanned, lru_pages);
+
 			if (reclaim_state) {
 				sc->nr_reclaimed += reclaim_state->reclaimed_slab;
 				reclaim_state->reclaimed_slab = 0;
@@ -2682,6 +2687,8 @@ loop_again:
 				shrink_zone(zone, &sc);
 
 				reclaim_state->reclaimed_slab = 0;
+				nodes_clear(shrink.nodes_to_scan);
+				node_set(zone_to_nid(zone), shrink.nodes_to_scan);
 				nr_slab = shrink_slab(&shrink, sc.nr_scanned, lru_pages);
 				sc.nr_reclaimed += reclaim_state->reclaimed_slab;
 				total_scanned += sc.nr_scanned;
@@ -3318,10 +3325,9 @@ static int __zone_reclaim(struct zone *zone, gfp_t gfp_mask, unsigned int order)
 		 * number of slab pages and shake the slab until it is reduced
 		 * by the same nr_pages that we used for reclaiming unmapped
 		 * pages.
-		 *
-		 * Note that shrink_slab will free memory on all zones and may
-		 * take a long time.
 		 */
+		nodes_clear(shrink.nodes_to_scan);
+		node_set(zone_to_nid(zone), shrink.nodes_to_scan);
 		for (;;) {
 			unsigned long lru_pages = zone_reclaimable_pages(zone);
 
-- 
1.7.10


^ permalink raw reply related	[flat|nested] 57+ messages in thread

* [PATCH 11/19] fs: convert inode and dentry shrinking to be node aware
  2012-11-27 23:14 [RFC, PATCH 00/19] Numa aware LRU lists and shrinkers Dave Chinner
                   ` (9 preceding siblings ...)
  2012-11-27 23:14 ` [PATCH 10/19] shrinker: add node awareness Dave Chinner
@ 2012-11-27 23:14 ` Dave Chinner
  2012-11-27 23:14 ` [PATCH 12/19] xfs: convert buftarg LRU to generic code Dave Chinner
                   ` (10 subsequent siblings)
  21 siblings, 0 replies; 57+ messages in thread
From: Dave Chinner @ 2012-11-27 23:14 UTC (permalink / raw)
  To: glommer; +Cc: linux-kernel, linux-fsdevel, linux-mm, xfs

From: Dave Chinner <dchinner@redhat.com>

Now that the shrinker is passing a nodemask in the scan control
structure, we can pass this to the generic LRU list code to
isolate reclaim to the lists on matching nodes.

This requires a small amount of refactoring of the LRU list API,
which might be best split out into a separate patch.
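
Put together with the count/scan shrinker API introduced earlier in
the series, a node-aware shrinker backed by a list_lru now has roughly
the following shape (a sketch with made-up foo_* names; the real
conversions are in the hunks below):

	#include <linux/list_lru.h>
	#include <linux/shrinker.h>

	/* all foo_* names are illustrative, not part of this series */
	extern struct list_lru foo_lru;
	extern int foo_lru_isolate(struct list_head *item, spinlock_t *lru_lock,
				   void *arg);
	extern void foo_dispose_list(struct list_head *list);

	static long foo_shrink_count(struct shrinker *shrink,
				     struct shrink_control *sc)
	{
		/* only count objects resident on the nodes being reclaimed */
		return list_lru_count_nodemask(&foo_lru, &sc->nodes_to_scan);
	}

	static long foo_shrink_scan(struct shrinker *shrink,
				    struct shrink_control *sc)
	{
		LIST_HEAD(dispose);
		long freed;

		freed = list_lru_walk_nodemask(&foo_lru, foo_lru_isolate,
					       &dispose, sc->nr_to_scan,
					       &sc->nodes_to_scan);
		foo_dispose_list(&dispose);
		return freed;
	}

	static struct shrinker foo_shrinker = {
		.count_objects	= foo_shrink_count,
		.scan_objects	= foo_shrink_scan,
		.seeks		= DEFAULT_SEEKS,
	};

register_shrinker(&foo_shrinker) wires it up as before; only the count
and scan callbacks need to look at the nodemask.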

Signed-off-by: Dave Chinner <dchinner@redhat.com>
---
 fs/dcache.c              |    7 ++++---
 fs/inode.c               |    7 ++++---
 fs/internal.h            |    6 ++++--
 fs/super.c               |   22 +++++++++++++---------
 fs/xfs/xfs_super.c       |    6 ++++--
 include/linux/fs.h       |    4 ++--
 include/linux/list_lru.h |   19 ++++++++++++++++---
 lib/list_lru.c           |   18 ++++++++++--------
 8 files changed, 57 insertions(+), 32 deletions(-)

diff --git a/fs/dcache.c b/fs/dcache.c
index d72e388..7f107fb 100644
--- a/fs/dcache.c
+++ b/fs/dcache.c
@@ -907,13 +907,14 @@ static int dentry_lru_isolate(struct list_head *item, spinlock_t *lru_lock,
  * This function may fail to free any resources if all the dentries are in
  * use.
  */
-long prune_dcache_sb(struct super_block *sb, long nr_to_scan)
+long prune_dcache_sb(struct super_block *sb, long nr_to_scan,
+		     nodemask_t *nodes_to_walk)
 {
 	LIST_HEAD(dispose);
 	long freed;
 
-	freed = list_lru_walk(&sb->s_dentry_lru, dentry_lru_isolate,
-			      &dispose, nr_to_scan);
+	freed = list_lru_walk_nodemask(&sb->s_dentry_lru, dentry_lru_isolate,
+				       &dispose, nr_to_scan, nodes_to_walk);
 	shrink_dentry_list(&dispose);
 	return freed;
 }
diff --git a/fs/inode.c b/fs/inode.c
index 2662305..3857f9f 100644
--- a/fs/inode.c
+++ b/fs/inode.c
@@ -731,13 +731,14 @@ static int inode_lru_isolate(struct list_head *item, spinlock_t *lru_lock,
  * to trim from the LRU. Inodes to be freed are moved to a temporary list and
  * then are freed outside inode_lock by dispose_list().
  */
-long prune_icache_sb(struct super_block *sb, long nr_to_scan)
+long prune_icache_sb(struct super_block *sb, long nr_to_scan,
+		     nodemask_t *nodes_to_walk)
 {
 	LIST_HEAD(freeable);
 	long freed;
 
-	freed = list_lru_walk(&sb->s_inode_lru, inode_lru_isolate,
-						&freeable, nr_to_scan);
+	freed = list_lru_walk_nodemask(&sb->s_inode_lru, inode_lru_isolate,
+				       &freeable, nr_to_scan, nodes_to_walk);
 	dispose_list(&freeable);
 	return freed;
 }
diff --git a/fs/internal.h b/fs/internal.h
index 7d7908b..95c4e9b 100644
--- a/fs/internal.h
+++ b/fs/internal.h
@@ -110,7 +110,8 @@ extern int open_check_o_direct(struct file *f);
  * inode.c
  */
 extern spinlock_t inode_sb_list_lock;
-extern long prune_icache_sb(struct super_block *sb, long nr_to_scan);
+extern long prune_icache_sb(struct super_block *sb, long nr_to_scan,
+			    nodemask_t *nodes_to_scan);
 
 
 /*
@@ -126,4 +127,5 @@ extern int invalidate_inodes(struct super_block *, bool);
  * dcache.c
  */
 extern struct dentry *__d_alloc(struct super_block *, const struct qstr *);
-extern long prune_dcache_sb(struct super_block *sb, long nr_to_scan);
+extern long prune_dcache_sb(struct super_block *sb, long nr_to_scan,
+			    nodemask_t *nodes_to_scan);
diff --git a/fs/super.c b/fs/super.c
index b1d24ef..3c975b1 100644
--- a/fs/super.c
+++ b/fs/super.c
@@ -75,10 +75,10 @@ static long super_cache_scan(struct shrinker *shrink, struct shrink_control *sc)
 		return -1;
 
 	if (sb->s_op && sb->s_op->nr_cached_objects)
-		fs_objects = sb->s_op->nr_cached_objects(sb);
+		fs_objects = sb->s_op->nr_cached_objects(sb, &sc->nodes_to_scan);
 
-	inodes = list_lru_count(&sb->s_inode_lru);
-	dentries = list_lru_count(&sb->s_dentry_lru);
+	inodes = list_lru_count_nodemask(&sb->s_inode_lru, &sc->nodes_to_scan);
+	dentries = list_lru_count_nodemask(&sb->s_dentry_lru, &sc->nodes_to_scan);
 	total_objects = dentries + inodes + fs_objects + 1;
 
 	/* proportion the scan between the caches */
@@ -89,12 +89,13 @@ static long super_cache_scan(struct shrinker *shrink, struct shrink_control *sc)
 	 * prune the dcache first as the icache is pinned by it, then
 	 * prune the icache, followed by the filesystem specific caches
 	 */
-	freed = prune_dcache_sb(sb, dentries);
-	freed += prune_icache_sb(sb, inodes);
+	freed = prune_dcache_sb(sb, dentries, &sc->nodes_to_scan);
+	freed += prune_icache_sb(sb, inodes, &sc->nodes_to_scan);
 
 	if (fs_objects) {
 		fs_objects = (sc->nr_to_scan * fs_objects) / total_objects;
-		freed += sb->s_op->free_cached_objects(sb, fs_objects);
+		freed += sb->s_op->free_cached_objects(sb, fs_objects,
+						       &sc->nodes_to_scan);
 	}
 
 	drop_super(sb);
@@ -112,10 +113,13 @@ static long super_cache_count(struct shrinker *shrink, struct shrink_control *sc
 		return -1;
 
 	if (sb->s_op && sb->s_op->nr_cached_objects)
-		total_objects = sb->s_op->nr_cached_objects(sb);
+		total_objects = sb->s_op->nr_cached_objects(sb,
+						 &sc->nodes_to_scan);
 
-	total_objects += list_lru_count(&sb->s_dentry_lru);
-	total_objects += list_lru_count(&sb->s_inode_lru);
+	total_objects += list_lru_count_nodemask(&sb->s_dentry_lru,
+						 &sc->nodes_to_scan);
+	total_objects += list_lru_count_nodemask(&sb->s_inode_lru,
+						 &sc->nodes_to_scan);
 
 	total_objects = (total_objects / 100) * sysctl_vfs_cache_pressure;
 	drop_super(sb);
diff --git a/fs/xfs/xfs_super.c b/fs/xfs/xfs_super.c
index 00aa61d..33d67d5 100644
--- a/fs/xfs/xfs_super.c
+++ b/fs/xfs/xfs_super.c
@@ -1516,7 +1516,8 @@ xfs_fs_mount(
 
 static long
 xfs_fs_nr_cached_objects(
-	struct super_block	*sb)
+	struct super_block	*sb,
+	nodemask_t		*nodes_to_count)
 {
 	return xfs_reclaim_inodes_count(XFS_M(sb));
 }
@@ -1524,7 +1525,8 @@ xfs_fs_nr_cached_objects(
 static long
 xfs_fs_free_cached_objects(
 	struct super_block	*sb,
-	long			nr_to_scan)
+	long			nr_to_scan,
+	nodemask_t		*nodes_to_scan)
 {
 	return xfs_reclaim_inodes_nr(XFS_M(sb), nr_to_scan);
 }
diff --git a/include/linux/fs.h b/include/linux/fs.h
index befa46f..fda4ee2 100644
--- a/include/linux/fs.h
+++ b/include/linux/fs.h
@@ -1614,8 +1614,8 @@ struct super_operations {
 	ssize_t (*quota_write)(struct super_block *, int, const char *, size_t, loff_t);
 #endif
 	int (*bdev_try_to_free_page)(struct super_block*, struct page*, gfp_t);
-	long (*nr_cached_objects)(struct super_block *);
-	long (*free_cached_objects)(struct super_block *, long);
+	long (*nr_cached_objects)(struct super_block *, nodemask_t *);
+	long (*free_cached_objects)(struct super_block *, long, nodemask_t *);
 };
 
 /*
diff --git a/include/linux/list_lru.h b/include/linux/list_lru.h
index b0e3ba2..02796da 100644
--- a/include/linux/list_lru.h
+++ b/include/linux/list_lru.h
@@ -24,14 +24,27 @@ struct list_lru {
 int list_lru_init(struct list_lru *lru);
 int list_lru_add(struct list_lru *lru, struct list_head *item);
 int list_lru_del(struct list_lru *lru, struct list_head *item);
-long list_lru_count(struct list_lru *lru);
+long list_lru_count_nodemask(struct list_lru *lru, nodemask_t *nodes_to_count);
+
+static inline long list_lru_count(struct list_lru *lru)
+{
+	return list_lru_count_nodemask(lru, &lru->active_nodes);
+}
+
 
 typedef int (*list_lru_walk_cb)(struct list_head *item, spinlock_t *lock,
 				void *cb_arg);
 typedef void (*list_lru_dispose_cb)(struct list_head *dispose_list);
 
-long list_lru_walk(struct list_lru *lru, list_lru_walk_cb isolate,
-		   void *cb_arg, long nr_to_walk);
+long list_lru_walk_nodemask(struct list_lru *lru, list_lru_walk_cb isolate,
+		   void *cb_arg, long nr_to_walk, nodemask_t *nodes_to_walk);
+
+static inline long list_lru_walk(struct list_lru *lru, list_lru_walk_cb isolate,
+				 void *cb_arg, long nr_to_walk)
+{
+	return list_lru_walk_nodemask(lru, isolate, cb_arg, nr_to_walk,
+				      &lru->active_nodes);
+}
 
 long list_lru_dispose_all(struct list_lru *lru, list_lru_dispose_cb dispose);
 
diff --git a/lib/list_lru.c b/lib/list_lru.c
index 881e342..0f08ed6 100644
--- a/lib/list_lru.c
+++ b/lib/list_lru.c
@@ -54,13 +54,14 @@ list_lru_del(
 EXPORT_SYMBOL_GPL(list_lru_del);
 
 long
-list_lru_count(
-	struct list_lru *lru)
+list_lru_count_nodemask(
+	struct list_lru *lru,
+	nodemask_t	*nodes_to_count)
 {
 	long count = 0;
 	int nid;
 
-	for_each_node_mask(nid, lru->active_nodes) {
+	for_each_node_mask(nid, *nodes_to_count) {
 		struct list_lru_node *nlru = &lru->node[nid];
 
 		spin_lock(&nlru->lock);
@@ -71,7 +72,7 @@ list_lru_count(
 
 	return count;
 }
-EXPORT_SYMBOL_GPL(list_lru_count);
+EXPORT_SYMBOL_GPL(list_lru_count_nodemask);
 
 static long
 list_lru_walk_node(
@@ -116,16 +117,17 @@ restart:
 }
 
 long
-list_lru_walk(
+list_lru_walk_nodemask(
 	struct list_lru	*lru,
 	list_lru_walk_cb isolate,
 	void		*cb_arg,
-	long		nr_to_walk)
+	long		nr_to_walk,
+	nodemask_t	*nodes_to_walk)
 {
 	long isolated = 0;
 	int nid;
 
-	for_each_node_mask(nid, lru->active_nodes) {
+	for_each_node_mask(nid, *nodes_to_walk) {
 		isolated += list_lru_walk_node(lru, nid, isolate,
 					       cb_arg, &nr_to_walk);
 		if (nr_to_walk <= 0)
@@ -133,7 +135,7 @@ list_lru_walk(
 	}
 	return isolated;
 }
-EXPORT_SYMBOL_GPL(list_lru_walk);
+EXPORT_SYMBOL_GPL(list_lru_walk_nodemask);
 
 long
 list_lru_dispose_all_node(
-- 
1.7.10


^ permalink raw reply related	[flat|nested] 57+ messages in thread

* [PATCH 12/19] xfs: convert buftarg LRU to generic code
  2012-11-27 23:14 [RFC, PATCH 00/19] Numa aware LRU lists and shrinkers Dave Chinner
                   ` (10 preceding siblings ...)
  2012-11-27 23:14 ` [PATCH 11/19] fs: convert inode and dentry shrinking to be node aware Dave Chinner
@ 2012-11-27 23:14 ` Dave Chinner
  2012-11-27 23:14 ` [PATCH 13/19] xfs: Node aware direct inode reclaim Dave Chinner
                   ` (9 subsequent siblings)
  21 siblings, 0 replies; 57+ messages in thread
From: Dave Chinner @ 2012-11-27 23:14 UTC (permalink / raw)
  To: glommer; +Cc: linux-kernel, linux-fsdevel, linux-mm, xfs

From: Dave Chinner <dchinner@redhat.com>

Convert the buftarg LRU to use the new generic LRU list and take
advantage of the functionality it supplies to make the buffer cache
shrinker node aware.

Signed-off-by: Dave Chinner <dchinner@redhat.com>
---
 fs/xfs/xfs_buf.c |  162 +++++++++++++++++++++++++-----------------------------
 fs/xfs/xfs_buf.h |    5 +-
 2 files changed, 76 insertions(+), 91 deletions(-)

diff --git a/fs/xfs/xfs_buf.c b/fs/xfs/xfs_buf.c
index a80195b..1011c59 100644
--- a/fs/xfs/xfs_buf.c
+++ b/fs/xfs/xfs_buf.c
@@ -85,20 +85,14 @@ xfs_buf_vmap_len(
  * The LRU takes a new reference to the buffer so that it will only be freed
  * once the shrinker takes the buffer off the LRU.
  */
-STATIC void
+static void
 xfs_buf_lru_add(
 	struct xfs_buf	*bp)
 {
-	struct xfs_buftarg *btp = bp->b_target;
-
-	spin_lock(&btp->bt_lru_lock);
-	if (list_empty(&bp->b_lru)) {
-		atomic_inc(&bp->b_hold);
-		list_add_tail(&bp->b_lru, &btp->bt_lru);
-		btp->bt_lru_nr++;
+	if (list_lru_add(&bp->b_target->bt_lru, &bp->b_lru)) {
 		bp->b_lru_flags &= ~_XBF_LRU_DISPOSE;
+		atomic_inc(&bp->b_hold);
 	}
-	spin_unlock(&btp->bt_lru_lock);
 }
 
 /*
@@ -107,24 +101,13 @@ xfs_buf_lru_add(
  * The unlocked check is safe here because it only occurs when there are not
  * b_lru_ref counts left on the inode under the pag->pag_buf_lock. it is there
  * to optimise the shrinker removing the buffer from the LRU and calling
- * xfs_buf_free(). i.e. it removes an unnecessary round trip on the
- * bt_lru_lock.
+ * xfs_buf_free().
  */
-STATIC void
+static void
 xfs_buf_lru_del(
 	struct xfs_buf	*bp)
 {
-	struct xfs_buftarg *btp = bp->b_target;
-
-	if (list_empty(&bp->b_lru))
-		return;
-
-	spin_lock(&btp->bt_lru_lock);
-	if (!list_empty(&bp->b_lru)) {
-		list_del_init(&bp->b_lru);
-		btp->bt_lru_nr--;
-	}
-	spin_unlock(&btp->bt_lru_lock);
+	list_lru_del(&bp->b_target->bt_lru, &bp->b_lru);
 }
 
 /*
@@ -151,18 +134,10 @@ xfs_buf_stale(
 	bp->b_flags &= ~_XBF_DELWRI_Q;
 
 	atomic_set(&(bp)->b_lru_ref, 0);
-	if (!list_empty(&bp->b_lru)) {
-		struct xfs_buftarg *btp = bp->b_target;
-
-		spin_lock(&btp->bt_lru_lock);
-		if (!list_empty(&bp->b_lru) &&
-		    !(bp->b_lru_flags & _XBF_LRU_DISPOSE)) {
-			list_del_init(&bp->b_lru);
-			btp->bt_lru_nr--;
-			atomic_dec(&bp->b_hold);
-		}
-		spin_unlock(&btp->bt_lru_lock);
-	}
+	if (!(bp->b_lru_flags & _XBF_LRU_DISPOSE) &&
+	    (list_lru_del(&bp->b_target->bt_lru, &bp->b_lru)))
+		atomic_dec(&bp->b_hold);
+
 	ASSERT(atomic_read(&bp->b_hold) >= 1);
 }
 
@@ -1476,81 +1451,92 @@ xfs_buf_iomove(
  * returned. These buffers will have an elevated hold count, so wait on those
  * while freeing all the buffers only held by the LRU.
  */
-void
-xfs_wait_buftarg(
-	struct xfs_buftarg	*btp)
+static int
+xfs_buftarg_wait_rele(
+	struct list_head	*item,
+	spinlock_t		*lru_lock,
+	void			*arg)
 {
-	struct xfs_buf		*bp;
+	struct xfs_buf		*bp = container_of(item, struct xfs_buf, b_lru);
 
-restart:
-	spin_lock(&btp->bt_lru_lock);
-	while (!list_empty(&btp->bt_lru)) {
-		bp = list_first_entry(&btp->bt_lru, struct xfs_buf, b_lru);
-		if (atomic_read(&bp->b_hold) > 1) {
-			spin_unlock(&btp->bt_lru_lock);
-			delay(100);
-			goto restart;
-		}
+	if (atomic_read(&bp->b_hold) > 1) {
+		/* need to wait */
+		spin_unlock(lru_lock);
+		delay(100);
+	} else {
 		/*
 		 * clear the LRU reference count so the buffer doesn't get
 		 * ignored in xfs_buf_rele().
 		 */
 		atomic_set(&bp->b_lru_ref, 0);
-		spin_unlock(&btp->bt_lru_lock);
+		spin_unlock(lru_lock);
 		xfs_buf_rele(bp);
-		spin_lock(&btp->bt_lru_lock);
 	}
-	spin_unlock(&btp->bt_lru_lock);
+	return 3;
 }
 
-int
-xfs_buftarg_shrink(
+void
+xfs_wait_buftarg(
+	struct xfs_buftarg	*btp)
+{
+	while (list_lru_count(&btp->bt_lru))
+		list_lru_walk(&btp->bt_lru, xfs_buftarg_wait_rele,
+			      NULL, LONG_MAX);
+}
+
+static int
+xfs_buftarg_isolate(
+	struct list_head	*item,
+	spinlock_t		*lru_lock,
+	void			*arg)
+{
+	struct xfs_buf		*bp = container_of(item, struct xfs_buf, b_lru);
+	struct list_head	*dispose = arg;
+
+	/*
+	 * Decrement the b_lru_ref count unless the value is already
+	 * zero. If the value is already zero, we need to reclaim the
+	 * buffer, otherwise it gets another trip through the LRU.
+	 */
+	if (!atomic_add_unless(&bp->b_lru_ref, -1, 0))
+		return 1;
+
+	bp->b_lru_flags |= _XBF_LRU_DISPOSE;
+	list_move(item, dispose);
+	return 0;
+}
+
+static long
+xfs_buftarg_shrink_scan(
 	struct shrinker		*shrink,
 	struct shrink_control	*sc)
 {
 	struct xfs_buftarg	*btp = container_of(shrink,
 					struct xfs_buftarg, bt_shrinker);
-	struct xfs_buf		*bp;
-	int nr_to_scan = sc->nr_to_scan;
 	LIST_HEAD(dispose);
+	long			freed;
 
-	if (!nr_to_scan)
-		return btp->bt_lru_nr;
-
-	spin_lock(&btp->bt_lru_lock);
-	while (!list_empty(&btp->bt_lru)) {
-		if (nr_to_scan-- <= 0)
-			break;
-
-		bp = list_first_entry(&btp->bt_lru, struct xfs_buf, b_lru);
-
-		/*
-		 * Decrement the b_lru_ref count unless the value is already
-		 * zero. If the value is already zero, we need to reclaim the
-		 * buffer, otherwise it gets another trip through the LRU.
-		 */
-		if (!atomic_add_unless(&bp->b_lru_ref, -1, 0)) {
-			list_move_tail(&bp->b_lru, &btp->bt_lru);
-			continue;
-		}
-
-		/*
-		 * remove the buffer from the LRU now to avoid needing another
-		 * lock round trip inside xfs_buf_rele().
-		 */
-		list_move(&bp->b_lru, &dispose);
-		btp->bt_lru_nr--;
-		bp->b_lru_flags |= _XBF_LRU_DISPOSE;
-	}
-	spin_unlock(&btp->bt_lru_lock);
+	freed = list_lru_walk_nodemask(&btp->bt_lru, xfs_buftarg_isolate,
+				       &dispose, sc->nr_to_scan,
+				       &sc->nodes_to_scan);
 
 	while (!list_empty(&dispose)) {
+		struct xfs_buf *bp;
 		bp = list_first_entry(&dispose, struct xfs_buf, b_lru);
 		list_del_init(&bp->b_lru);
 		xfs_buf_rele(bp);
 	}
+	return freed;
+}
+static long
+xfs_buftarg_shrink_count(
+	struct shrinker		*shrink,
+	struct shrink_control	*sc)
+{
+	struct xfs_buftarg	*btp = container_of(shrink,
+					struct xfs_buftarg, bt_shrinker);
 
-	return btp->bt_lru_nr;
+	return list_lru_count_nodemask(&btp->bt_lru, &sc->nodes_to_scan);
 }
 
 void
@@ -1632,11 +1618,11 @@ xfs_alloc_buftarg(
 	if (!btp->bt_bdi)
 		goto error;
 
-	INIT_LIST_HEAD(&btp->bt_lru);
-	spin_lock_init(&btp->bt_lru_lock);
+	list_lru_init(&btp->bt_lru);
 	if (xfs_setsize_buftarg_early(btp, bdev))
 		goto error;
-	btp->bt_shrinker.shrink = xfs_buftarg_shrink;
+	btp->bt_shrinker.count_objects = xfs_buftarg_shrink_count;
+	btp->bt_shrinker.scan_objects = xfs_buftarg_shrink_scan;
 	btp->bt_shrinker.seeks = DEFAULT_SEEKS;
 	register_shrinker(&btp->bt_shrinker);
 	return btp;
diff --git a/fs/xfs/xfs_buf.h b/fs/xfs/xfs_buf.h
index e541c0a..54113a1 100644
--- a/fs/xfs/xfs_buf.h
+++ b/fs/xfs/xfs_buf.h
@@ -25,6 +25,7 @@
 #include <linux/fs.h>
 #include <linux/buffer_head.h>
 #include <linux/uio.h>
+#include <linux/list_lru.h>
 
 /*
  *	Base types
@@ -92,9 +93,7 @@ typedef struct xfs_buftarg {
 
 	/* LRU control structures */
 	struct shrinker		bt_shrinker;
-	struct list_head	bt_lru;
-	spinlock_t		bt_lru_lock;
-	unsigned int		bt_lru_nr;
+	struct list_lru		bt_lru;
 } xfs_buftarg_t;
 
 struct xfs_buf;
-- 
1.7.10


^ permalink raw reply related	[flat|nested] 57+ messages in thread

* [PATCH 13/19] xfs: Node aware direct inode reclaim
  2012-11-27 23:14 [RFC, PATCH 00/19] Numa aware LRU lists and shrinkers Dave Chinner
                   ` (11 preceding siblings ...)
  2012-11-27 23:14 ` [PATCH 12/19] xfs: convert buftarg LRU to generic code Dave Chinner
@ 2012-11-27 23:14 ` Dave Chinner
  2012-11-27 23:14 ` [PATCH 14/19] xfs: use generic AG walk for background " Dave Chinner
                   ` (8 subsequent siblings)
  21 siblings, 0 replies; 57+ messages in thread
From: Dave Chinner @ 2012-11-27 23:14 UTC (permalink / raw)
  To: glommer; +Cc: linux-kernel, linux-fsdevel, linux-mm, xfs

From: Dave Chinner <dchinner@redhat.com>

XFS currently only tracks inodes for reclaim via tag bits in the
inode cache radix tree. While this is awesome for background reclaim
because it allows inodes to be reclaimed in ascending disk offset
order, it sucks for direct memory reclaim which really is trying to
free the oldest inodes from memory.

As such, the direct reclaim code is a bit of a mess. It has all
sorts of heuristics code to try to avoid dueling shrinker problems
and to limit each radix tree to a single direct reclaim walker at a
time. We can do better.

By the time we mark an inode as under reclaim it has already been
evicted from the VFS inode cache, so we can reuse the struct inode
LRU fields to hold our own reclaim-ordered LRU list. With the generic
LRU code this doesn't hurt scalability, and the shrinker can walk the
LRU lists directly, giving us node-aware inode cache reclaim.

This means that we get the best of both worlds - background reclaim
runs very efficiently in terms of IO for cleaning dirty reclaimable
inodes, while direct reclaim can walk the LRU lists and pick inodes
to reclaim that suit the MM subsystem the best.

Signed-off-by: Dave Chinner <dchinner@redhat.com>
---
 fs/xfs/xfs_icache.c |   77 ++++++++++++++++++++++++++++++++++++++-------------
 fs/xfs/xfs_icache.h |    4 +--
 fs/xfs/xfs_linux.h  |    1 +
 fs/xfs/xfs_mount.h  |    2 +-
 fs/xfs/xfs_super.c  |    6 ++--
 5 files changed, 65 insertions(+), 25 deletions(-)

diff --git a/fs/xfs/xfs_icache.c b/fs/xfs/xfs_icache.c
index 2f91e2b..82b053f 100644
--- a/fs/xfs/xfs_icache.c
+++ b/fs/xfs/xfs_icache.c
@@ -244,6 +244,8 @@ xfs_iget_cache_hit(
 
 		spin_unlock(&ip->i_flags_lock);
 		spin_unlock(&pag->pag_ici_lock);
+
+		list_lru_del(&mp->m_inode_lru, &VFS_I(ip)->i_lru);
 	} else {
 		/* If the VFS inode is being torn down, pause and try again. */
 		if (!igrab(inode)) {
@@ -990,6 +992,17 @@ reclaim:
 	spin_unlock(&pag->pag_ici_lock);
 
 	/*
+	 * It is safe to do this unlocked check as we've guaranteed that we have
+	 * exclusive access to this inode via the XFS_IRECLAIM flag. Hence
+	 * concurrent LRU list walks will avoid removing this inode from the
+	 * list. For direct reclaim, we know the inode has already been removed
+	 * from any list it might be on, hence there's no need to traffic the
+	 * LRU code to find that out.
+	 */
+	if (!list_empty(&VFS_I(ip)->i_lru))
+		list_lru_del(&ip->i_mount->m_inode_lru, &VFS_I(ip)->i_lru);
+
+	/*
 	 * Here we do an (almost) spurious inode lock in order to coordinate
 	 * with inode cache radix tree lookups.  This is because the lookup
 	 * can reference the inodes in the cache without taking references.
@@ -1155,6 +1168,32 @@ xfs_reclaim_inodes(
 	return xfs_reclaim_inodes_ag(mp, mode, &nr_to_scan);
 }
 
+static int
+xfs_reclaim_inode_isolate(
+	struct list_head	*item,
+	spinlock_t		*lru_lock,
+	void			*cb_arg)
+{
+	struct inode		*inode = container_of(item, struct inode,
+						      i_lru);
+	struct list_head	*dispose = cb_arg;
+
+	rcu_read_lock();
+	if (xfs_reclaim_inode_grab(XFS_I(inode), SYNC_TRYLOCK)) {
+		/* not a reclaim candidate, skip it */
+		rcu_read_unlock();
+		return 2;
+	}
+	rcu_read_unlock();
+
+	/*
+	 * We have the XFS_IRECLAIM flag set now, so nobody is going to touch
+	 * this inode now except us.
+	 */
+	list_move(item, dispose);
+	return 0;
+}
+
 /*
  * Scan a certain number of inodes for reclaim.
  *
@@ -1167,36 +1206,34 @@ xfs_reclaim_inodes(
 long
 xfs_reclaim_inodes_nr(
 	struct xfs_mount	*mp,
-	long			nr_to_scan)
+	long			nr_to_scan,
+	nodemask_t		*nodes_to_scan)
 {
-	long nr = nr_to_scan;
+	LIST_HEAD(dispose);
+	long freed;
 
 	/* kick background reclaimer and push the AIL */
 	xfs_reclaim_work_queue(mp);
 	xfs_ail_push_all(mp->m_ail);
 
-	xfs_reclaim_inodes_ag(mp, SYNC_TRYLOCK | SYNC_WAIT, &nr);
-	return nr_to_scan - nr;
-}
+	freed = list_lru_walk_nodemask(&mp->m_inode_lru,
+				       xfs_reclaim_inode_isolate, &dispose,
+				       nr_to_scan, nodes_to_scan);
 
-/*
- * Return the number of reclaimable inodes in the filesystem for
- * the shrinker to determine how much to reclaim.
- */
-long
-xfs_reclaim_inodes_count(
-	struct xfs_mount	*mp)
-{
-	struct xfs_perag	*pag;
-	xfs_agnumber_t		ag = 0;
-	long			reclaimable = 0;
+	while (!list_empty(&dispose)) {
+		struct xfs_perag *pag;
+		struct inode	*inode;
 
-	while ((pag = xfs_perag_get_tag(mp, ag, XFS_ICI_RECLAIM_TAG))) {
-		ag = pag->pag_agno + 1;
-		reclaimable += pag->pag_ici_reclaimable;
+		inode = list_first_entry(&dispose, struct inode, i_lru);
+		list_del_init(&inode->i_lru);
+
+		pag = xfs_perag_get(mp,
+				    XFS_INO_TO_AGNO(mp, XFS_I(inode)->i_ino));
+		xfs_reclaim_inode(XFS_I(inode), pag, SYNC_WAIT);
 		xfs_perag_put(pag);
 	}
-	return reclaimable;
+
+	return freed;
 }
 
 STATIC int
diff --git a/fs/xfs/xfs_icache.h b/fs/xfs/xfs_icache.h
index c860d07..4214518 100644
--- a/fs/xfs/xfs_icache.h
+++ b/fs/xfs/xfs_icache.h
@@ -30,8 +30,8 @@ int xfs_iget(struct xfs_mount *mp, struct xfs_trans *tp, xfs_ino_t ino,
 void xfs_reclaim_worker(struct work_struct *work);
 
 int xfs_reclaim_inodes(struct xfs_mount *mp, int mode);
-long xfs_reclaim_inodes_count(struct xfs_mount *mp);
-long xfs_reclaim_inodes_nr(struct xfs_mount *mp, long nr_to_scan);
+long xfs_reclaim_inodes_nr(struct xfs_mount *mp, long nr_to_scan,
+			   nodemask_t *nodes_to_scan);
 
 void xfs_inode_set_reclaim_tag(struct xfs_inode *ip);
 
diff --git a/fs/xfs/xfs_linux.h b/fs/xfs/xfs_linux.h
index fe7e4df..40cc5d1 100644
--- a/fs/xfs/xfs_linux.h
+++ b/fs/xfs/xfs_linux.h
@@ -72,6 +72,7 @@
 #include <linux/kthread.h>
 #include <linux/freezer.h>
 #include <linux/list_sort.h>
+#include <linux/list_lru.h>
 
 #include <asm/page.h>
 #include <asm/div64.h>
diff --git a/fs/xfs/xfs_mount.h b/fs/xfs/xfs_mount.h
index bab8314..859fd5d 100644
--- a/fs/xfs/xfs_mount.h
+++ b/fs/xfs/xfs_mount.h
@@ -200,9 +200,9 @@ typedef struct xfs_mount {
 						     trimming */
 	__int64_t		m_update_flags;	/* sb flags we need to update
 						   on the next remount,rw */
-	struct shrinker		m_inode_shrink;	/* inode reclaim shrinker */
 	int64_t			m_low_space[XFS_LOWSP_MAX];
 						/* low free space thresholds */
+	struct list_lru		m_inode_lru;	/* direct inode reclaim list */
 
 	struct workqueue_struct	*m_data_workqueue;
 	struct workqueue_struct	*m_unwritten_workqueue;
diff --git a/fs/xfs/xfs_super.c b/fs/xfs/xfs_super.c
index 33d67d5..814e07a 100644
--- a/fs/xfs/xfs_super.c
+++ b/fs/xfs/xfs_super.c
@@ -968,6 +968,7 @@ xfs_fs_destroy_inode(
 	 * reclaim tear down all inodes.
 	 */
 out_reclaim:
+	list_lru_add(&ip->i_mount->m_inode_lru, &inode->i_lru);
 	xfs_inode_set_reclaim_tag(ip);
 }
 
@@ -1402,6 +1403,7 @@ xfs_fs_fill_super(
 	atomic_set(&mp->m_active_trans, 0);
 	INIT_DELAYED_WORK(&mp->m_reclaim_work, xfs_reclaim_worker);
 	INIT_DELAYED_WORK(&mp->m_eofblocks_work, xfs_eofblocks_worker);
+	list_lru_init(&mp->m_inode_lru);
 
 	mp->m_super = sb;
 	sb->s_fs_info = mp;
@@ -1519,7 +1521,7 @@ xfs_fs_nr_cached_objects(
 	struct super_block	*sb,
 	nodemask_t		*nodes_to_count)
 {
-	return xfs_reclaim_inodes_count(XFS_M(sb));
+	return list_lru_count_nodemask(&XFS_M(sb)->m_inode_lru, nodes_to_count);
 }
 
 static long
@@ -1528,7 +1530,7 @@ xfs_fs_free_cached_objects(
 	long			nr_to_scan,
 	nodemask_t		*nodes_to_scan)
 {
-	return xfs_reclaim_inodes_nr(XFS_M(sb), nr_to_scan);
+	return xfs_reclaim_inodes_nr(XFS_M(sb), nr_to_scan, nodes_to_scan);
 }
 
 static const struct super_operations xfs_super_operations = {
-- 
1.7.10


^ permalink raw reply related	[flat|nested] 57+ messages in thread

* [PATCH 14/19] xfs: use generic AG walk for background inode reclaim
  2012-11-27 23:14 [RFC, PATCH 00/19] Numa aware LRU lists and shrinkers Dave Chinner
                   ` (12 preceding siblings ...)
  2012-11-27 23:14 ` [PATCH 13/19] xfs: Node aware direct inode reclaim Dave Chinner
@ 2012-11-27 23:14 ` Dave Chinner
  2012-11-27 23:14 ` [PATCH 15/19] xfs: convert dquot cache lru to list_lru Dave Chinner
                   ` (7 subsequent siblings)
  21 siblings, 0 replies; 57+ messages in thread
From: Dave Chinner @ 2012-11-27 23:14 UTC (permalink / raw)
  To: glommer; +Cc: linux-kernel, linux-fsdevel, linux-mm, xfs

From: Dave Chinner <dchinner@redhat.com>

The per-ag inode cache radix trees are not walked by the shrinkers
any more, so there is no need for a special walker containing
heuristics to prevent multiple shrinker instances from colliding
with each other. Hence we can just remove it and convert the code
to use the generic walker.

Signed-off-by: Dave Chinner <dchinner@redhat.com>
---
 fs/xfs/xfs_ag.h          |    2 -
 fs/xfs/xfs_icache.c      |  217 +++++++++++-----------------------------------
 fs/xfs/xfs_icache.h      |    4 +-
 fs/xfs/xfs_mount.c       |    1 -
 fs/xfs/xfs_qm_syscalls.c |    2 +-
 5 files changed, 55 insertions(+), 171 deletions(-)

diff --git a/fs/xfs/xfs_ag.h b/fs/xfs/xfs_ag.h
index f2aeedb..40a7df9 100644
--- a/fs/xfs/xfs_ag.h
+++ b/fs/xfs/xfs_ag.h
@@ -218,8 +218,6 @@ typedef struct xfs_perag {
 	spinlock_t	pag_ici_lock;	/* incore inode cache lock */
 	struct radix_tree_root pag_ici_root;	/* incore inode cache root */
 	int		pag_ici_reclaimable;	/* reclaimable inodes */
-	struct mutex	pag_ici_reclaim_lock;	/* serialisation point */
-	unsigned long	pag_ici_reclaim_cursor;	/* reclaim restart point */
 
 	/* buffer cache index */
 	spinlock_t	pag_buf_lock;	/* lock for pag_buf_tree */
diff --git a/fs/xfs/xfs_icache.c b/fs/xfs/xfs_icache.c
index 82b053f..5cfc2eb 100644
--- a/fs/xfs/xfs_icache.c
+++ b/fs/xfs/xfs_icache.c
@@ -468,7 +468,8 @@ out_error_or_again:
 
 STATIC int
 xfs_inode_ag_walk_grab(
-	struct xfs_inode	*ip)
+	struct xfs_inode	*ip,
+	int			flags)
 {
 	struct inode		*inode = VFS_I(ip);
 
@@ -517,6 +518,7 @@ STATIC int
 xfs_inode_ag_walk(
 	struct xfs_mount	*mp,
 	struct xfs_perag	*pag,
+	int			(*grab)(struct xfs_inode *ip, int flags),
 	int			(*execute)(struct xfs_inode *ip,
 					   struct xfs_perag *pag, int flags,
 					   void *args),
@@ -530,6 +532,9 @@ xfs_inode_ag_walk(
 	int			done;
 	int			nr_found;
 
+	if (!grab)
+		grab = xfs_inode_ag_walk_grab;
+
 restart:
 	done = 0;
 	skipped = 0;
@@ -564,7 +569,7 @@ restart:
 		for (i = 0; i < nr_found; i++) {
 			struct xfs_inode *ip = batch[i];
 
-			if (done || xfs_inode_ag_walk_grab(ip))
+			if (done || grab(ip, flags))
 				batch[i] = NULL;
 
 			/*
@@ -593,7 +598,8 @@ restart:
 			if (!batch[i])
 				continue;
 			error = execute(batch[i], pag, flags, args);
-			IRELE(batch[i]);
+			if (grab == xfs_inode_ag_walk_grab)
+				IRELE(batch[i]);
 			if (error == EAGAIN) {
 				skipped++;
 				continue;
@@ -617,35 +623,10 @@ restart:
 	return last_error;
 }
 
-/*
- * Background scanning to trim post-EOF preallocated space. This is queued
- * based on the 'background_prealloc_discard_period' tunable (5m by default).
- */
-STATIC void
-xfs_queue_eofblocks(
-	struct xfs_mount *mp)
-{
-	rcu_read_lock();
-	if (radix_tree_tagged(&mp->m_perag_tree, XFS_ICI_EOFBLOCKS_TAG))
-		queue_delayed_work(mp->m_eofblocks_workqueue,
-				   &mp->m_eofblocks_work,
-				   msecs_to_jiffies(xfs_eofb_secs * 1000));
-	rcu_read_unlock();
-}
-
-void
-xfs_eofblocks_worker(
-	struct work_struct *work)
-{
-	struct xfs_mount *mp = container_of(to_delayed_work(work),
-				struct xfs_mount, m_eofblocks_work);
-	xfs_icache_free_eofblocks(mp, NULL);
-	xfs_queue_eofblocks(mp);
-}
-
 int
 xfs_inode_ag_iterator(
 	struct xfs_mount	*mp,
+	int			(*grab)(struct xfs_inode *ip, int flags),
 	int			(*execute)(struct xfs_inode *ip,
 					   struct xfs_perag *pag, int flags,
 					   void *args),
@@ -660,7 +641,8 @@ xfs_inode_ag_iterator(
 	ag = 0;
 	while ((pag = xfs_perag_get(mp, ag))) {
 		ag = pag->pag_agno + 1;
-		error = xfs_inode_ag_walk(mp, pag, execute, flags, args, -1);
+		error = xfs_inode_ag_walk(mp, pag, grab, execute,
+					  flags, args, -1);
 		xfs_perag_put(pag);
 		if (error) {
 			last_error = error;
@@ -674,6 +656,7 @@ xfs_inode_ag_iterator(
 int
 xfs_inode_ag_iterator_tag(
 	struct xfs_mount	*mp,
+	int			(*grab)(struct xfs_inode *ip, int flags),
 	int			(*execute)(struct xfs_inode *ip,
 					   struct xfs_perag *pag, int flags,
 					   void *args),
@@ -689,7 +672,8 @@ xfs_inode_ag_iterator_tag(
 	ag = 0;
 	while ((pag = xfs_perag_get_tag(mp, ag, tag))) {
 		ag = pag->pag_agno + 1;
-		error = xfs_inode_ag_walk(mp, pag, execute, flags, args, tag);
+		error = xfs_inode_ag_walk(mp, pag, grab, execute,
+					  flags, args, tag);
 		xfs_perag_put(pag);
 		if (error) {
 			last_error = error;
@@ -904,7 +888,8 @@ STATIC int
 xfs_reclaim_inode(
 	struct xfs_inode	*ip,
 	struct xfs_perag	*pag,
-	int			sync_mode)
+	int			sync_mode,
+	void			*args)
 {
 	struct xfs_buf		*bp = NULL;
 	int			error;
@@ -1032,140 +1017,14 @@ out:
 	return 0;
 }
 
-/*
- * Walk the AGs and reclaim the inodes in them. Even if the filesystem is
- * corrupted, we still want to try to reclaim all the inodes. If we don't,
- * then a shut down during filesystem unmount reclaim walk leak all the
- * unreclaimed inodes.
- */
-STATIC int
-xfs_reclaim_inodes_ag(
-	struct xfs_mount	*mp,
-	int			flags,
-	long			*nr_to_scan)
-{
-	struct xfs_perag	*pag;
-	int			error = 0;
-	int			last_error = 0;
-	xfs_agnumber_t		ag;
-	int			trylock = flags & SYNC_TRYLOCK;
-	int			skipped;
-
-restart:
-	ag = 0;
-	skipped = 0;
-	while ((pag = xfs_perag_get_tag(mp, ag, XFS_ICI_RECLAIM_TAG))) {
-		unsigned long	first_index = 0;
-		int		done = 0;
-		int		nr_found = 0;
-
-		ag = pag->pag_agno + 1;
-
-		if (trylock) {
-			if (!mutex_trylock(&pag->pag_ici_reclaim_lock)) {
-				skipped++;
-				xfs_perag_put(pag);
-				continue;
-			}
-			first_index = pag->pag_ici_reclaim_cursor;
-		} else
-			mutex_lock(&pag->pag_ici_reclaim_lock);
-
-		do {
-			struct xfs_inode *batch[XFS_LOOKUP_BATCH];
-			int	i;
-
-			rcu_read_lock();
-			nr_found = radix_tree_gang_lookup_tag(
-					&pag->pag_ici_root,
-					(void **)batch, first_index,
-					XFS_LOOKUP_BATCH,
-					XFS_ICI_RECLAIM_TAG);
-			if (!nr_found) {
-				done = 1;
-				rcu_read_unlock();
-				break;
-			}
-
-			/*
-			 * Grab the inodes before we drop the lock. if we found
-			 * nothing, nr == 0 and the loop will be skipped.
-			 */
-			for (i = 0; i < nr_found; i++) {
-				struct xfs_inode *ip = batch[i];
-
-				if (done || xfs_reclaim_inode_grab(ip, flags))
-					batch[i] = NULL;
-
-				/*
-				 * Update the index for the next lookup. Catch
-				 * overflows into the next AG range which can
-				 * occur if we have inodes in the last block of
-				 * the AG and we are currently pointing to the
-				 * last inode.
-				 *
-				 * Because we may see inodes that are from the
-				 * wrong AG due to RCU freeing and
-				 * reallocation, only update the index if it
-				 * lies in this AG. It was a race that lead us
-				 * to see this inode, so another lookup from
-				 * the same index will not find it again.
-				 */
-				if (XFS_INO_TO_AGNO(mp, ip->i_ino) !=
-								pag->pag_agno)
-					continue;
-				first_index = XFS_INO_TO_AGINO(mp, ip->i_ino + 1);
-				if (first_index < XFS_INO_TO_AGINO(mp, ip->i_ino))
-					done = 1;
-			}
-
-			/* unlock now we've grabbed the inodes. */
-			rcu_read_unlock();
-
-			for (i = 0; i < nr_found; i++) {
-				if (!batch[i])
-					continue;
-				error = xfs_reclaim_inode(batch[i], pag, flags);
-				if (error && last_error != EFSCORRUPTED)
-					last_error = error;
-			}
-
-			*nr_to_scan -= XFS_LOOKUP_BATCH;
-
-			cond_resched();
-
-		} while (nr_found && !done && *nr_to_scan > 0);
-
-		if (trylock && !done)
-			pag->pag_ici_reclaim_cursor = first_index;
-		else
-			pag->pag_ici_reclaim_cursor = 0;
-		mutex_unlock(&pag->pag_ici_reclaim_lock);
-		xfs_perag_put(pag);
-	}
-
-	/*
-	 * if we skipped any AG, and we still have scan count remaining, do
-	 * another pass this time using blocking reclaim semantics (i.e
-	 * waiting on the reclaim locks and ignoring the reclaim cursors). This
-	 * ensure that when we get more reclaimers than AGs we block rather
-	 * than spin trying to execute reclaim.
-	 */
-	if (skipped && (flags & SYNC_WAIT) && *nr_to_scan > 0) {
-		trylock = 0;
-		goto restart;
-	}
-	return XFS_ERROR(last_error);
-}
-
 int
 xfs_reclaim_inodes(
-	xfs_mount_t	*mp,
-	int		mode)
+	struct xfs_mount	*mp,
+	int			flags)
 {
-	long		nr_to_scan = LONG_MAX;
-
-	return xfs_reclaim_inodes_ag(mp, mode, &nr_to_scan);
+	return xfs_inode_ag_iterator_tag(mp, xfs_reclaim_inode_grab,
+					 xfs_reclaim_inode, flags,
+					 NULL, XFS_ICI_RECLAIM_TAG);
 }
 
 static int
@@ -1229,13 +1088,39 @@ xfs_reclaim_inodes_nr(
 
 		pag = xfs_perag_get(mp,
 				    XFS_INO_TO_AGNO(mp, XFS_I(inode)->i_ino));
-		xfs_reclaim_inode(XFS_I(inode), pag, SYNC_WAIT);
+		xfs_reclaim_inode(XFS_I(inode), pag, SYNC_WAIT, NULL);
 		xfs_perag_put(pag);
 	}
 
 	return freed;
 }
 
+/*
+ * Background scanning to trim post-EOF preallocated space. This is queued
+ * based on the 'background_prealloc_discard_period' tunable (5m by default).
+ */
+STATIC void
+xfs_queue_eofblocks(
+	struct xfs_mount *mp)
+{
+	rcu_read_lock();
+	if (radix_tree_tagged(&mp->m_perag_tree, XFS_ICI_EOFBLOCKS_TAG))
+		queue_delayed_work(mp->m_eofblocks_workqueue,
+				   &mp->m_eofblocks_work,
+				   msecs_to_jiffies(xfs_eofb_secs * 1000));
+	rcu_read_unlock();
+}
+
+void
+xfs_eofblocks_worker(
+	struct work_struct *work)
+{
+	struct xfs_mount *mp = container_of(to_delayed_work(work),
+				struct xfs_mount, m_eofblocks_work);
+	xfs_icache_free_eofblocks(mp, NULL);
+	xfs_queue_eofblocks(mp);
+}
+
 STATIC int
 xfs_inode_match_id(
 	struct xfs_inode	*ip,
@@ -1310,8 +1195,8 @@ xfs_icache_free_eofblocks(
 	if (eofb && (eofb->eof_flags & XFS_EOF_FLAGS_SYNC))
 		flags = SYNC_WAIT;
 
-	return xfs_inode_ag_iterator_tag(mp, xfs_inode_free_eofblocks, flags,
-					 eofb, XFS_ICI_EOFBLOCKS_TAG);
+	return xfs_inode_ag_iterator_tag(mp, NULL, xfs_inode_free_eofblocks,
+					 flags, eofb, XFS_ICI_EOFBLOCKS_TAG);
 }
 
 void
diff --git a/fs/xfs/xfs_icache.h b/fs/xfs/xfs_icache.h
index 4214518..a3380bf 100644
--- a/fs/xfs/xfs_icache.h
+++ b/fs/xfs/xfs_icache.h
@@ -29,7 +29,7 @@ int xfs_iget(struct xfs_mount *mp, struct xfs_trans *tp, xfs_ino_t ino,
 
 void xfs_reclaim_worker(struct work_struct *work);
 
-int xfs_reclaim_inodes(struct xfs_mount *mp, int mode);
+int xfs_reclaim_inodes(struct xfs_mount *mp, int flags);
 long xfs_reclaim_inodes_nr(struct xfs_mount *mp, long nr_to_scan,
 			   nodemask_t *nodes_to_scan);
 
@@ -42,10 +42,12 @@ void xfs_eofblocks_worker(struct work_struct *);
 
 int xfs_sync_inode_grab(struct xfs_inode *ip);
 int xfs_inode_ag_iterator(struct xfs_mount *mp,
+	int (*grab)(struct xfs_inode *ip, int flags),
 	int (*execute)(struct xfs_inode *ip, struct xfs_perag *pag,
 		int flags, void *args),
 	int flags, void *args);
 int xfs_inode_ag_iterator_tag(struct xfs_mount *mp,
+	int (*grab)(struct xfs_inode *ip, int flags),
 	int (*execute)(struct xfs_inode *ip, struct xfs_perag *pag,
 		int flags, void *args),
 	int flags, void *args, int tag);
diff --git a/fs/xfs/xfs_mount.c b/fs/xfs/xfs_mount.c
index da50846..6985a32 100644
--- a/fs/xfs/xfs_mount.c
+++ b/fs/xfs/xfs_mount.c
@@ -456,7 +456,6 @@ xfs_initialize_perag(
 		pag->pag_agno = index;
 		pag->pag_mount = mp;
 		spin_lock_init(&pag->pag_ici_lock);
-		mutex_init(&pag->pag_ici_reclaim_lock);
 		INIT_RADIX_TREE(&pag->pag_ici_root, GFP_ATOMIC);
 		spin_lock_init(&pag->pag_buf_lock);
 		pag->pag_buf_tree = RB_ROOT;
diff --git a/fs/xfs/xfs_qm_syscalls.c b/fs/xfs/xfs_qm_syscalls.c
index 5f53e75..85294a6 100644
--- a/fs/xfs/xfs_qm_syscalls.c
+++ b/fs/xfs/xfs_qm_syscalls.c
@@ -883,5 +883,5 @@ xfs_qm_dqrele_all_inodes(
 	uint		 flags)
 {
 	ASSERT(mp->m_quotainfo);
-	xfs_inode_ag_iterator(mp, xfs_dqrele_inode, flags, NULL);
+	xfs_inode_ag_iterator(mp, NULL, xfs_dqrele_inode, flags, NULL);
 }
-- 
1.7.10

^ permalink raw reply related	[flat|nested] 57+ messages in thread

* [PATCH 15/19] xfs: convert dquot cache lru to list_lru
  2012-11-27 23:14 [RFC, PATCH 00/19] Numa aware LRU lists and shrinkers Dave Chinner
                   ` (13 preceding siblings ...)
  2012-11-27 23:14 ` [PATCH 14/19] xfs: use generic AG walk for background " Dave Chinner
@ 2012-11-27 23:14 ` Dave Chinner
  2012-11-28 16:17   ` Christoph Hellwig
  2012-11-27 23:14 ` [PATCH 16/19] fs: convert fs shrinkers to new scan/count API Dave Chinner
                   ` (6 subsequent siblings)
  21 siblings, 1 reply; 57+ messages in thread
From: Dave Chinner @ 2012-11-27 23:14 UTC (permalink / raw)
  To: glommer; +Cc: linux-kernel, linux-fsdevel, linux-mm, xfs

From: Dave Chinner <dchinner@redhat.com>

Convert the XFS dquot lru to use the list_lru construct and convert
the shrinker to being node aware.
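
As a rough sketch, the conversion boils down to the following calls
(all taken from the hunks below; this is not a complete listing):

	/* one LRU per quotainfo, initialised at mount */
	list_lru_init(&qi->qi_lru);

	/* dquots go on and off the LRU without a private lock or counter */
	if (list_lru_add(&qi->qi_lru, &dqp->q_lru))
		XFS_STATS_INC(xs_qm_dquot_unused);
	list_lru_del(&qi->qi_lru, &dqp->q_lru);

	/* the shrinker counts and walks per-node via the list_lru API */
	count = list_lru_count_nodemask(&qi->qi_lru, &sc->nodes_to_scan);
	freed = list_lru_walk_nodemask(&qi->qi_lru, xfs_qm_dquot_isolate,
				       &isol, sc->nr_to_scan,
				       &sc->nodes_to_scan);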

Signed-off-by: Dave Chinner <dchinner@redhat.com>
---
 fs/xfs/xfs_dquot.c |    7 +-
 fs/xfs/xfs_qm.c    |  307 ++++++++++++++++++++++++++--------------------------
 fs/xfs/xfs_qm.h    |    4 +-
 3 files changed, 156 insertions(+), 162 deletions(-)

diff --git a/fs/xfs/xfs_dquot.c b/fs/xfs/xfs_dquot.c
index 9e1bf52..4fcd42e 100644
--- a/fs/xfs/xfs_dquot.c
+++ b/fs/xfs/xfs_dquot.c
@@ -843,13 +843,8 @@ xfs_qm_dqput_final(
 
 	trace_xfs_dqput_free(dqp);
 
-	mutex_lock(&qi->qi_lru_lock);
-	if (list_empty(&dqp->q_lru)) {
-		list_add_tail(&dqp->q_lru, &qi->qi_lru_list);
-		qi->qi_lru_count++;
+	if (list_lru_add(&qi->qi_lru, &dqp->q_lru))
 		XFS_STATS_INC(xs_qm_dquot_unused);
-	}
-	mutex_unlock(&qi->qi_lru_lock);
 
 	/*
 	 * If we just added a udquot to the freelist, then we want to release
diff --git a/fs/xfs/xfs_qm.c b/fs/xfs/xfs_qm.c
index e6a0af0..534b924 100644
--- a/fs/xfs/xfs_qm.c
+++ b/fs/xfs/xfs_qm.c
@@ -50,7 +50,6 @@
  */
 STATIC int	xfs_qm_init_quotainos(xfs_mount_t *);
 STATIC int	xfs_qm_init_quotainfo(xfs_mount_t *);
-STATIC int	xfs_qm_shake(struct shrinker *, struct shrink_control *);
 
 /*
  * We use the batch lookup interface to iterate over the dquots as it
@@ -196,12 +195,9 @@ xfs_qm_dqpurge(
 	 * We move dquots to the freelist as soon as their reference count
 	 * hits zero, so it really should be on the freelist here.
 	 */
-	mutex_lock(&qi->qi_lru_lock);
 	ASSERT(!list_empty(&dqp->q_lru));
-	list_del_init(&dqp->q_lru);
-	qi->qi_lru_count--;
+	list_lru_del(&qi->qi_lru, &dqp->q_lru);
 	XFS_STATS_DEC(xs_qm_dquot_unused);
-	mutex_unlock(&qi->qi_lru_lock);
 
 	xfs_qm_dqdestroy(dqp);
 
@@ -617,6 +613,156 @@ xfs_qm_dqdetach(
 	}
 }
 
+STATIC void
+xfs_qm_dqfree_one(
+	struct xfs_dquot	*dqp)
+{
+	struct xfs_mount	*mp = dqp->q_mount;
+	struct xfs_quotainfo	*qi = mp->m_quotainfo;
+
+	mutex_lock(&qi->qi_tree_lock);
+	radix_tree_delete(XFS_DQUOT_TREE(qi, dqp->q_core.d_flags),
+					 be32_to_cpu(dqp->q_core.d_id));
+
+	qi->qi_dquots--;
+	mutex_unlock(&qi->qi_tree_lock);
+
+	xfs_qm_dqdestroy(dqp);
+}
+
+struct xfs_qm_isolate {
+	struct list_head	buffers;
+	struct list_head	dispose;
+};
+
+static int
+xfs_qm_dquot_isolate(
+	struct list_head	*item,
+	spinlock_t		*lru_lock,
+	void			*arg)
+{
+	struct xfs_dquot	*dqp = container_of(item,
+						struct xfs_dquot, q_lru);
+	struct xfs_qm_isolate	*isol = arg;
+
+	if (!xfs_dqlock_nowait(dqp))
+		goto out_miss_busy;
+
+	/*
+	 * This dquot has acquired a reference in the meantime; remove it from
+	 * the freelist and try again.
+	 */
+	if (dqp->q_nrefs) {
+		xfs_dqunlock(dqp);
+		XFS_STATS_INC(xs_qm_dqwants);
+
+		trace_xfs_dqreclaim_want(dqp);
+		list_del_init(&dqp->q_lru);
+		XFS_STATS_DEC(xs_qm_dquot_unused);
+		return 0;
+	}
+
+	/*
+	 * If the dquot is dirty, flush it. If it's already being flushed, just
+	 * skip it so there is time for the IO to complete before we try to
+	 * reclaim it again on the next LRU pass.
+	 */
+	if (!xfs_dqflock_nowait(dqp)) {
+		xfs_dqunlock(dqp);
+		goto out_miss_busy;
+	}
+
+	if (XFS_DQ_IS_DIRTY(dqp)) {
+		struct xfs_buf	*bp = NULL;
+		int		error;
+
+		trace_xfs_dqreclaim_dirty(dqp);
+
+		/* we have to drop the LRU lock to flush the dquot */
+		spin_unlock(lru_lock);
+
+		error = xfs_qm_dqflush(dqp, &bp);
+		if (error) {
+			xfs_warn(dqp->q_mount, "%s: dquot %p flush failed",
+				 __func__, dqp);
+			goto out_unlock_dirty;
+		}
+
+		xfs_buf_delwri_queue(bp, &isol->buffers);
+		xfs_buf_relse(bp);
+		goto out_unlock_dirty;
+	}
+	xfs_dqfunlock(dqp);
+
+	/*
+	 * Prevent lookups now that we are past the point of no return.
+	 */
+	dqp->dq_flags |= XFS_DQ_FREEING;
+	xfs_dqunlock(dqp);
+
+	ASSERT(dqp->q_nrefs == 0);
+	list_move_tail(&dqp->q_lru, &isol->dispose);
+	XFS_STATS_DEC(xs_qm_dquot_unused);
+	trace_xfs_dqreclaim_done(dqp);
+	XFS_STATS_INC(xs_qm_dqreclaims);
+	return 0;
+
+out_miss_busy:
+	trace_xfs_dqreclaim_busy(dqp);
+	XFS_STATS_INC(xs_qm_dqreclaim_misses);
+	return 2;
+
+out_unlock_dirty:
+	trace_xfs_dqreclaim_busy(dqp);
+	XFS_STATS_INC(xs_qm_dqreclaim_misses);
+	return 3;
+}
+
+static long
+xfs_qm_shrink_scan(
+	struct shrinker		*shrink,
+	struct shrink_control	*sc)
+{
+	struct xfs_quotainfo	*qi = container_of(shrink,
+					struct xfs_quotainfo, qi_shrinker);
+	struct xfs_qm_isolate	isol;
+	long			freed;
+	int			error;
+
+	if ((sc->gfp_mask & (__GFP_FS|__GFP_WAIT)) != (__GFP_FS|__GFP_WAIT))
+		return 0;
+
+	INIT_LIST_HEAD(&isol.buffers);
+	INIT_LIST_HEAD(&isol.dispose);
+
+	freed = list_lru_walk_nodemask(&qi->qi_lru, xfs_qm_dquot_isolate, &isol,
+					sc->nr_to_scan, &sc->nodes_to_scan);
+
+	error = xfs_buf_delwri_submit(&isol.buffers);
+	if (error)
+		xfs_warn(NULL, "%s: dquot reclaim failed", __func__);
+
+	while (!list_empty(&isol.dispose)) {
+		struct xfs_dquot	*dqp;
+
+		dqp = list_first_entry(&isol.dispose, struct xfs_dquot, q_lru);
+		list_del_init(&dqp->q_lru);
+		xfs_qm_dqfree_one(dqp);
+	}
+
+	return freed;
+}
+
+static long
+xfs_qm_shrink_count(
+	struct shrinker		*shrink,
+	struct shrink_control	*sc)
+{
+	struct xfs_quotainfo	*qi = container_of(shrink,
+					struct xfs_quotainfo, qi_shrinker);
+
+	return list_lru_count_nodemask(&qi->qi_lru, &sc->nodes_to_scan);
+}
+
 /*
  * This initializes all the quota information that's kept in the
  * mount structure
@@ -647,9 +793,7 @@ xfs_qm_init_quotainfo(
 	INIT_RADIX_TREE(&qinf->qi_gquota_tree, GFP_NOFS);
 	mutex_init(&qinf->qi_tree_lock);
 
-	INIT_LIST_HEAD(&qinf->qi_lru_list);
-	qinf->qi_lru_count = 0;
-	mutex_init(&qinf->qi_lru_lock);
+	list_lru_init(&qinf->qi_lru);
 
 	/* mutex used to serialize quotaoffs */
 	mutex_init(&qinf->qi_quotaofflock);
@@ -716,7 +860,8 @@ xfs_qm_init_quotainfo(
 		qinf->qi_rtbwarnlimit = XFS_QM_RTBWARNLIMIT;
 	}
 
-	qinf->qi_shrinker.shrink = xfs_qm_shake;
+	qinf->qi_shrinker.count_objects = xfs_qm_shrink_count;
+	qinf->qi_shrinker.scan_objects = xfs_qm_shrink_scan;
 	qinf->qi_shrinker.seeks = DEFAULT_SEEKS;
 	register_shrinker(&qinf->qi_shrinker);
 	return 0;
@@ -1428,150 +1573,6 @@ xfs_qm_init_quotainos(
 	return 0;
 }
 
-STATIC void
-xfs_qm_dqfree_one(
-	struct xfs_dquot	*dqp)
-{
-	struct xfs_mount	*mp = dqp->q_mount;
-	struct xfs_quotainfo	*qi = mp->m_quotainfo;
-
-	mutex_lock(&qi->qi_tree_lock);
-	radix_tree_delete(XFS_DQUOT_TREE(qi, dqp->q_core.d_flags),
-			  be32_to_cpu(dqp->q_core.d_id));
-
-	qi->qi_dquots--;
-	mutex_unlock(&qi->qi_tree_lock);
-
-	xfs_qm_dqdestroy(dqp);
-}
-
-STATIC void
-xfs_qm_dqreclaim_one(
-	struct xfs_dquot	*dqp,
-	struct list_head	*buffer_list,
-	struct list_head	*dispose_list)
-{
-	struct xfs_mount	*mp = dqp->q_mount;
-	struct xfs_quotainfo	*qi = mp->m_quotainfo;
-	int			error;
-
-	if (!xfs_dqlock_nowait(dqp))
-		goto out_busy;
-
-	/*
-	 * This dquot has acquired a reference in the meantime remove it from
-	 * the freelist and try again.
-	 */
-	if (dqp->q_nrefs) {
-		xfs_dqunlock(dqp);
-
-		trace_xfs_dqreclaim_want(dqp);
-		XFS_STATS_INC(xs_qm_dqwants);
-
-		list_del_init(&dqp->q_lru);
-		qi->qi_lru_count--;
-		XFS_STATS_DEC(xs_qm_dquot_unused);
-		return;
-	}
-
-	/*
-	 * Try to grab the flush lock. If this dquot is in the process of
-	 * getting flushed to disk, we don't want to reclaim it.
-	 */
-	if (!xfs_dqflock_nowait(dqp))
-		goto out_busy;
-
-	if (XFS_DQ_IS_DIRTY(dqp)) {
-		struct xfs_buf	*bp = NULL;
-
-		trace_xfs_dqreclaim_dirty(dqp);
-
-		error = xfs_qm_dqflush(dqp, &bp);
-		if (error) {
-			xfs_warn(mp, "%s: dquot %p flush failed",
-				 __func__, dqp);
-			goto out_busy;
-		}
-
-		xfs_buf_delwri_queue(bp, buffer_list);
-		xfs_buf_relse(bp);
-		/*
-		 * Give the dquot another try on the freelist, as the
-		 * flushing will take some time.
-		 */
-		goto out_busy;
-	}
-	xfs_dqfunlock(dqp);
-
-	/*
-	 * Prevent lookups now that we are past the point of no return.
-	 */
-	dqp->dq_flags |= XFS_DQ_FREEING;
-	xfs_dqunlock(dqp);
-
-	ASSERT(dqp->q_nrefs == 0);
-	list_move_tail(&dqp->q_lru, dispose_list);
-	qi->qi_lru_count--;
-	XFS_STATS_DEC(xs_qm_dquot_unused);
-
-	trace_xfs_dqreclaim_done(dqp);
-	XFS_STATS_INC(xs_qm_dqreclaims);
-	return;
-
-out_busy:
-	xfs_dqunlock(dqp);
-
-	/*
-	 * Move the dquot to the tail of the list so that we don't spin on it.
-	 */
-	list_move_tail(&dqp->q_lru, &qi->qi_lru_list);
-
-	trace_xfs_dqreclaim_busy(dqp);
-	XFS_STATS_INC(xs_qm_dqreclaim_misses);
-}
-
-STATIC int
-xfs_qm_shake(
-	struct shrinker		*shrink,
-	struct shrink_control	*sc)
-{
-	struct xfs_quotainfo	*qi =
-		container_of(shrink, struct xfs_quotainfo, qi_shrinker);
-	int			nr_to_scan = sc->nr_to_scan;
-	LIST_HEAD		(buffer_list);
-	LIST_HEAD		(dispose_list);
-	struct xfs_dquot	*dqp;
-	int			error;
-
-	if ((sc->gfp_mask & (__GFP_FS|__GFP_WAIT)) != (__GFP_FS|__GFP_WAIT))
-		return 0;
-	if (!nr_to_scan)
-		goto out;
-
-	mutex_lock(&qi->qi_lru_lock);
-	while (!list_empty(&qi->qi_lru_list)) {
-		if (nr_to_scan-- <= 0)
-			break;
-		dqp = list_first_entry(&qi->qi_lru_list, struct xfs_dquot,
-				       q_lru);
-		xfs_qm_dqreclaim_one(dqp, &buffer_list, &dispose_list);
-	}
-	mutex_unlock(&qi->qi_lru_lock);
-
-	error = xfs_buf_delwri_submit(&buffer_list);
-	if (error)
-		xfs_warn(NULL, "%s: dquot reclaim failed", __func__);
-
-	while (!list_empty(&dispose_list)) {
-		dqp = list_first_entry(&dispose_list, struct xfs_dquot, q_lru);
-		list_del_init(&dqp->q_lru);
-		xfs_qm_dqfree_one(dqp);
-	}
-
-out:
-	return (qi->qi_lru_count / 100) * sysctl_vfs_cache_pressure;
-}
-
 /*
  * Start a transaction and write the incore superblock changes to
  * disk. flags parameter indicates which fields have changed.
diff --git a/fs/xfs/xfs_qm.h b/fs/xfs/xfs_qm.h
index 44b858b..d08b72d 100644
--- a/fs/xfs/xfs_qm.h
+++ b/fs/xfs/xfs_qm.h
@@ -47,9 +47,7 @@ typedef struct xfs_quotainfo {
 	struct mutex qi_tree_lock;
 	xfs_inode_t	*qi_uquotaip;	 /* user quota inode */
 	xfs_inode_t	*qi_gquotaip;	 /* group quota inode */
-	struct list_head qi_lru_list;
-	struct mutex	 qi_lru_lock;
-	int		 qi_lru_count;
+	struct list_lru	 qi_lru;
 	int		 qi_dquots;
 	time_t		 qi_btimelimit;	 /* limit for blks timer */
 	time_t		 qi_itimelimit;	 /* limit for inodes timer */
-- 
1.7.10

^ permalink raw reply related	[flat|nested] 57+ messages in thread

* [PATCH 16/19] fs: convert fs shrinkers to new scan/count API
  2012-11-27 23:14 [RFC, PATCH 00/19] Numa aware LRU lists and shrinkers Dave Chinner
                   ` (14 preceding siblings ...)
  2012-11-27 23:14 ` [PATCH 15/19] xfs: convert dquot cache lru to list_lru Dave Chinner
@ 2012-11-27 23:14 ` Dave Chinner
  2012-11-27 23:14 ` [PATCH 17/19] drivers: convert shrinkers to new count/scan API Dave Chinner
                   ` (5 subsequent siblings)
  21 siblings, 0 replies; 57+ messages in thread
From: Dave Chinner @ 2012-11-27 23:14 UTC (permalink / raw)
  To: glommer; +Cc: linux-kernel, linux-fsdevel, linux-mm, xfs

From: Dave Chinner <dchinner@redhat.com>

Convert the filesystem shrinkers to use the new API, and standardise
some of the behaviours of the shrinkers at the same time. For
example, nr_to_scan means the number of objects to scan, not the
number of objects to free.
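
The conversions all end up with the same basic shape. A minimal
sketch, assuming a cache that tracks its object count (the foo_*
names are placeholders, not real functions):

	static long foo_nr_objects;	/* maintained by the cache */

	static long foo_shrink_count(struct shrinker *shrink,
				     struct shrink_control *sc)
	{
		/* report how many objects could be freed - no scanning here */
		return foo_nr_objects;
	}

	static long foo_shrink_scan(struct shrinker *shrink,
				    struct shrink_control *sc)
	{
		/* scan up to sc->nr_to_scan objects, return the number freed */
		long freed = 0;

		while (sc->nr_to_scan-- > 0 && foo_nr_objects > 0) {
			foo_nr_objects--;	/* stands in for freeing an object */
			freed++;
		}
		return freed;
	}

	static struct shrinker foo_shrinker = {
		.count_objects	= foo_shrink_count,
		.scan_objects	= foo_shrink_scan,
		.seeks		= DEFAULT_SEEKS,
	};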

I refactored the CIFS idmap shrinker a little - it really needs to
be broken up into one shrinker per tree, with an item count kept at
the tree root, so that we don't need to walk the whole tree every
time the shrinker needs to count the number of objects in it (i.e.
all the time under memory pressure). A sketch of that idea follows.
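
Roughly what I mean, as a sketch only (none of these names exist in
the tree, and this is not part of the patch):

	/* keep the item count with the tree so ->count_objects() is O(1) */
	struct cifs_idmap_tree {
		struct rb_root	root;
		spinlock_t	lock;
		long		nr_items;	/* bumped on insert, dropped on erase */
		struct shrinker	shrinker;	/* one shrinker instance per tree */
	};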

Signed-off-by: Dave Chinner <dchinner@redhat.com>
---
 fs/cifs/cifsacl.c |  112 ++++++++++++++++++++++++++++++++---------------------
 fs/gfs2/glock.c   |   23 ++++++-----
 fs/gfs2/main.c    |    3 +-
 fs/gfs2/quota.c   |   12 +++---
 fs/gfs2/quota.h   |    4 +-
 fs/mbcache.c      |   53 ++++++++++++++-----------
 fs/nfs/dir.c      |   21 ++++++++--
 fs/nfs/internal.h |    4 +-
 fs/nfs/super.c    |    3 +-
 fs/quota/dquot.c  |   37 ++++++++----------
 10 files changed, 163 insertions(+), 109 deletions(-)

diff --git a/fs/cifs/cifsacl.c b/fs/cifs/cifsacl.c
index 0fb15bb..a0e5c22 100644
--- a/fs/cifs/cifsacl.c
+++ b/fs/cifs/cifsacl.c
@@ -44,66 +44,95 @@ static const struct cifs_sid sid_user = {1, 2 , {0, 0, 0, 0, 0, 5}, {} };
 
 const struct cred *root_cred;
 
-static void
-shrink_idmap_tree(struct rb_root *root, int nr_to_scan, int *nr_rem,
-			int *nr_del)
+static long
+cifs_idmap_tree_scan(
+	struct rb_root	*root,
+	spinlock_t	*tree_lock,
+	long		nr_to_scan)
 {
 	struct rb_node *node;
-	struct rb_node *tmp;
-	struct cifs_sid_id *psidid;
+	long freed = 0;
 
+	spin_lock(tree_lock);
 	node = rb_first(root);
-	while (node) {
+	while (nr_to_scan-- >= 0 && node) {
+		struct cifs_sid_id *psidid;
+		struct rb_node *tmp;
+
 		tmp = node;
 		node = rb_next(tmp);
 		psidid = rb_entry(tmp, struct cifs_sid_id, rbnode);
-		if (nr_to_scan == 0 || *nr_del == nr_to_scan)
-			++(*nr_rem);
-		else {
-			if (time_after(jiffies, psidid->time + SID_MAP_EXPIRE)
-						&& psidid->refcount == 0) {
-				rb_erase(tmp, root);
-				++(*nr_del);
-			} else
-				++(*nr_rem);
+		if (time_after(jiffies, psidid->time + SID_MAP_EXPIRE)
+					&& psidid->refcount == 0) {
+			rb_erase(tmp, root);
+			freed++;
 		}
 	}
+	spin_unlock(tree_lock);
+	return freed;
+}
+
+static long
+cifs_idmap_tree_count(
+	struct rb_root	*root,
+	spinlock_t	*tree_lock)
+{
+	struct rb_node *node;
+	long count = 0;
+
+	spin_lock(tree_lock);
+	node = rb_first(root);
+	while (node) {
+		node = rb_next(node);
+		count++;
+	}
+	spin_unlock(tree_lock);
+	return count;
 }
 
 /*
- * Run idmap cache shrinker.
+ * idmap tree shrinker.
+ *
+ * XXX (dchinner): this should really be 4 separate shrinker instances (one per
+ * tree structure) so that each are shrunk proportionally to their individual
+ * sizes.
  */
-static int
-cifs_idmap_shrinker(struct shrinker *shrink, struct shrink_control *sc)
+static long
+cifs_idmap_shrink_scan(
+	struct shrinker		*shrink,
+	struct shrink_control	*sc)
 {
-	int nr_to_scan = sc->nr_to_scan;
-	int nr_del = 0;
-	int nr_rem = 0;
-	struct rb_root *root;
+	long freed = 0;
 
-	root = &uidtree;
-	spin_lock(&siduidlock);
-	shrink_idmap_tree(root, nr_to_scan, &nr_rem, &nr_del);
-	spin_unlock(&siduidlock);
+	freed += cifs_idmap_tree_scan(&uidtree, &siduidlock, sc->nr_to_scan);
+	freed += cifs_idmap_tree_scan(&gidtree, &sidgidlock, sc->nr_to_scan);
+	freed += cifs_idmap_tree_scan(&siduidtree, &siduidlock, sc->nr_to_scan);
+	freed += cifs_idmap_tree_scan(&sidgidtree, &sidgidlock, sc->nr_to_scan);
 
-	root = &gidtree;
-	spin_lock(&sidgidlock);
-	shrink_idmap_tree(root, nr_to_scan, &nr_rem, &nr_del);
-	spin_unlock(&sidgidlock);
+	return freed;
+}
 
-	root = &siduidtree;
-	spin_lock(&uidsidlock);
-	shrink_idmap_tree(root, nr_to_scan, &nr_rem, &nr_del);
-	spin_unlock(&uidsidlock);
+static long
+cifs_idmap_shrink_count(
+	struct shrinker		*shrink,
+	struct shrink_control	*sc)
+{
+	long count = 0;
 
-	root = &sidgidtree;
-	spin_lock(&gidsidlock);
-	shrink_idmap_tree(root, nr_to_scan, &nr_rem, &nr_del);
-	spin_unlock(&gidsidlock);
+	count += cifs_idmap_tree_count(&uidtree, &siduidlock);
+	count += cifs_idmap_tree_count(&gidtree, &sidgidlock);
+	count += cifs_idmap_tree_count(&siduidtree, &siduidlock);
+	count += cifs_idmap_tree_count(&sidgidtree, &sidgidlock);
 
-	return nr_rem;
+	return count;
 }
 
+static struct shrinker cifs_shrinker = {
+	.count_objects = cifs_idmap_shrink_count,
+	.scan_objects = cifs_idmap_shrink_scan,
+	.seeks = DEFAULT_SEEKS,
+};
+
 static void
 sid_rb_insert(struct rb_root *root, unsigned long cid,
 		struct cifs_sid_id **psidid, char *typestr)
@@ -161,11 +190,6 @@ sid_rb_search(struct rb_root *root, unsigned long cid)
 	return NULL;
 }
 
-static struct shrinker cifs_shrinker = {
-	.shrink = cifs_idmap_shrinker,
-	.seeks = DEFAULT_SEEKS,
-};
-
 static int
 cifs_idmap_key_instantiate(struct key *key, struct key_preparsed_payload *prep)
 {
diff --git a/fs/gfs2/glock.c b/fs/gfs2/glock.c
index e6c2fd5..4be61a22e 100644
--- a/fs/gfs2/glock.c
+++ b/fs/gfs2/glock.c
@@ -1365,9 +1365,8 @@ void gfs2_glock_complete(struct gfs2_glock *gl, int ret)
 		gfs2_glock_put(gl);
 }
 
-
-static int gfs2_shrink_glock_memory(struct shrinker *shrink,
-				    struct shrink_control *sc)
+static long gfs2_glock_shrink_scan(struct shrinker *shrink,
+				   struct shrink_control *sc)
 {
 	struct gfs2_glock *gl;
 	int may_demote;
@@ -1375,15 +1374,13 @@ static int gfs2_shrink_glock_memory(struct shrinker *shrink,
 	int nr = sc->nr_to_scan;
 	gfp_t gfp_mask = sc->gfp_mask;
 	LIST_HEAD(skipped);
-
-	if (nr == 0)
-		goto out;
+	long freed = 0;
 
 	if (!(gfp_mask & __GFP_FS))
 		return -1;
 
 	spin_lock(&lru_lock);
-	while(nr && !list_empty(&lru_list)) {
+	while(nr-- >= 0 && !list_empty(&lru_list)) {
 		gl = list_entry(lru_list.next, struct gfs2_glock, gl_lru);
 		list_del_init(&gl->gl_lru);
 		clear_bit(GLF_LRU, &gl->gl_flags);
@@ -1397,7 +1394,7 @@ static int gfs2_shrink_glock_memory(struct shrinker *shrink,
 			may_demote = demote_ok(gl);
 			if (may_demote) {
 				handle_callback(gl, LM_ST_UNLOCKED, 0);
-				nr--;
+				freed++;
 			}
 			clear_bit(GLF_LOCK, &gl->gl_flags);
 			smp_mb__after_clear_bit();
@@ -1414,12 +1411,18 @@ static int gfs2_shrink_glock_memory(struct shrinker *shrink,
 	list_splice(&skipped, &lru_list);
 	atomic_add(nr_skipped, &lru_count);
 	spin_unlock(&lru_lock);
-out:
+	return freed;
+}
+
+static long gfs2_glock_shrink_count(struct shrinker *shrink,
+				    struct shrink_control *sc)
+{
 	return (atomic_read(&lru_count) / 100) * sysctl_vfs_cache_pressure;
 }
 
 static struct shrinker glock_shrinker = {
-	.shrink = gfs2_shrink_glock_memory,
+	.count_objects = gfs2_glock_shrink_count,
+	.scan_objects = gfs2_glock_shrink_scan,
 	.seeks = DEFAULT_SEEKS,
 };
 
diff --git a/fs/gfs2/main.c b/fs/gfs2/main.c
index e04d0e0..a105d84 100644
--- a/fs/gfs2/main.c
+++ b/fs/gfs2/main.c
@@ -32,7 +32,8 @@
 struct workqueue_struct *gfs2_control_wq;
 
 static struct shrinker qd_shrinker = {
-	.shrink = gfs2_shrink_qd_memory,
+	.count_objects = gfs2_qd_shrink_count,
+	.scan_objects = gfs2_qd_shrink_scan,
 	.seeks = DEFAULT_SEEKS,
 };
 
diff --git a/fs/gfs2/quota.c b/fs/gfs2/quota.c
index c5af8e1..f4bf289 100644
--- a/fs/gfs2/quota.c
+++ b/fs/gfs2/quota.c
@@ -78,14 +78,12 @@ static LIST_HEAD(qd_lru_list);
 static atomic_t qd_lru_count = ATOMIC_INIT(0);
 static DEFINE_SPINLOCK(qd_lru_lock);
 
-int gfs2_shrink_qd_memory(struct shrinker *shrink, struct shrink_control *sc)
+long gfs2_qd_shrink_scan(struct shrinker *shrink, struct shrink_control *sc)
 {
 	struct gfs2_quota_data *qd;
 	struct gfs2_sbd *sdp;
 	int nr_to_scan = sc->nr_to_scan;
-
-	if (nr_to_scan == 0)
-		goto out;
+	long freed = 0;
 
 	if (!(sc->gfp_mask & __GFP_FS))
 		return -1;
@@ -113,10 +111,14 @@ int gfs2_shrink_qd_memory(struct shrinker *shrink, struct shrink_control *sc)
 		kmem_cache_free(gfs2_quotad_cachep, qd);
 		spin_lock(&qd_lru_lock);
 		nr_to_scan--;
+		freed++;
 	}
 	spin_unlock(&qd_lru_lock);
+	return freed;
+}
 
-out:
+long gfs2_qd_shrink_count(struct shrinker *shrink, struct shrink_control *sc)
+{
 	return (atomic_read(&qd_lru_count) * sysctl_vfs_cache_pressure) / 100;
 }
 
diff --git a/fs/gfs2/quota.h b/fs/gfs2/quota.h
index f25d98b..e0bd306 100644
--- a/fs/gfs2/quota.h
+++ b/fs/gfs2/quota.h
@@ -52,7 +52,9 @@ static inline int gfs2_quota_lock_check(struct gfs2_inode *ip)
 	return ret;
 }
 
-extern int gfs2_shrink_qd_memory(struct shrinker *shrink,
+extern long gfs2_qd_shrink_count(struct shrinker *shrink,
+				 struct shrink_control *sc);
+extern long gfs2_qd_shrink_scan(struct shrinker *shrink,
 				 struct shrink_control *sc);
 extern const struct quotactl_ops gfs2_quotactl_ops;
 
diff --git a/fs/mbcache.c b/fs/mbcache.c
index 8c32ef3..04bf2fb 100644
--- a/fs/mbcache.c
+++ b/fs/mbcache.c
@@ -86,18 +86,6 @@ static LIST_HEAD(mb_cache_list);
 static LIST_HEAD(mb_cache_lru_list);
 static DEFINE_SPINLOCK(mb_cache_spinlock);
 
-/*
- * What the mbcache registers as to get shrunk dynamically.
- */
-
-static int mb_cache_shrink_fn(struct shrinker *shrink,
-			      struct shrink_control *sc);
-
-static struct shrinker mb_cache_shrinker = {
-	.shrink = mb_cache_shrink_fn,
-	.seeks = DEFAULT_SEEKS,
-};
-
 static inline int
 __mb_cache_entry_is_hashed(struct mb_cache_entry *ce)
 {
@@ -151,7 +139,7 @@ forget:
 
 
 /*
- * mb_cache_shrink_fn()  memory pressure callback
+ * mb_cache_shrink_scan()  memory pressure callback
  *
  * This function is called by the kernel memory management when memory
  * gets low.
@@ -159,17 +147,18 @@ forget:
  * @shrink: (ignored)
  * @sc: shrink_control passed from reclaim
  *
- * Returns the number of objects which are present in the cache.
+ * Returns the number of objects freed.
  */
-static int
-mb_cache_shrink_fn(struct shrinker *shrink, struct shrink_control *sc)
+static long
+mb_cache_shrink_scan(
+	struct shrinker		*shrink,
+	struct shrink_control	*sc)
 {
 	LIST_HEAD(free_list);
-	struct mb_cache *cache;
 	struct mb_cache_entry *entry, *tmp;
-	int count = 0;
 	int nr_to_scan = sc->nr_to_scan;
 	gfp_t gfp_mask = sc->gfp_mask;
+	long freed = 0;
 
 	mb_debug("trying to free %d entries", nr_to_scan);
 	spin_lock(&mb_cache_spinlock);
@@ -179,19 +168,39 @@ mb_cache_shrink_fn(struct shrinker *shrink, struct shrink_control *sc)
 				   struct mb_cache_entry, e_lru_list);
 		list_move_tail(&ce->e_lru_list, &free_list);
 		__mb_cache_entry_unhash(ce);
+		freed++;
+	}
+	spin_unlock(&mb_cache_spinlock);
+	list_for_each_entry_safe(entry, tmp, &free_list, e_lru_list) {
+		__mb_cache_entry_forget(entry, gfp_mask);
 	}
+	return freed;
+}
+
+static long
+mb_cache_shrink_count(
+	struct shrinker		*shrink,
+	struct shrink_control	*sc)
+{
+	struct mb_cache *cache;
+	long count = 0;
+
+	spin_lock(&mb_cache_spinlock);
 	list_for_each_entry(cache, &mb_cache_list, c_cache_list) {
 		mb_debug("cache %s (%d)", cache->c_name,
 			  atomic_read(&cache->c_entry_count));
 		count += atomic_read(&cache->c_entry_count);
 	}
 	spin_unlock(&mb_cache_spinlock);
-	list_for_each_entry_safe(entry, tmp, &free_list, e_lru_list) {
-		__mb_cache_entry_forget(entry, gfp_mask);
-	}
-	return (count / 100) * sysctl_vfs_cache_pressure;
+	return count;
 }
 
+static struct shrinker mb_cache_shrinker = {
+	.count_objects = mb_cache_shrink_count,
+	.scan_objects = mb_cache_shrink_scan,
+	.seeks = DEFAULT_SEEKS,
+};
+
 
 /*
  * mb_cache_create()  create a new cache
diff --git a/fs/nfs/dir.c b/fs/nfs/dir.c
index ce8cb92..15f6fbb 100644
--- a/fs/nfs/dir.c
+++ b/fs/nfs/dir.c
@@ -1909,17 +1909,20 @@ static void nfs_access_free_list(struct list_head *head)
 	}
 }
 
-int nfs_access_cache_shrinker(struct shrinker *shrink,
-			      struct shrink_control *sc)
+long
+nfs_access_cache_scan(
+	struct shrinker		*shrink,
+	struct shrink_control	*sc)
 {
 	LIST_HEAD(head);
 	struct nfs_inode *nfsi, *next;
 	struct nfs_access_entry *cache;
 	int nr_to_scan = sc->nr_to_scan;
 	gfp_t gfp_mask = sc->gfp_mask;
+	long freed = 0;
 
 	if ((gfp_mask & GFP_KERNEL) != GFP_KERNEL)
-		return (nr_to_scan == 0) ? 0 : -1;
+		return -1;
 
 	spin_lock(&nfs_access_lru_lock);
 	list_for_each_entry_safe(nfsi, next, &nfs_access_lru_list, access_cache_inode_lru) {
@@ -1935,6 +1938,7 @@ int nfs_access_cache_shrinker(struct shrinker *shrink,
 				struct nfs_access_entry, lru);
 		list_move(&cache->lru, &head);
 		rb_erase(&cache->rb_node, &nfsi->access_cache);
+		freed++;
 		if (!list_empty(&nfsi->access_cache_entry_lru))
 			list_move_tail(&nfsi->access_cache_inode_lru,
 					&nfs_access_lru_list);
@@ -1949,7 +1953,16 @@ remove_lru_entry:
 	}
 	spin_unlock(&nfs_access_lru_lock);
 	nfs_access_free_list(&head);
-	return (atomic_long_read(&nfs_access_nr_entries) / 100) * sysctl_vfs_cache_pressure;
+	return freed;
+}
+
+long
+nfs_access_cache_count(
+	struct shrinker		*shrink,
+	struct shrink_control	*sc)
+{
+	return (atomic_long_read(&nfs_access_nr_entries) / 100) *
+						sysctl_vfs_cache_pressure;
 }
 
 static void __nfs_access_zap_cache(struct nfs_inode *nfsi, struct list_head *head)
diff --git a/fs/nfs/internal.h b/fs/nfs/internal.h
index 05521ca..6b9c45b 100644
--- a/fs/nfs/internal.h
+++ b/fs/nfs/internal.h
@@ -285,7 +285,9 @@ extern struct nfs_client *nfs_init_client(struct nfs_client *clp,
 			   const char *ip_addr, rpc_authflavor_t authflavour);
 
 /* dir.c */
-extern int nfs_access_cache_shrinker(struct shrinker *shrink,
+extern long nfs_access_cache_count(struct shrinker *shrink,
+					struct shrink_control *sc);
+extern long nfs_access_cache_scan(struct shrinker *shrink,
 					struct shrink_control *sc);
 struct dentry *nfs_lookup(struct inode *, struct dentry *, unsigned int);
 int nfs_create(struct inode *, struct dentry *, umode_t, bool);
diff --git a/fs/nfs/super.c b/fs/nfs/super.c
index 652d3f7..e7dd232 100644
--- a/fs/nfs/super.c
+++ b/fs/nfs/super.c
@@ -354,7 +354,8 @@ static void unregister_nfs4_fs(void)
 #endif
 
 static struct shrinker acl_shrinker = {
-	.shrink		= nfs_access_cache_shrinker,
+	.count_objects	= nfs_access_cache_count,
+	.scan_objects	= nfs_access_cache_scan,
 	.seeks		= DEFAULT_SEEKS,
 };
 
diff --git a/fs/quota/dquot.c b/fs/quota/dquot.c
index 05ae3c9..544bd65 100644
--- a/fs/quota/dquot.c
+++ b/fs/quota/dquot.c
@@ -687,45 +687,42 @@ int dquot_quota_sync(struct super_block *sb, int type)
 }
 EXPORT_SYMBOL(dquot_quota_sync);
 
-/* Free unused dquots from cache */
-static void prune_dqcache(int count)
+static long
+dqcache_shrink_scan(
+	struct shrinker		*shrink,
+	struct shrink_control	*sc)
 {
 	struct list_head *head;
 	struct dquot *dquot;
+	long freed = 0;
 
 	head = free_dquots.prev;
-	while (head != &free_dquots && count) {
+	while (head != &free_dquots && sc->nr_to_scan) {
 		dquot = list_entry(head, struct dquot, dq_free);
 		remove_dquot_hash(dquot);
 		remove_free_dquot(dquot);
 		remove_inuse(dquot);
 		do_destroy_dquot(dquot);
-		count--;
+		sc->nr_to_scan--;
+		freed++;
 		head = free_dquots.prev;
 	}
+	return freed;
 }
 
-/*
- * This is called from kswapd when we think we need some
- * more memory
- */
-static int shrink_dqcache_memory(struct shrinker *shrink,
-				 struct shrink_control *sc)
+static long
+dqcache_shrink_count(
+	struct shrinker		*shrink,
+	struct shrink_control	*sc)
 {
-	int nr = sc->nr_to_scan;
-
-	if (nr) {
-		spin_lock(&dq_list_lock);
-		prune_dqcache(nr);
-		spin_unlock(&dq_list_lock);
-	}
-	return ((unsigned)
+	return ((long)
 		percpu_counter_read_positive(&dqstats.counter[DQST_FREE_DQUOTS])
-		/100) * sysctl_vfs_cache_pressure;
+		/ 100) * sysctl_vfs_cache_pressure;
 }
 
 static struct shrinker dqcache_shrinker = {
-	.shrink = shrink_dqcache_memory,
+	.count_objects = dqcache_shrink_count,
+	.scan_objects = dqcache_shrink_scan,
 	.seeks = DEFAULT_SEEKS,
 };
 
-- 
1.7.10

^ permalink raw reply related	[flat|nested] 57+ messages in thread

* [PATCH 17/19] drivers: convert shrinkers to new count/scan API
  2012-11-27 23:14 [RFC, PATCH 00/19] Numa aware LRU lists and shrinkers Dave Chinner
                   ` (15 preceding siblings ...)
  2012-11-27 23:14 ` [PATCH 16/19] fs: convert fs shrinkers to new scan/count API Dave Chinner
@ 2012-11-27 23:14 ` Dave Chinner
  2012-11-28  1:13   ` Chris Wilson
  2013-06-07 13:37   ` Konrad Rzeszutek Wilk
  2012-11-27 23:14 ` [PATCH 18/19] shrinker: convert remaining shrinkers to " Dave Chinner
                   ` (4 subsequent siblings)
  21 siblings, 2 replies; 57+ messages in thread
From: Dave Chinner @ 2012-11-27 23:14 UTC (permalink / raw)
  To: glommer; +Cc: linux-kernel, linux-fsdevel, linux-mm, xfs

From: Dave Chinner <dchinner@redhat.com>

Convert the driver shrinkers to the new API. Most changes are
compile tested only because I either don't have the hardware or it's
staging stuff.

FWIW, the md and android code is pretty good, but the rest of it
makes me want to claw my eyes out.  The amount of broken code I just
encountered is mind boggling.  I've added comments explaining what
is broken, but I fear that some of the code would be best dealt with
by being dragged behind the bike shed, buried in mud up to its
neck and then run over repeatedly with a blunt lawn mower.

Special mention goes to the zcache/zcache2 drivers. They can't
co-exist in the build at the same time, they are under different
menu options in menuconfig, they only show up when you've got the
right set of mm subsystem options configured and so even compile
testing is an exercise in pulling teeth.  And that doesn't even take
into account the horrible, broken code...

Signed-off-by: Dave Chinner <dchinner@redhat.com>
---
 drivers/gpu/drm/i915/i915_dma.c           |    4 +-
 drivers/gpu/drm/i915/i915_gem.c           |   64 +++++++++++++++++++++-------
 drivers/gpu/drm/ttm/ttm_page_alloc.c      |   48 ++++++++++++++-------
 drivers/gpu/drm/ttm/ttm_page_alloc_dma.c  |   55 +++++++++++++++---------
 drivers/md/dm-bufio.c                     |   65 +++++++++++++++++++----------
 drivers/staging/android/ashmem.c          |   44 ++++++++++++-------
 drivers/staging/android/lowmemorykiller.c |   60 +++++++++++++++++---------
 drivers/staging/ramster/zcache-main.c     |   58 ++++++++++++++++++-------
 drivers/staging/zcache/zcache-main.c      |   40 ++++++++++--------
 9 files changed, 297 insertions(+), 141 deletions(-)

diff --git a/drivers/gpu/drm/i915/i915_dma.c b/drivers/gpu/drm/i915/i915_dma.c
index 61ae104..0ddec32 100644
--- a/drivers/gpu/drm/i915/i915_dma.c
+++ b/drivers/gpu/drm/i915/i915_dma.c
@@ -1658,7 +1658,7 @@ int i915_driver_load(struct drm_device *dev, unsigned long flags)
 	return 0;
 
 out_gem_unload:
-	if (dev_priv->mm.inactive_shrinker.shrink)
+	if (dev_priv->mm.inactive_shrinker.scan_objects)
 		unregister_shrinker(&dev_priv->mm.inactive_shrinker);
 
 	if (dev->pdev->msi_enabled)
@@ -1695,7 +1695,7 @@ int i915_driver_unload(struct drm_device *dev)
 
 	i915_teardown_sysfs(dev);
 
-	if (dev_priv->mm.inactive_shrinker.shrink)
+	if (dev_priv->mm.inactive_shrinker.scan_objects)
 		unregister_shrinker(&dev_priv->mm.inactive_shrinker);
 
 	mutex_lock(&dev->struct_mutex);
diff --git a/drivers/gpu/drm/i915/i915_gem.c b/drivers/gpu/drm/i915/i915_gem.c
index 107f09b..ceab752 100644
--- a/drivers/gpu/drm/i915/i915_gem.c
+++ b/drivers/gpu/drm/i915/i915_gem.c
@@ -53,8 +53,10 @@ static void i915_gem_object_update_fence(struct drm_i915_gem_object *obj,
 					 struct drm_i915_fence_reg *fence,
 					 bool enable);
 
-static int i915_gem_inactive_shrink(struct shrinker *shrinker,
+static long i915_gem_inactive_count(struct shrinker *shrinker,
 				    struct shrink_control *sc);
+static long i915_gem_inactive_scan(struct shrinker *shrinker,
+				   struct shrink_control *sc);
 static long i915_gem_purge(struct drm_i915_private *dev_priv, long target);
 static void i915_gem_shrink_all(struct drm_i915_private *dev_priv);
 static void i915_gem_object_truncate(struct drm_i915_gem_object *obj);
@@ -4197,7 +4199,8 @@ i915_gem_load(struct drm_device *dev)
 
 	dev_priv->mm.interruptible = true;
 
-	dev_priv->mm.inactive_shrinker.shrink = i915_gem_inactive_shrink;
+	dev_priv->mm.inactive_shrinker.count_objects = i915_gem_inactive_count;
+	dev_priv->mm.inactive_shrinker.scan_objects = i915_gem_inactive_scan;
 	dev_priv->mm.inactive_shrinker.seeks = DEFAULT_SEEKS;
 	register_shrinker(&dev_priv->mm.inactive_shrinker);
 }
@@ -4407,35 +4410,64 @@ void i915_gem_release(struct drm_device *dev, struct drm_file *file)
 	spin_unlock(&file_priv->mm.lock);
 }
 
-static int
-i915_gem_inactive_shrink(struct shrinker *shrinker, struct shrink_control *sc)
+/*
+ * XXX: (dchinner) This is one of the worst cases of shrinker abuse I've seen.
+ *
+ * i915_gem_purge() expects a byte count to be passed, and the minimum object
+ * size is PAGE_SIZE. The shrinker doesn't work on bytes - it works on
+ * *objects*. So it passes a nr_to_scan of 128 objects, which is interpreted
+ * here to mean "free 128 bytes". That means a single object will be freed, as
+ * the minimum object size is a page.
+ *
+ * But the craziest part comes when i915_gem_purge() has walked all the objects
+ * and can't free any memory. That results in i915_gem_shrink_all() being
+ * called, which idles the GPU and frees everything the driver has in its
+ * active and inactive lists. It's basically hitting the driver with a great big
+ * hammer because it was busy doing stuff when something else generated memory
+ * pressure. This doesn't seem particularly wise...
+ */
+static long
+i915_gem_inactive_scan(struct shrinker *shrinker, struct shrink_control *sc)
 {
 	struct drm_i915_private *dev_priv =
 		container_of(shrinker,
 			     struct drm_i915_private,
 			     mm.inactive_shrinker);
 	struct drm_device *dev = dev_priv->dev;
-	struct drm_i915_gem_object *obj;
-	int nr_to_scan = sc->nr_to_scan;
-	int cnt;
+	long freed = 0;
 
 	if (!mutex_trylock(&dev->struct_mutex))
 		return 0;
 
-	if (nr_to_scan) {
-		nr_to_scan -= i915_gem_purge(dev_priv, nr_to_scan);
-		if (nr_to_scan > 0)
-			i915_gem_shrink_all(dev_priv);
-	}
+	freed = i915_gem_purge(dev_priv, sc->nr_to_scan);
+	if (freed < sc->nr_to_scan)
+		i915_gem_shrink_all(dev_priv);
+
+	mutex_unlock(&dev->struct_mutex);
+	return freed;
+}
+
+static long
+i915_gem_inactive_count(struct shrinker *shrinker, struct shrink_control *sc)
+{
+	struct drm_i915_private *dev_priv =
+		container_of(shrinker,
+			     struct drm_i915_private,
+			     mm.inactive_shrinker);
+	struct drm_device *dev = dev_priv->dev;
+	struct drm_i915_gem_object *obj;
+	long count = 0;
+
+	if (!mutex_trylock(&dev->struct_mutex))
+		return 0;
 
-	cnt = 0;
 	list_for_each_entry(obj, &dev_priv->mm.unbound_list, gtt_list)
 		if (obj->pages_pin_count == 0)
-			cnt += obj->base.size >> PAGE_SHIFT;
+			count += obj->base.size >> PAGE_SHIFT;
 	list_for_each_entry(obj, &dev_priv->mm.bound_list, gtt_list)
 		if (obj->pin_count == 0 && obj->pages_pin_count == 0)
-			cnt += obj->base.size >> PAGE_SHIFT;
+			count += obj->base.size >> PAGE_SHIFT;
 
 	mutex_unlock(&dev->struct_mutex);
-	return cnt;
+	return count;
 }
diff --git a/drivers/gpu/drm/ttm/ttm_page_alloc.c b/drivers/gpu/drm/ttm/ttm_page_alloc.c
index bd2a3b4..83058a2 100644
--- a/drivers/gpu/drm/ttm/ttm_page_alloc.c
+++ b/drivers/gpu/drm/ttm/ttm_page_alloc.c
@@ -377,28 +377,28 @@ out:
 	return nr_free;
 }
 
-/* Get good estimation how many pages are free in pools */
-static int ttm_pool_get_num_unused_pages(void)
-{
-	unsigned i;
-	int total = 0;
-	for (i = 0; i < NUM_POOLS; ++i)
-		total += _manager->pools[i].npages;
-
-	return total;
-}
-
 /**
  * Callback for mm to request pool to reduce number of page held.
+ *
+ * XXX: (dchinner) Deadlock warning!
+ *
+ * ttm_page_pool_free() does memory allocation using GFP_KERNEL. That means
+ * this can deadlock when called with a sc->gfp_mask that is not equal to
+ * GFP_KERNEL.
+ *
+ * This code is crying out for a shrinker per pool....
  */
-static int ttm_pool_mm_shrink(struct shrinker *shrink,
-			      struct shrink_control *sc)
+static long
+ttm_pool_shrink_scan(
+	struct shrinker		*shrink,
+	struct shrink_control	*sc)
 {
 	static atomic_t start_pool = ATOMIC_INIT(0);
 	unsigned i;
 	unsigned pool_offset = atomic_add_return(1, &start_pool);
 	struct ttm_page_pool *pool;
 	int shrink_pages = sc->nr_to_scan;
+	long freed = 0;
 
 	pool_offset = pool_offset % NUM_POOLS;
 	/* select start pool in round robin fashion */
@@ -408,14 +408,30 @@ static int ttm_pool_mm_shrink(struct shrinker *shrink,
 			break;
 		pool = &_manager->pools[(i + pool_offset)%NUM_POOLS];
 		shrink_pages = ttm_page_pool_free(pool, nr_free);
+		freed += nr_free - shrink_pages;
 	}
-	/* return estimated number of unused pages in pool */
-	return ttm_pool_get_num_unused_pages();
+	return freed;
+}
+
+
+static long
+ttm_pool_shrink_count(
+	struct shrinker		*shrink,
+	struct shrink_control	*sc)
+{
+	unsigned i;
+	long count = 0;
+
+	for (i = 0; i < NUM_POOLS; ++i)
+		count += _manager->pools[i].npages;
+
+	return count;
 }
 
 static void ttm_pool_mm_shrink_init(struct ttm_pool_manager *manager)
 {
-	manager->mm_shrink.shrink = &ttm_pool_mm_shrink;
+	manager->mm_shrink.count_objects = &ttm_pool_shrink_count;
+	manager->mm_shrink.scan_objects = &ttm_pool_shrink_scan;
 	manager->mm_shrink.seeks = 1;
 	register_shrinker(&manager->mm_shrink);
 }
diff --git a/drivers/gpu/drm/ttm/ttm_page_alloc_dma.c b/drivers/gpu/drm/ttm/ttm_page_alloc_dma.c
index b8b3943..b3b4f99 100644
--- a/drivers/gpu/drm/ttm/ttm_page_alloc_dma.c
+++ b/drivers/gpu/drm/ttm/ttm_page_alloc_dma.c
@@ -918,19 +918,6 @@ int ttm_dma_populate(struct ttm_dma_tt *ttm_dma, struct device *dev)
 }
 EXPORT_SYMBOL_GPL(ttm_dma_populate);
 
-/* Get good estimation how many pages are free in pools */
-static int ttm_dma_pool_get_num_unused_pages(void)
-{
-	struct device_pools *p;
-	unsigned total = 0;
-
-	mutex_lock(&_manager->lock);
-	list_for_each_entry(p, &_manager->pools, pools)
-		total += p->pool->npages_free;
-	mutex_unlock(&_manager->lock);
-	return total;
-}
-
 /* Put all pages in pages list to correct pool to wait for reuse */
 void ttm_dma_unpopulate(struct ttm_dma_tt *ttm_dma, struct device *dev)
 {
@@ -1002,18 +989,31 @@ EXPORT_SYMBOL_GPL(ttm_dma_unpopulate);
 
 /**
  * Callback for mm to request pool to reduce number of page held.
+ *
+ * XXX: (dchinner) Deadlock warning!
+ *
+ * ttm_dma_page_pool_free() does GFP_KERNEL memory allocation, and so attention
+ * needs to be paid to sc->gfp_mask to determine if this can be done or not.
+ * GFP_KERNEL memory allocation in a GFP_ATOMIC reclaim context would be
+ * really bad.
+ *
+ * I'm getting sadder as I hear more pathetical whimpers about needing per-pool
+ * shrinkers
  */
-static int ttm_dma_pool_mm_shrink(struct shrinker *shrink,
-				  struct shrink_control *sc)
+static long
+ttm_dma_pool_shrink_scan(
+	struct shrinker		*shrink,
+	struct shrink_control	*sc)
 {
 	static atomic_t start_pool = ATOMIC_INIT(0);
 	unsigned idx = 0;
 	unsigned pool_offset = atomic_add_return(1, &start_pool);
 	unsigned shrink_pages = sc->nr_to_scan;
 	struct device_pools *p;
+	long freed = 0;
 
 	if (list_empty(&_manager->pools))
-		return 0;
+		return -1;
 
 	mutex_lock(&_manager->lock);
 	pool_offset = pool_offset % _manager->npools;
@@ -1029,18 +1029,35 @@ static int ttm_dma_pool_mm_shrink(struct shrinker *shrink,
 			continue;
 		nr_free = shrink_pages;
 		shrink_pages = ttm_dma_page_pool_free(p->pool, nr_free);
+		freed += nr_free - shrink_pages;
+
 		pr_debug("%s: (%s:%d) Asked to shrink %d, have %d more to go\n",
 			 p->pool->dev_name, p->pool->name, current->pid,
 			 nr_free, shrink_pages);
 	}
 	mutex_unlock(&_manager->lock);
-	/* return estimated number of unused pages in pool */
-	return ttm_dma_pool_get_num_unused_pages();
+	return freed;
+}
+
+static long
+ttm_dma_pool_shrink_count(
+	struct shrinker		*shrink,
+	struct shrink_control	*sc)
+{
+	struct device_pools *p;
+	long count = 0;
+
+	mutex_lock(&_manager->lock);
+	list_for_each_entry(p, &_manager->pools, pools)
+		count += p->pool->npages_free;
+	mutex_unlock(&_manager->lock);
+	return count;
 }
 
 static void ttm_dma_pool_mm_shrink_init(struct ttm_pool_manager *manager)
 {
-	manager->mm_shrink.shrink = &ttm_dma_pool_mm_shrink;
+	manager->mm_shrink.count_objects = &ttm_dma_pool_shrink_count;
+	manager->mm_shrink.scan_objects = &ttm_dma_pool_shrink_scan;
 	manager->mm_shrink.seeks = 1;
 	register_shrinker(&manager->mm_shrink);
 }
diff --git a/drivers/md/dm-bufio.c b/drivers/md/dm-bufio.c
index 651ca79..0898bf5 100644
--- a/drivers/md/dm-bufio.c
+++ b/drivers/md/dm-bufio.c
@@ -1359,62 +1359,80 @@ static int __cleanup_old_buffer(struct dm_buffer *b, gfp_t gfp,
 				unsigned long max_jiffies)
 {
 	if (jiffies - b->last_accessed < max_jiffies)
-		return 1;
+		return 0;
 
 	if (!(gfp & __GFP_IO)) {
 		if (test_bit(B_READING, &b->state) ||
 		    test_bit(B_WRITING, &b->state) ||
 		    test_bit(B_DIRTY, &b->state))
-			return 1;
+			return 0;
 	}
 
 	if (b->hold_count)
-		return 1;
+		return 0;
 
 	__make_buffer_clean(b);
 	__unlink_buffer(b);
 	__free_buffer_wake(b);
 
-	return 0;
+	return 1;
 }
 
-static void __scan(struct dm_bufio_client *c, unsigned long nr_to_scan,
-		   struct shrink_control *sc)
+static long __scan(struct dm_bufio_client *c, unsigned long nr_to_scan,
+		   gfp_t gfp_mask)
 {
 	int l;
 	struct dm_buffer *b, *tmp;
+	long freed = 0;
 
 	for (l = 0; l < LIST_SIZE; l++) {
-		list_for_each_entry_safe_reverse(b, tmp, &c->lru[l], lru_list)
-			if (!__cleanup_old_buffer(b, sc->gfp_mask, 0) &&
-			    !--nr_to_scan)
-				return;
+		list_for_each_entry_safe_reverse(b, tmp, &c->lru[l], lru_list) {
+			freed += __cleanup_old_buffer(b, gfp_mask, 0);
+			if (!--nr_to_scan)
+				break;
+		}
 		dm_bufio_cond_resched();
 	}
+	return freed;
 }
 
-static int shrink(struct shrinker *shrinker, struct shrink_control *sc)
+static long
+dm_bufio_shrink_scan(
+	struct shrinker		*shrink,
+	struct shrink_control	*sc)
 {
 	struct dm_bufio_client *c =
-	    container_of(shrinker, struct dm_bufio_client, shrinker);
-	unsigned long r;
-	unsigned long nr_to_scan = sc->nr_to_scan;
+	    container_of(shrink, struct dm_bufio_client, shrinker);
+	long freed;
 
 	if (sc->gfp_mask & __GFP_IO)
 		dm_bufio_lock(c);
 	else if (!dm_bufio_trylock(c))
-		return !nr_to_scan ? 0 : -1;
+		return -1;
 
-	if (nr_to_scan)
-		__scan(c, nr_to_scan, sc);
+	freed  = __scan(c, sc->nr_to_scan, sc->gfp_mask);
+	dm_bufio_unlock(c);
+	return freed;
+}
 
-	r = c->n_buffers[LIST_CLEAN] + c->n_buffers[LIST_DIRTY];
-	if (r > INT_MAX)
-		r = INT_MAX;
+static long
+dm_bufio_shrink_count(
+	struct shrinker		*shrink,
+	struct shrink_control	*sc)
+{
+	struct dm_bufio_client *c =
+	    container_of(shrink, struct dm_bufio_client, shrinker);
+	long count;
+
+	if (sc->gfp_mask & __GFP_IO)
+		dm_bufio_lock(c);
+	else if (!dm_bufio_trylock(c))
+		return 0;
 
+	count = c->n_buffers[LIST_CLEAN] + c->n_buffers[LIST_DIRTY];
 	dm_bufio_unlock(c);
+	return count;
 
-	return r;
 }
 
 /*
@@ -1516,7 +1534,8 @@ struct dm_bufio_client *dm_bufio_client_create(struct block_device *bdev, unsign
 	__cache_size_refresh();
 	mutex_unlock(&dm_bufio_clients_lock);
 
-	c->shrinker.shrink = shrink;
+	c->shrinker.count_objects = dm_bufio_shrink_count;
+	c->shrinker.scan_objects = dm_bufio_shrink_scan;
 	c->shrinker.seeks = 1;
 	c->shrinker.batch = 0;
 	register_shrinker(&c->shrinker);
@@ -1603,7 +1622,7 @@ static void cleanup_old_buffers(void)
 			struct dm_buffer *b;
 			b = list_entry(c->lru[LIST_CLEAN].prev,
 				       struct dm_buffer, lru_list);
-			if (__cleanup_old_buffer(b, 0, max_age * HZ))
+			if (!__cleanup_old_buffer(b, 0, max_age * HZ))
 				break;
 			dm_bufio_cond_resched();
 		}
diff --git a/drivers/staging/android/ashmem.c b/drivers/staging/android/ashmem.c
index 634b9ae..30f9f8e 100644
--- a/drivers/staging/android/ashmem.c
+++ b/drivers/staging/android/ashmem.c
@@ -341,27 +341,28 @@ out:
 /*
  * ashmem_shrink - our cache shrinker, called from mm/vmscan.c :: shrink_slab
  *
- * 'nr_to_scan' is the number of objects (pages) to prune, or 0 to query how
- * many objects (pages) we have in total.
+ * 'nr_to_scan' is the number of objects to scan for freeing.
  *
  * 'gfp_mask' is the mask of the allocation that got us into this mess.
  *
- * Return value is the number of objects (pages) remaining, or -1 if we cannot
+ * Return value is the number of objects freed or -1 if we cannot
  * proceed without risk of deadlock (due to gfp_mask).
  *
  * We approximate LRU via least-recently-unpinned, jettisoning unpinned partial
  * chunks of ashmem regions LRU-wise one-at-a-time until we hit 'nr_to_scan'
  * pages freed.
  */
-static int ashmem_shrink(struct shrinker *s, struct shrink_control *sc)
+static long
+ashmem_shrink_scan(
+	struct shrinker		*shrink,
+	struct shrink_control	*sc)
 {
 	struct ashmem_range *range, *next;
+	long freed = 0;
 
 	/* We might recurse into filesystem code, so bail out if necessary */
-	if (sc->nr_to_scan && !(sc->gfp_mask & __GFP_FS))
+	if (!(sc->gfp_mask & __GFP_FS))
 		return -1;
-	if (!sc->nr_to_scan)
-		return lru_count;
 
 	mutex_lock(&ashmem_mutex);
 	list_for_each_entry_safe(range, next, &ashmem_lru_list, lru) {
@@ -374,17 +375,34 @@ static int ashmem_shrink(struct shrinker *s, struct shrink_control *sc)
 		range->purged = ASHMEM_WAS_PURGED;
 		lru_del(range);
 
-		sc->nr_to_scan -= range_size(range);
-		if (sc->nr_to_scan <= 0)
+		freed += range_size(range);
+		if (--sc->nr_to_scan <= 0)
 			break;
 	}
 	mutex_unlock(&ashmem_mutex);
+	return freed;
+}
 
+static long
+ashmem_shrink_count(
+	struct shrinker		*shrink,
+	struct shrink_control	*sc)
+{
+	/*
+	 * Note that lru_count is a count of pages on the LRU, not a count of
+	 * objects on the list. This means the scan function needs to return the
+	 * number of pages freed, not the number of objects scanned.
+	 */
 	return lru_count;
 }
 
 static struct shrinker ashmem_shrinker = {
-	.shrink = ashmem_shrink,
+	.count_objects = ashmem_shrink_count,
+	.scan_objects = ashmem_shrink_scan,
+	/*
+	 * XXX (dchinner): I wish people would comment on why they need such
+	 * significant changes to the default value here.
+	 */
 	.seeks = DEFAULT_SEEKS * 4,
 };
 
@@ -671,11 +689,9 @@ static long ashmem_ioctl(struct file *file, unsigned int cmd, unsigned long arg)
 		if (capable(CAP_SYS_ADMIN)) {
 			struct shrink_control sc = {
 				.gfp_mask = GFP_KERNEL,
-				.nr_to_scan = 0,
+				.nr_to_scan = LONG_MAX,
 			};
-			ret = ashmem_shrink(&ashmem_shrinker, &sc);
-			sc.nr_to_scan = ret;
-			ashmem_shrink(&ashmem_shrinker, &sc);
+			ashmem_shrink_scan(&ashmem_shrinker, &sc);
 		}
 		break;
 	}
diff --git a/drivers/staging/android/lowmemorykiller.c b/drivers/staging/android/lowmemorykiller.c
index b91e4bc..2bf2c2f 100644
--- a/drivers/staging/android/lowmemorykiller.c
+++ b/drivers/staging/android/lowmemorykiller.c
@@ -63,11 +63,19 @@ static unsigned long lowmem_deathpending_timeout;
 			printk(x);			\
 	} while (0)
 
-static int lowmem_shrink(struct shrinker *s, struct shrink_control *sc)
+/*
+ * XXX (dchinner): this should all be using longs, not ints, as
+ * functions like global_page_state, get_mm_rss, etc all return longs or
+ * unsigned longs. Even the shrinker now uses longs....
+ */
+static long
+lowmem_shrink_scan(
+	struct shrinker		*shrink,
+	struct shrink_control	*sc)
 {
 	struct task_struct *tsk;
 	struct task_struct *selected = NULL;
-	int rem = 0;
+	long freed = 0;
 	int tasksize;
 	int i;
 	int min_score_adj = OOM_SCORE_ADJ_MAX + 1;
@@ -89,19 +97,17 @@ static int lowmem_shrink(struct shrinker *s, struct shrink_control *sc)
 			break;
 		}
 	}
-	if (sc->nr_to_scan > 0)
-		lowmem_print(3, "lowmem_shrink %lu, %x, ofree %d %d, ma %d\n",
-				sc->nr_to_scan, sc->gfp_mask, other_free,
-				other_file, min_score_adj);
-	rem = global_page_state(NR_ACTIVE_ANON) +
-		global_page_state(NR_ACTIVE_FILE) +
-		global_page_state(NR_INACTIVE_ANON) +
-		global_page_state(NR_INACTIVE_FILE);
-	if (sc->nr_to_scan <= 0 || min_score_adj == OOM_SCORE_ADJ_MAX + 1) {
-		lowmem_print(5, "lowmem_shrink %lu, %x, return %d\n",
-			     sc->nr_to_scan, sc->gfp_mask, rem);
-		return rem;
+	lowmem_print(3, "lowmem_shrink %lu, %x, ofree %d %d, ma %d\n",
+			sc->nr_to_scan, sc->gfp_mask, other_free,
+			other_file, min_score_adj);
+
+	if (min_score_adj == OOM_SCORE_ADJ_MAX + 1) {
+		/* nothing to do, no point in calling again */
+		lowmem_print(5, "lowmem_shrink %lu, %x, return -1\n",
+			     sc->nr_to_scan, sc->gfp_mask);
+		return -1;
 	}
+
 	selected_oom_score_adj = min_score_adj;
 
 	rcu_read_lock();
@@ -151,16 +157,32 @@ static int lowmem_shrink(struct shrinker *s, struct shrink_control *sc)
 		lowmem_deathpending_timeout = jiffies + HZ;
 		send_sig(SIGKILL, selected, 0);
 		set_tsk_thread_flag(selected, TIF_MEMDIE);
-		rem -= selected_tasksize;
+		freed += selected_tasksize;
 	}
-	lowmem_print(4, "lowmem_shrink %lu, %x, return %d\n",
-		     sc->nr_to_scan, sc->gfp_mask, rem);
+	lowmem_print(4, "lowmem_shrink %lu, %x, return %ld\n",
+		     sc->nr_to_scan, sc->gfp_mask, freed);
 	rcu_read_unlock();
-	return rem;
+	return freed;
+}
+
+static long
+lowmem_shrink_count(
+	struct shrinker		*shrink,
+	struct shrink_control	*sc)
+{
+	long count;
+
+	count = global_page_state(NR_ACTIVE_ANON) +
+		global_page_state(NR_ACTIVE_FILE) +
+		global_page_state(NR_INACTIVE_ANON) +
+		global_page_state(NR_INACTIVE_FILE);
+	return count;
 }
 
 static struct shrinker lowmem_shrinker = {
-	.shrink = lowmem_shrink,
+	.count_objects = lowmem_shrink_count,
+	.scan_objects = lowmem_shrink_scan,
+	/* why can't we document magic numbers? */
 	.seeks = DEFAULT_SEEKS * 16
 };
 
diff --git a/drivers/staging/ramster/zcache-main.c b/drivers/staging/ramster/zcache-main.c
index a09dd5c..7d50688 100644
--- a/drivers/staging/ramster/zcache-main.c
+++ b/drivers/staging/ramster/zcache-main.c
@@ -1054,12 +1054,13 @@ static bool zcache_freeze;
  * used by zcache to approximately the same as the total number of LRU_FILE
  * pageframes in use.
  */
-static int shrink_zcache_memory(struct shrinker *shrink,
-				struct shrink_control *sc)
+static long
+zcache_shrink_scan(
+	struct shrinker		*shrink,
+	struct shrink_control	*sc)
 {
 	static bool in_progress;
-	int ret = -1;
-	int nr = sc->nr_to_scan;
+	long freed = 0;
 	int nr_evict = 0;
 	int nr_unuse = 0;
 	struct page *page;
@@ -1067,15 +1068,23 @@ static int shrink_zcache_memory(struct shrinker *shrink,
 	int unuse_ret;
 #endif
 
-	if (nr <= 0)
-		goto skip_evict;
+	/*
+	 * XXX (dchinner): My kingdom for a mutex! There's no way this should
+	 * ever be allowed to move out of staging until it supports concurrent
+	 * shrinkers correctly.
+	 *
+	 * This whole shrinker is making me want to claw my eyes out. It has no
+	 * redeeming values whatsoever and I can't undo the damage it has
+	 * already done to my brain.
+	 */
 
 	/* don't allow more than one eviction thread at a time */
 	if (in_progress)
-		goto skip_evict;
+		return -1;
 
 	in_progress = true;
 
+
 	/* we are going to ignore nr, and target a different value */
 	zcache_last_active_file_pageframes =
 		global_page_state(NR_LRU_BASE + LRU_ACTIVE_FILE);
@@ -1083,11 +1092,13 @@ static int shrink_zcache_memory(struct shrinker *shrink,
 		global_page_state(NR_LRU_BASE + LRU_INACTIVE_FILE);
 	nr_evict = zcache_eph_pageframes - zcache_last_active_file_pageframes +
 		zcache_last_inactive_file_pageframes;
+
 	while (nr_evict-- > 0) {
 		page = zcache_evict_eph_pageframe();
 		if (page == NULL)
 			break;
 		zcache_free_page(page);
+		freed++;
 	}
 
 	zcache_last_active_anon_pageframes =
@@ -1104,25 +1115,42 @@ static int shrink_zcache_memory(struct shrinker *shrink,
 		unuse_ret = zcache_frontswap_unuse();
 		if (unuse_ret == -ENOMEM)
 			break;
+		freed++;
 	}
 #endif
 	in_progress = false;
+	return freed;
+}
+
+
+/*
+ * XXX (dchinner): the shrinker updates global variables? You've got to be
+ * kidding me! And the object count can (apparently) go negative - that's
+ * *always* a bug so be bloody noisy about it.
+ */
+static long
+zcache_shrink_count(
+	struct shrinker		*shrink,
+	struct shrink_control	*sc)
+{
+	long count;
 
-skip_evict:
-	/* resample: has changed, but maybe not all the way yet */
 	zcache_last_active_file_pageframes =
 		global_page_state(NR_LRU_BASE + LRU_ACTIVE_FILE);
 	zcache_last_inactive_file_pageframes =
 		global_page_state(NR_LRU_BASE + LRU_INACTIVE_FILE);
-	ret = zcache_eph_pageframes - zcache_last_active_file_pageframes +
-		zcache_last_inactive_file_pageframes;
-	if (ret < 0)
-		ret = 0;
-	return ret;
+
+	count = zcache_last_active_file_pageframes +
+		zcache_last_inactive_file_pageframes +
+		zcache_eph_pageframes;
+
+	WARN_ON_ONCE(count < 0);
+	return count < 0 ? 0 : count;
 }
 
 static struct shrinker zcache_shrinker = {
-	.shrink = shrink_zcache_memory,
+	.count_objects = zcache_shrink_count,
+	.scan_objects = zcache_shrink_scan,
 	.seeks = DEFAULT_SEEKS,
 };
 
diff --git a/drivers/staging/zcache/zcache-main.c b/drivers/staging/zcache/zcache-main.c
index 52b43b7..d17ab5d 100644
--- a/drivers/staging/zcache/zcache-main.c
+++ b/drivers/staging/zcache/zcache-main.c
@@ -536,10 +536,11 @@ static void zbud_evict_zbpg(struct zbud_page *zbpg)
  * page in use by another cpu, but also to avoid potential deadlock due to
  * lock inversion.
  */
-static void zbud_evict_pages(int nr)
+static long zbud_evict_pages(int nr)
 {
 	struct zbud_page *zbpg;
 	int i;
+	long freed = 0;
 
 	/* first try freeing any pages on unused list */
 retry_unused_list:
@@ -554,6 +555,7 @@ retry_unused_list:
 		spin_unlock_bh(&zbpg_unused_list_spinlock);
 		zcache_free_page(zbpg);
 		zcache_evicted_raw_pages++;
+		freed++;
 		if (--nr <= 0)
 			goto out;
 		goto retry_unused_list;
@@ -578,6 +580,7 @@ retry_unbud_list_i:
 			/* want budlists unlocked when doing zbpg eviction */
 			zbud_evict_zbpg(zbpg);
 			local_bh_enable();
+			freed++;
 			if (--nr <= 0)
 				goto out;
 			goto retry_unbud_list_i;
@@ -602,13 +605,14 @@ retry_bud_list:
 		/* want budlists unlocked when doing zbpg eviction */
 		zbud_evict_zbpg(zbpg);
 		local_bh_enable();
+		freed++;
 		if (--nr <= 0)
 			goto out;
 		goto retry_bud_list;
 	}
 	spin_unlock_bh(&zbud_budlists_spinlock);
 out:
-	return;
+	return freed;
 }
 
 static void __init zbud_init(void)
@@ -1527,26 +1531,28 @@ static bool zcache_freeze;
 /*
  * zcache shrinker interface (only useful for ephemeral pages, so zbud only)
  */
-static int shrink_zcache_memory(struct shrinker *shrink,
-				struct shrink_control *sc)
+static long
+zcache_shrink_scan(
+	struct shrinker		*shrink,
+	struct shrink_control	*sc)
 {
-	int ret = -1;
-	int nr = sc->nr_to_scan;
-	gfp_t gfp_mask = sc->gfp_mask;
+	if (!(sc->gfp_mask & __GFP_FS))
+		return -1;
 
-	if (nr >= 0) {
-		if (!(gfp_mask & __GFP_FS))
-			/* does this case really need to be skipped? */
-			goto out;
-		zbud_evict_pages(nr);
-	}
-	ret = (int)atomic_read(&zcache_zbud_curr_raw_pages);
-out:
-	return ret;
+	return zbud_evict_pages(sc->nr_to_scan);
+}
+
+static long
+zcache_shrink_count(
+	struct shrinker		*shrink,
+	struct shrink_control	*sc)
+{
+	return (long)atomic_read(&zcache_zbud_curr_raw_pages);
 }
 
 static struct shrinker zcache_shrinker = {
-	.shrink = shrink_zcache_memory,
+	.count_objects = zcache_shrink_count,
+	.scan_objects = zcache_shrink_scan,
 	.seeks = DEFAULT_SEEKS,
 };
 
-- 
1.7.10

^ permalink raw reply related	[flat|nested] 57+ messages in thread

* [PATCH 18/19] shrinker: convert remaining shrinkers to count/scan API
  2012-11-27 23:14 [RFC, PATCH 00/19] Numa aware LRU lists and shrinkers Dave Chinner
                   ` (16 preceding siblings ...)
  2012-11-27 23:14 ` [PATCH 17/19] drivers: convert shrinkers to new count/scan API Dave Chinner
@ 2012-11-27 23:14 ` Dave Chinner
  2012-11-27 23:14 ` [PATCH 19/19] shrinker: Kill old ->shrink API Dave Chinner
                   ` (3 subsequent siblings)
  21 siblings, 0 replies; 57+ messages in thread
From: Dave Chinner @ 2012-11-27 23:14 UTC (permalink / raw)
  To: glommer; +Cc: linux-kernel, linux-fsdevel, linux-mm, xfs

From: Dave Chinner <dchinner@redhat.com>

Convert the remaining couple of random shrinkers in the tree to the
new API.

Signed-off-by: Dave Chinner <dchinner@redhat.com>
---
 arch/x86/kvm/mmu.c |   35 +++++++++++++++++++++++++----------
 net/sunrpc/auth.c  |   45 +++++++++++++++++++++++++++++++--------------
 2 files changed, 56 insertions(+), 24 deletions(-)

diff --git a/arch/x86/kvm/mmu.c b/arch/x86/kvm/mmu.c
index 6f85fe0..3dbc3c0 100644
--- a/arch/x86/kvm/mmu.c
+++ b/arch/x86/kvm/mmu.c
@@ -4212,26 +4212,28 @@ restart:
 	spin_unlock(&kvm->mmu_lock);
 }
 
-static void kvm_mmu_remove_some_alloc_mmu_pages(struct kvm *kvm,
+static long kvm_mmu_remove_some_alloc_mmu_pages(struct kvm *kvm,
 						struct list_head *invalid_list)
 {
 	struct kvm_mmu_page *page;
 
 	if (list_empty(&kvm->arch.active_mmu_pages))
-		return;
+		return 0;
 
 	page = container_of(kvm->arch.active_mmu_pages.prev,
 			    struct kvm_mmu_page, link);
-	kvm_mmu_prepare_zap_page(kvm, page, invalid_list);
+	return kvm_mmu_prepare_zap_page(kvm, page, invalid_list);
 }
 
-static int mmu_shrink(struct shrinker *shrink, struct shrink_control *sc)
+
+static long
+mmu_shrink_scan(
+	struct shrinker		*shrink,
+	struct shrink_control	*sc)
 {
 	struct kvm *kvm;
 	int nr_to_scan = sc->nr_to_scan;
-
-	if (nr_to_scan == 0)
-		goto out;
+	long freed = 0;
 
 	raw_spin_lock(&kvm_lock);
 
@@ -4259,24 +4261,37 @@ static int mmu_shrink(struct shrinker *shrink, struct shrink_control *sc)
 		idx = srcu_read_lock(&kvm->srcu);
 		spin_lock(&kvm->mmu_lock);
 
-		kvm_mmu_remove_some_alloc_mmu_pages(kvm, &invalid_list);
+		freed += kvm_mmu_remove_some_alloc_mmu_pages(kvm, &invalid_list);
 		kvm_mmu_commit_zap_page(kvm, &invalid_list);
 
 		spin_unlock(&kvm->mmu_lock);
 		srcu_read_unlock(&kvm->srcu, idx);
 
+		/*
+		 * unfair on small ones
+		 * per-vm shrinkers cry out
+		 * sadness comes quickly
+		 */
 		list_move_tail(&kvm->vm_list, &vm_list);
 		break;
 	}
 
 	raw_spin_unlock(&kvm_lock);
+	return freed;
 
-out:
+}
+
+static long
+mmu_shrink_count(
+	struct shrinker		*shrink,
+	struct shrink_control	*sc)
+{
 	return percpu_counter_read_positive(&kvm_total_used_mmu_pages);
 }
 
 static struct shrinker mmu_shrinker = {
-	.shrink = mmu_shrink,
+	.count_objects = mmu_shrink_count,
+	.scan_objects = mmu_shrink_scan,
 	.seeks = DEFAULT_SEEKS * 10,
 };
 
diff --git a/net/sunrpc/auth.c b/net/sunrpc/auth.c
index b5c067b..969c629 100644
--- a/net/sunrpc/auth.c
+++ b/net/sunrpc/auth.c
@@ -340,12 +340,13 @@ EXPORT_SYMBOL_GPL(rpcauth_destroy_credcache);
 /*
  * Remove stale credentials. Avoid sleeping inside the loop.
  */
-static int
+static long
 rpcauth_prune_expired(struct list_head *free, int nr_to_scan)
 {
 	spinlock_t *cache_lock;
 	struct rpc_cred *cred, *next;
 	unsigned long expired = jiffies - RPC_AUTH_EXPIRY_MORATORIUM;
+	long freed = 0;
 
 	list_for_each_entry_safe(cred, next, &cred_unused, cr_lru) {
 
@@ -357,10 +358,11 @@ rpcauth_prune_expired(struct list_head *free, int nr_to_scan)
 		 */
 		if (time_in_range(cred->cr_expire, expired, jiffies) &&
 		    test_bit(RPCAUTH_CRED_HASHED, &cred->cr_flags) != 0)
-			return 0;
+			break;
 
 		list_del_init(&cred->cr_lru);
 		number_cred_unused--;
+		freed++;
 		if (atomic_read(&cred->cr_count) != 0)
 			continue;
 
@@ -373,29 +375,43 @@ rpcauth_prune_expired(struct list_head *free, int nr_to_scan)
 		}
 		spin_unlock(cache_lock);
 	}
-	return (number_cred_unused / 100) * sysctl_vfs_cache_pressure;
+	return freed;
 }
 
 /*
  * Run memory cache shrinker.
  */
-static int
-rpcauth_cache_shrinker(struct shrinker *shrink, struct shrink_control *sc)
+static long
+rpcauth_cache_shrink_scan(
+	struct shrinker		*shrink,
+	struct shrink_control	*sc)
+
 {
 	LIST_HEAD(free);
-	int res;
-	int nr_to_scan = sc->nr_to_scan;
-	gfp_t gfp_mask = sc->gfp_mask;
+	long freed;
+
+	if ((sc->gfp_mask & GFP_KERNEL) != GFP_KERNEL)
+		return -1;
 
-	if ((gfp_mask & GFP_KERNEL) != GFP_KERNEL)
-		return (nr_to_scan == 0) ? 0 : -1;
+	/* nothing left, don't come back */
 	if (list_empty(&cred_unused))
-		return 0;
+		return -1;
+
 	spin_lock(&rpc_credcache_lock);
-	res = rpcauth_prune_expired(&free, nr_to_scan);
+	freed = rpcauth_prune_expired(&free, sc->nr_to_scan);
 	spin_unlock(&rpc_credcache_lock);
 	rpcauth_destroy_credlist(&free);
-	return res;
+
+	return freed;
+}
+
+static long
+rpcauth_cache_shrink_count(
+	struct shrinker		*shrink,
+	struct shrink_control	*sc)
+
+{
+	return (number_cred_unused / 100) * sysctl_vfs_cache_pressure;
 }
 
 /*
@@ -712,7 +728,8 @@ rpcauth_uptodatecred(struct rpc_task *task)
 }
 
 static struct shrinker rpc_cred_shrinker = {
-	.shrink = rpcauth_cache_shrinker,
+	.count_objects = rpcauth_cache_shrink_count,
+	.scan_objects = rpcauth_cache_shrink_scan,
 	.seeks = DEFAULT_SEEKS,
 };
 
-- 
1.7.10

^ permalink raw reply related	[flat|nested] 57+ messages in thread

* [PATCH 19/19] shrinker: Kill old ->shrink API.
  2012-11-27 23:14 [RFC, PATCH 00/19] Numa aware LRU lists and shrinkers Dave Chinner
                   ` (17 preceding siblings ...)
  2012-11-27 23:14 ` [PATCH 18/19] shrinker: convert remaining shrinkers to " Dave Chinner
@ 2012-11-27 23:14 ` Dave Chinner
  2012-11-29 19:02 ` [RFC, PATCH 00/19] Numa aware LRU lists and shrinkers Andi Kleen
                   ` (2 subsequent siblings)
  21 siblings, 0 replies; 57+ messages in thread
From: Dave Chinner @ 2012-11-27 23:14 UTC (permalink / raw)
  To: glommer; +Cc: linux-kernel, linux-fsdevel, linux-mm, xfs

From: Dave Chinner <dchinner@redhat.com>

There are no more users of this API, so kill it dead, dead, dead and
quietly bury the corpse in a shallow, unmarked grave in a dark
forest deep in the hills...

Signed-off-by: Dave Chinner <dchinner@redhat.com>
---
 include/linux/shrinker.h      |   15 +++++----------
 include/trace/events/vmscan.h |    4 ++--
 mm/vmscan.c                   |   39 ++++++++-------------------------------
 3 files changed, 15 insertions(+), 43 deletions(-)

diff --git a/include/linux/shrinker.h b/include/linux/shrinker.h
index e71286f..d4636a0 100644
--- a/include/linux/shrinker.h
+++ b/include/linux/shrinker.h
@@ -7,14 +7,15 @@
  *
  * The 'gfpmask' refers to the allocation we are currently trying to
  * fulfil.
- *
- * Note that 'shrink' will be passed nr_to_scan == 0 when the VM is
- * querying the cache size, so a fastpath for that case is appropriate.
  */
 struct shrink_control {
 	gfp_t gfp_mask;
 
-	/* How many slab objects shrinker() should scan and try to reclaim */
+	/*
+	 * How many objects scan_objects should scan and try to reclaim.
+	 * This is reset before every call, so it is safe for callees
+	 * to modify.
+	 */
 	long nr_to_scan;
 
 	/* shrink from these nodes */
@@ -24,11 +25,6 @@ struct shrink_control {
 /*
  * A callback you can register to apply pressure to ageable caches.
  *
- * @shrink() should look through the least-recently-used 'nr_to_scan' entries
- * and attempt to free them up.  It should return the number of objects which
- * remain in the cache.  If it returns -1, it means it cannot do any scanning at
- * this time (eg. there is a risk of deadlock).
- *
  * @count_objects should return the number of freeable items in the cache. If
  * there are no objects to free or the number of freeable items cannot be
  * determined, it should return 0. No deadlock checks should be done during the
@@ -44,7 +40,6 @@ struct shrink_control {
  * @scan_objects will be made from the current reclaim context.
  */
 struct shrinker {
-	int (*shrink)(struct shrinker *, struct shrink_control *sc);
 	long (*count_objects)(struct shrinker *, struct shrink_control *sc);
 	long (*scan_objects)(struct shrinker *, struct shrink_control *sc);
 
diff --git a/include/trace/events/vmscan.h b/include/trace/events/vmscan.h
index 63cfccc..132a985 100644
--- a/include/trace/events/vmscan.h
+++ b/include/trace/events/vmscan.h
@@ -202,7 +202,7 @@ TRACE_EVENT(mm_shrink_slab_start,
 
 	TP_fast_assign(
 		__entry->shr = shr;
-		__entry->shrink = shr->shrink;
+		__entry->shrink = shr->scan_objects;
 		__entry->nr_objects_to_shrink = nr_objects_to_shrink;
 		__entry->gfp_flags = sc->gfp_mask;
 		__entry->pgs_scanned = pgs_scanned;
@@ -241,7 +241,7 @@ TRACE_EVENT(mm_shrink_slab_end,
 
 	TP_fast_assign(
 		__entry->shr = shr;
-		__entry->shrink = shr->shrink;
+		__entry->shrink = shr->scan_objects;
 		__entry->unused_scan = unused_scan_cnt;
 		__entry->new_scan = new_scan_cnt;
 		__entry->retval = shrinker_retval;
diff --git a/mm/vmscan.c b/mm/vmscan.c
index 4a602ec..81731f5 100644
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -176,14 +176,6 @@ void unregister_shrinker(struct shrinker *shrinker)
 }
 EXPORT_SYMBOL(unregister_shrinker);
 
-static inline int do_shrinker_shrink(struct shrinker *shrinker,
-				     struct shrink_control *sc,
-				     unsigned long nr_to_scan)
-{
-	sc->nr_to_scan = nr_to_scan;
-	return (*shrinker->shrink)(shrinker, sc);
-}
-
 #define SHRINK_BATCH 128
 /*
  * Call the shrink functions to age shrinkable caches
@@ -229,11 +221,8 @@ unsigned long shrink_slab(struct shrink_control *sc,
 		long batch_size = shrinker->batch ? shrinker->batch
 						  : SHRINK_BATCH;
 
-		if (shrinker->scan_objects) {
-			max_pass = shrinker->count_objects(shrinker, sc);
-			WARN_ON(max_pass < 0);
-		} else
-			max_pass = do_shrinker_shrink(shrinker, sc, 0);
+		max_pass = shrinker->count_objects(shrinker, sc);
+		WARN_ON(max_pass < 0);
 		if (max_pass <= 0)
 			continue;
 
@@ -252,7 +241,7 @@ unsigned long shrink_slab(struct shrink_control *sc,
 		if (total_scan < 0) {
 			printk(KERN_ERR
 			"shrink_slab: %pF negative objects to delete nr=%ld\n",
-			       shrinker->shrink, total_scan);
+			       shrinker->scan_objects, total_scan);
 			total_scan = max_pass;
 		}
 
@@ -286,24 +275,12 @@ unsigned long shrink_slab(struct shrink_control *sc,
 		while (total_scan >= batch_size) {
 			long ret;
 
-			if (shrinker->scan_objects) {
-				sc->nr_to_scan = batch_size;
-				ret = shrinker->scan_objects(shrinker, sc);
+			sc->nr_to_scan = batch_size;
+			ret = shrinker->scan_objects(shrinker, sc);
 
-				if (ret == -1)
-					break;
-				freed += ret;
-			} else {
-				int nr_before;
-
-				nr_before = do_shrinker_shrink(shrinker, sc, 0);
-				ret = do_shrinker_shrink(shrinker, sc,
-								batch_size);
-				if (ret == -1)
-					break;
-				if (ret < nr_before)
-					freed += nr_before - ret;
-			}
+			if (ret == -1)
+				break;
+			freed += ret;
 			count_vm_events(SLABS_SCANNED, batch_size);
 			total_scan -= batch_size;
 
-- 
1.7.10

^ permalink raw reply related	[flat|nested] 57+ messages in thread

* Re: [PATCH 17/19] drivers: convert shrinkers to new count/scan API
  2012-11-27 23:14 ` [PATCH 17/19] drivers: convert shrinkers to new count/scan API Dave Chinner
@ 2012-11-28  1:13   ` Chris Wilson
  2012-11-28  3:17     ` Dave Chinner
  2013-06-07 13:37   ` Konrad Rzeszutek Wilk
  1 sibling, 1 reply; 57+ messages in thread
From: Chris Wilson @ 2012-11-28  1:13 UTC (permalink / raw)
  To: Dave Chinner, glommer; +Cc: linux-kernel, linux-fsdevel, linux-mm, xfs

On Wed, 28 Nov 2012 10:14:44 +1100, Dave Chinner <david@fromorbit.com> wrote:
> +/*
> + * XXX: (dchinner) This is one of the worst cases of shrinker abuse I've seen.
> + *
> + * i915_gem_purge() expects a byte count to be passed, and the minimum object
> + * size is PAGE_SIZE.

No, purge() expects a count of pages to be freed. Each pass of the
shrinker therefore tries to free a minimum of 128 pages.

> The shrinker doesn't work on bytes - it works on
> + * *objects*.

And I thought you were reviewing the shrinker API to be useful where a
single object may range between 4K and 4G.

> So it passes a nr_to_scan of 128 objects, which is interpreted
> + * here to mean "free 128 bytes". That means a single object will be freed, as
> + * the minimum object size is a page.
> + *
> + * But the craziest part comes when i915_gem_purge() has walked all the objects
> + * and can't free any memory. That results in i915_gem_shrink_all() being
> + * called, which idles the GPU and frees everything the driver has in it's
> + * active and inactive lists. It's basically hitting the driver with a great big
> + * hammer because it was busy doing stuff when something else generated memory
> + * pressure. This doesn't seem particularly wise...
> + */

As opposed to triggering an OOM? The choice was between custom code for
a hopefully rare code path in a situation of last resort, or first
implementing the simplest code that stopped i915 from starving the
system of memory.
-Chris

-- 
Chris Wilson, Intel Open Source Technology Centre

^ permalink raw reply	[flat|nested] 57+ messages in thread

* Re: [PATCH 17/19] drivers: convert shrinkers to new count/scan API
  2012-11-28  1:13   ` Chris Wilson
@ 2012-11-28  3:17     ` Dave Chinner
  2012-11-28  8:21       ` Glauber Costa
  0 siblings, 1 reply; 57+ messages in thread
From: Dave Chinner @ 2012-11-28  3:17 UTC (permalink / raw)
  To: Chris Wilson; +Cc: glommer, linux-kernel, linux-fsdevel, linux-mm, xfs

On Wed, Nov 28, 2012 at 01:13:11AM +0000, Chris Wilson wrote:
> On Wed, 28 Nov 2012 10:14:44 +1100, Dave Chinner <david@fromorbit.com> wrote:
> > +/*
> > + * XXX: (dchinner) This is one of the worst cases of shrinker abuse I've seen.
> > + *
> > + * i915_gem_purge() expects a byte count to be passed, and the minimum object
> > + * size is PAGE_SIZE.
> 
> No, purge() expects a count of pages to be freed. Each pass of the
> shrinker therefore tries to free a minimum of 128 pages.

Ah, I got the shifts mixed up. I'd been looking at way too much crap
already when I saw this. But the fact this can be misunderstood says
something about the level of documentation that the code has (i.e.
none).

> > The shrinker doesn't work on bytes - it works on
> > + * *objects*.
> 
> And I thought you were reviewing the shrinker API to be useful where a
> single object may range between 4K and 4G.

Which requires rewriting all the algorithms to not be dependent on
the subsystems using a fixed size object. The shrinker control
function is called shrink_slab() for a reason - it was expected to
be used to shrink caches of fixed sized objects allocated from slab
memory.

It has no concept of the amount of memory that each object consumes,
just an idea of how much *IO* it takes to replace the object in
memory once it's been reclaimed. The DEFAULT_SEEKS value is designed to
encode the fact it generally takes 2 IOs to replace either a LRU
page or a filesystem slab object, and so balances the scanning based
on that value. i.e. the shrinker algorithms are solidly based around
fixed sized objects that have some relationship to the cost of
physical IO operations to replace them in the cache.
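
To make that concrete, the balancing in the current shrink_slab() boils
down to something like the sketch below (a paraphrase of the 3.7-era
mm/vmscan.c with the count/scan API from this series substituted in -
treat it as an illustration, not the exact code):

	/*
	 * Pressure on a shrinker is derived from how many LRU pages were
	 * just scanned, scaled by the size of its cache and divided by
	 * the seek cost of replacing an object in that cache.
	 */
	max_pass = shrinker->count_objects(shrinker, sc);

	delta = (4 * nr_pages_scanned) / shrinker->seeks;
	delta *= max_pass;
	do_div(delta, lru_pages + 1);
	total_scan += delta;

	while (total_scan >= batch_size) {
		long ret;

		sc->nr_to_scan = batch_size;
		ret = shrinker->scan_objects(shrinker, sc);
		if (ret == -1)
			break;
		freed += ret;
		total_scan -= batch_size;
	}

i.e. a cache that declares a higher seek cost gets proportionally less
scanning for the same amount of page cache pressure.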

The API change is the first step in the path to removing these built
in assumptions. The current API is just insane and any attempt to
build on it is going to be futile. The way I see this developing is
this:

	- make the shrink_slab count -> scan algorithm per node

	- add information about size of objects in the cache for
	  fixed size object caches.
		- the shrinker now has some idea of how many objects
		  need to be freed to be able to free a page of
		  memory, as well as the relative penalty for
		  replacing them.
		- tells the shrinker the size of the cache
		  in bytes so overall memory footprint of the caches
		  can be taken into account

	- add new count and scan operations for caches that are
	  based on memory used, not object counts
		- allows us to use the same count/scan algorithm for
		  calculating how much pressure to put on caches
		  with variable size objects.

My care factor mostly ends here, as it will allow XFS to correctly
balance the metadata buffer cache (variable size objects) against the
inode, dentry and dquot caches which are object based. The next
steps that I'm about to give you are based on some discussions with
some MM people over bottles of red wine, so take it with a grain of
salt...

	- calculate a "pressure" value for each cache controlled by a
	  shrinker so that the relative memory pressure between
	  caches can be compared. This allows the shrinkers to bias
	  reclaim based on where the memory pressure is being
	  generated

	- start grouping shrinkers into a hierarchy, allowing
	  related shrinkers (e.g. all the caches in a memcg) to be
	  shrunk according to resource limits that can be placed on the
	  group. i.e. memory pressure is proportioned across
	  groups rather than many individual shrinkers.

	- comments have been made to the extent that with generic
	  per-node lists and a node aware shrinker, all of the page
	  scanning could be driven by the shrinker infrastructure,
	  rather than the shrinkers being driven by how many pages
	  in the page cache just got scanned for reclaim.

	  IOWs, the main memory reclaim algorithm walks all the
	  shrinkers groups to calculate overall memory pressure,
	  calculate how much reclaim is necessary, and then
	  proportion reclaim across all the shrinker groups. i.e.
	  everything is a shrinker.

This patch set is really just the start of a long process. Balance
between the page cache and VFS/filesystem shrinkers is critical to
the efficient operation of the OS under many, many workloads, so I'm
not about to change more than one little thing at a time. This API
change is just one little step. You'll get what you want eventually,
but you're not going to get it as a first step.
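
For anyone following along at home, the "one little step" itself is
mechanical. A converted cache ends up looking something like this (the
foo_* names are made up for illustration; the contract is the one
spelled out in the shrinker.h comments later in the series):

	static long
	foo_shrink_count(struct shrinker *shrink, struct shrink_control *sc)
	{
		/* cheap count of freeable objects - no reclaim work here */
		return foo_nr_cached_objects;
	}

	static long
	foo_shrink_scan(struct shrinker *shrink, struct shrink_control *sc)
	{
		long freed = 0;

		/* free up to sc->nr_to_scan objects, report what was freed */
		while (sc->nr_to_scan-- > 0 && foo_free_one_object())
			freed++;

		return freed;	/* or -1 if no progress can be made right now */
	}

	static struct shrinker foo_shrinker = {
		.count_objects	= foo_shrink_count,
		.scan_objects	= foo_shrink_scan,
		.seeks		= DEFAULT_SEEKS,
	};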

> > + * But the craziest part comes when i915_gem_purge() has walked all the objects
> > + * and can't free any memory. That results in i915_gem_shrink_all() being
> > + * called, which idles the GPU and frees everything the driver has in it's
> > + * active and inactive lists. It's basically hitting the driver with a great big
> > + * hammer because it was busy doing stuff when something else generated memory
> > + * pressure. This doesn't seem particularly wise...
> > + */
> 
> As opposed to triggering an OOM? The choice was between custom code for
> a hopefully rare code path in a situation of last resort, or first
> implementing the simplest code that stopped i915 from starving the
> system of memory.

And when it's something else that is causing the memory pressure?
The shrinker gets called whenever something runs low on memory - it
might be called thousands of times a second so there's a very good
chance you have very little to free after the first purge has
occurred. After that you're going to idle the GPU and purge all the
memory on every single shrinker call, even though the GPU is not
generating memory pressure and not causing the shrinkers to run.
That's why it's crazy - it's got close to worst case behaviour when
the GPU is already using as little memory as possible.

Cheers,

Dave.
-- 
Dave Chinner
david@fromorbit.com

^ permalink raw reply	[flat|nested] 57+ messages in thread

* Re: [PATCH 17/19] drivers: convert shrinkers to new count/scan API
  2012-11-28  3:17     ` Dave Chinner
@ 2012-11-28  8:21       ` Glauber Costa
  2012-11-28 21:28         ` Dave Chinner
  0 siblings, 1 reply; 57+ messages in thread
From: Glauber Costa @ 2012-11-28  8:21 UTC (permalink / raw)
  To: Dave Chinner; +Cc: Chris Wilson, linux-kernel, linux-fsdevel, linux-mm, xfs

On 11/28/2012 07:17 AM, Dave Chinner wrote:
> On Wed, Nov 28, 2012 at 01:13:11AM +0000, Chris Wilson wrote:
>> On Wed, 28 Nov 2012 10:14:44 +1100, Dave Chinner <david@fromorbit.com> wrote:
>>> +/*
>>> + * XXX: (dchinner) This is one of the worst cases of shrinker abuse I've seen.
>>> + *
>>> + * i915_gem_purge() expects a byte count to be passed, and the minimum object
>>> + * size is PAGE_SIZE.
>>
>> No, purge() expects a count of pages to be freed. Each pass of the
>> shrinker therefore tries to free a minimum of 128 pages.
> 
> Ah, I got the shifts mixed up. I'd been looking at way too much crap
> already when I saw this. But the fact this can be misunderstood says
> something about the level of documentation that the code has (i.e.
> none).
> 
>>> The shrinker doesn't work on bytes - it works on
>>> + * *objects*.
>>
>> And I thought you were reviewing the shrinker API to be useful where a
>> single object may range between 4K and 4G.
> 
> Which requires rewriting all the algorithms to not be dependent on
> the subsystems using a fixed size object. The shrinker control
> function is called shrink_slab() for a reason - it was expected to
> be used to shrink caches of fixed sized objects allocated from slab
> memory.
> 
> It has no concept of the amount of memory that each object consumes,
> just an idea of how much *IO* it takes to replace the object in
> memory once it's been reclaimed. The DEFAULT_SEEKS is design to
> encode the fact it generally takes 2 IOs to replace either a LRU
> page or a filesystem slab object, and so balances the scanning based
> on that value. i.e. the shrinker algorithms are solidly based around
> fixed sized objects that have some relationship to the cost of
> physical IO operations to replace them in the cache.

One nit: It shouldn't take 2 IOs to replace a slab object, right? This
should be the cost of allocating a new page, which can contain multiple
objects.

Once the page is in, a new object should be quite cheap to come up with.

This is a very wild thought, but now that I am diving deep in the
shrinker API, and seeing things like this:

if (reclaim_state) {
    sc->nr_reclaimed += reclaim_state->reclaimed_slab;
    reclaim_state->reclaimed_slab = 0;
}

I am becoming more convinced that we should have a page-based mechanism,
like the rest of vmscan.

Also, if we are seeing pressure from someone requesting user pages, what
good does it do to free, say, 35 MB of memory, if this means we are
freeing objects across 5k different pages, without actually releasing
any of them? (still is TBD if this is a theoretical problem or a
practical one). It would maybe be better to free objects that are
moderately hot, but are on pages dominated by cold objects...


> 
> The API change is the first step in the path to removing these built
> in assumptions. The current API is just insane and any attempt to
> build on it is going to be futile. 

Amen, brother!

> The way I see this developing is
> this:
> 
> 	- make the shrink_slab count -> scan algorithm per node
> 
pages are per-node.

> 	- add information about size of objects in the cache for
> 	  fixed size object caches.
> 		- the shrinker now has some idea of how many objects
> 		  need to be freed to be able to free a page of
> 		  memory, as well as the relative penalty for
> 		  replacing them.
this is still guesswork; telling it how many pages it should free could
be a better idea.

> 		- tells the shrinker the size of the cache
> 		  in bytes so overall memory footprint of the caches
> 		  can be taken into account

> 	- add new count and scan operations for caches that are
> 	  based on memory used, not object counts
> 		- allows us to use the same count/scan algorithm for
> 		  calculating how much pressure to put on caches
> 		  with variable size objects.

IOW, pages.

> My care factor mostly ends here, as it will allow XFS to corectly
> balance the metadata buffer cache (variable size objects) against the
> inode, dentry and dquot caches which are object based. The next
> steps that I'm about to give you are based on some discussions with
> some MM people over bottles of red wine, so take it with a grain of
> salt...
> 
> 	- calculate a "pressure" value for each cache controlled by a
> 	  shrinker so that the relative memory pressure between
> 	  caches can be compared. This allows the shrinkers to bias
> 	  reclaim based on where the memory pressure is being
> 	  generated
> 

Ok, if a cache is using a lot of memory, this would indicate it has the
dominant workload, right? Should we free from it, or should we free from
the others, so this one gets the pages it needs?

> 	- start grouping shrinkers into a heirarchy, allowing
> 	  related shrinkers (e.g. all the caches in a memcg) to be
> 	  shrunk according resource limits that can be placed on the
> 	  group. i.e. memory pressure is proportioned across
> 	  groups rather than many individual shrinkers.
> 
pages are already grouped like that!

> 	- comments have been made to the extent that with generic
> 	  per-node lists and a node aware shrinker, all of the page
> 	  scanning could be driven by the shrinker infrastructure,
> 	  rather than the shrinkers being driven by how many pages
> 	  in the page cache just got scanned for reclaim.
> 
> 	  IOWs, the main memory reclaim algorithm walks all the
> 	  shrinkers groups to calculate overall memory pressure,
> 	  calculate how much reclaim is necessary, and then
> 	  proportion reclaim across all the shrinker groups. i.e.
> 	  everything is a shrinker.
> 
> This patch set is really just the start of a long process. balance
> between the page cache and VFS/filesystem shrinkers is critical to
> the efficient operation of the OS under many, many workloads, so I'm
> not about to change more than oe little thing at a time. This API
> change is just one little step. You'll get what you want eventually,
> but you're not going to get it as a first step.
> 

I have to note again that this is my first *serious* look at the
problem... but this is a summary of what I got, that fits in the context
of this particular discussion =)

I still have to go through all your other patches...

But one thing we seem to agree on is that we have quite a long road ahead

^ permalink raw reply	[flat|nested] 57+ messages in thread

* Re: [PATCH 06/19] list: add a new LRU list type
  2012-11-27 23:14 ` [PATCH 06/19] list: add a new LRU list type Dave Chinner
@ 2012-11-28 16:10   ` Christoph Hellwig
  0 siblings, 0 replies; 57+ messages in thread
From: Christoph Hellwig @ 2012-11-28 16:10 UTC (permalink / raw)
  To: Dave Chinner; +Cc: glommer, linux-kernel, linux-fsdevel, linux-mm, xfs

On Wed, Nov 28, 2012 at 10:14:33AM +1100, Dave Chinner wrote:
> From: Dave Chinner <dchinner@redhat.com>
> 
> Several subsystems use the same construct for LRU lists - a list
> head, a spin lock and and item count. They also use exactly the same
> code for adding and removing items from the LRU. Create a generic
> type for these LRU lists.
> 
> This is the beginning of generic, node aware LRUs for shrinkers to
> work with.

Can you please add kerneldoc comments for the functions, and add
symbolic constants for the possible return values from the isolate
callback?

^ permalink raw reply	[flat|nested] 57+ messages in thread

* Re: [PATCH 15/19] xfs: convert dquot cache lru to list_lru
  2012-11-27 23:14 ` [PATCH 15/19] xfs: convert dquot cache lru to list_lru Dave Chinner
@ 2012-11-28 16:17   ` Christoph Hellwig
  0 siblings, 0 replies; 57+ messages in thread
From: Christoph Hellwig @ 2012-11-28 16:17 UTC (permalink / raw)
  To: Dave Chinner; +Cc: glommer, linux-fsdevel, linux-mm, linux-kernel, xfs

> +	if (!xfs_dqflock_nowait(dqp))
> +		xfs_dqunlock(dqp);
> +		goto out_miss_busy;

This seems to miss braces.

^ permalink raw reply	[flat|nested] 57+ messages in thread

* Re: [PATCH 17/19] drivers: convert shrinkers to new count/scan API
  2012-11-28  8:21       ` Glauber Costa
@ 2012-11-28 21:28         ` Dave Chinner
  2012-11-29 10:29           ` Glauber Costa
  0 siblings, 1 reply; 57+ messages in thread
From: Dave Chinner @ 2012-11-28 21:28 UTC (permalink / raw)
  To: Glauber Costa; +Cc: Chris Wilson, linux-kernel, linux-fsdevel, linux-mm, xfs

On Wed, Nov 28, 2012 at 12:21:54PM +0400, Glauber Costa wrote:
> On 11/28/2012 07:17 AM, Dave Chinner wrote:
> > On Wed, Nov 28, 2012 at 01:13:11AM +0000, Chris Wilson wrote:
> >> On Wed, 28 Nov 2012 10:14:44 +1100, Dave Chinner <david@fromorbit.com> wrote:
> >>> +/*
> >>> + * XXX: (dchinner) This is one of the worst cases of shrinker abuse I've seen.
> >>> + *
> >>> + * i915_gem_purge() expects a byte count to be passed, and the minimum object
> >>> + * size is PAGE_SIZE.
> >>
> >> No, purge() expects a count of pages to be freed. Each pass of the
> >> shrinker therefore tries to free a minimum of 128 pages.
> > 
> > Ah, I got the shifts mixed up. I'd been looking at way too much crap
> > already when I saw this. But the fact this can be misunderstood says
> > something about the level of documentation that the code has (i.e.
> > none).
> > 
> >>> The shrinker doesn't work on bytes - it works on
> >>> + * *objects*.
> >>
> >> And I thought you were reviewing the shrinker API to be useful where a
> >> single object may range between 4K and 4G.
> > 
> > Which requires rewriting all the algorithms to not be dependent on
> > the subsystems using a fixed size object. The shrinker control
> > function is called shrink_slab() for a reason - it was expected to
> > be used to shrink caches of fixed sized objects allocated from slab
> > memory.
> > 
> > It has no concept of the amount of memory that each object consumes,
> > just an idea of how much *IO* it takes to replace the object in
> > memory once it's been reclaimed. The DEFAULT_SEEKS is design to
> > encode the fact it generally takes 2 IOs to replace either a LRU
> > page or a filesystem slab object, and so balances the scanning based
> > on that value. i.e. the shrinker algorithms are solidly based around
> > fixed sized objects that have some relationship to the cost of
> > physical IO operations to replace them in the cache.
> 
> One nit: It shouldn't take 2IOs to replace a slab object, right?
> objects.

A random dentry in a small directory will take one IO to read the
inode, then another to read the block the dirent sits in. To read an
inode from a cached dentry will generally take one IO to read the
inode, and another to read related, out-of-inode information (e.g.
attributes or extent/block maps). Sometimes it will only take one IO,
sometimes it might take 3 or, in the case of dirents, could take
hundreds of IOs if the directory structure is large enough.

So a default of 2 seeks to replace any single dentry/inode in the
cache is a pretty good default to use.
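
For reference, the constant being discussed is just a small integer in
include/linux/shrinker.h - from memory it is simply:

	#define DEFAULT_SEEKS 2		/* ~2 IOs to repopulate an object */

(the comment is my paraphrase). Subsystems that are cheaper or more
expensive to repopulate scale it, which is what the staging drivers
earlier in the series are doing with things like
".seeks = DEFAULT_SEEKS * 16".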

> This
> should be the cost of allocating a new page, that can contain, multiple
> Once the page is in, a new object should be quite cheap to come up with.

It's not the cost of allocating the page (a couple of microseconds)
that is being considered - it's the 3-4 orders of magnitude worse cost
of reading the object from disk (could be 20ms). The slab/page
allocation is lost in the noise compared to the time it takes to
fill the page cache page with data or a single slab object.
Essentially, slab pages with multiple objects in them are much more
expensive to replace in the cache than a page cache page....

> This is a very wild thought, but now that I am diving deep in the
> shrinker API, and seeing things like this:
> 
> if (reclaim_state) {
>     sc->nr_reclaimed += reclaim_state->reclaimed_slab;
>     reclaim_state->reclaimed_slab = 0;
> }

That's not part of the shrinker - that's part of the vmscan
code, external to the shrinker infrastructure. It's getting
information back from the slab caches behind the shrinkers, and it's
not the full picture because many shrinkers are not backed by slab
caches. It's a workaround for not having accurate feedback from
the shrink_slab() code about how many pages were freed.

Essentially, the problem is an impedance mismatch between the way
the LRUs are scanned/balanced (in pages) and slab caches are managed
(by objects). That's what needs unifying...

> I am becoming more convinced that we should have a page-based mechanism,
> like the rest of vmscan.

Been thought about and considered before. Would you like to rewrite
the slab code?

> Also, if we are seeing pressure from someone requesting user pages, what
> good does it make to free, say, 35 Mb of memory, if this means we are
> freeing objects across 5k different pages, without actually releasing
> any of them? (still is TBD if this is a theoretical problem or a
> practical one). It would maybe be better to free objects that are
> moderately hot, but are on pages dominated by cold objects...

Yup, that's a problem, but now you're asking shrinker
implementations to know in great detail the physical locality of
objects and not just the temporal locality. The node-aware LRU list
does this at a coarse level, but to do page based reclaim you need
to track pages in SL*B that contain unreferenced objects, as those
are the only ones that can be reclaimed.

If you have no pages with unreferenced objects, then you don't make
progress and you have to fall back to freeing unreferenced objects
from random pages. And under most workloads that aren't benchmarks,
slab object population ends up with little correlation between
physical and temporal locality. Hence this is the norm rather than
the exception..

Also, handling physical locality in this way means we'd need to tie
the shrinker deep into the SLAB/SLUB/SLOB implementation that is
being used to allocate the objects..

There have been various attempts at this sort of thing in the past.
e.g:

http://marc.info/?l=linux-mm&m=112810938004047

or for slab defragmentation:

https://lkml.org/lkml/2010/1/29/332

and more on LRUs in slab caches and general shrinker design in that
thread (definitely worth reading this, at least):

https://lkml.org/lkml/2010/2/2/499

And it's made far more complex by the fact that some shrinkers don't
necessarily free the objects they are working on. e.g. the VFS inode
cache shrinker basically hands objects to XFS, and the XFS inode
cache takes over from there (via the XFS inode cache shrinker) to
free the objects. i.e. two shrinkers act on the same structure...

> > The API change is the first step in the path to removing these built
> > in assumptions. The current API is just insane and any attempt to
> > build on it is going to be futile. 
> 
> Amen, brother!
> 
> > The way I see this developing is
> > this:
> > 
> > 	- make the shrink_slab count -> scan algorithm per node
> > 
> pages are per-node.
> 
> > 	- add information about size of objects in the cache for
> > 	  fixed size object caches.
> > 		- the shrinker now has some idea of how many objects
> > 		  need to be freed to be able to free a page of
> > 		  memory, as well as the relative penalty for
> > 		  replacing them.
> this is still guesswork, telling how many pages it should free, could
> be a better idea.
> 
> > 		- tells the shrinker the size of the cache
> > 		  in bytes so overall memory footprint of the caches
> > 		  can be taken into account
> 
> > 	- add new count and scan operations for caches that are
> > 	  based on memory used, not object counts
> > 		- allows us to use the same count/scan algorithm for
> > 		  calculating how much pressure to put on caches
> > 		  with variable size objects.
> 
> IOW, pages.

Not necessarily - if we are going to deal with multiple objects in a
page as well as multi-page objects, then pages are unable to express
the full range of possibilities. We may as well use byte counts at
this point, especially when you consider page sizes differ on
different platforms...

> > My care factor mostly ends here, as it will allow XFS to corectly
> > balance the metadata buffer cache (variable size objects) against the
> > inode, dentry and dquot caches which are object based. The next
> > steps that I'm about to give you are based on some discussions with
> > some MM people over bottles of red wine, so take it with a grain of
> > salt...
> > 
> > 	- calculate a "pressure" value for each cache controlled by a
> > 	  shrinker so that the relative memory pressure between
> > 	  caches can be compared. This allows the shrinkers to bias
> > 	  reclaim based on where the memory pressure is being
> > 	  generated
> > 
> 
> Ok, if a cache is using a lot of memory, this would indicate it has the
> dominant workload, right?

Not necessarily. Someone might just have run a find across their
filesystem, and that is where all the pressure is coming from. In
this case, you don't want that memory pressure to toss out all the
other caches. I suspect that the "pressure" measure is going to need
to take into account cache hit rates to work properly...

> Should we free from it, or should we free from
> the others, so this ones gets the pages it needs?

That's the million dollar question...

> > 	- start grouping shrinkers into a heirarchy, allowing
> > 	  related shrinkers (e.g. all the caches in a memcg) to be
> > 	  shrunk according resource limits that can be placed on the
> > 	  group. i.e. memory pressure is proportioned across
> > 	  groups rather than many individual shrinkers.
> > 
> pages are already grouped like that!

But shrinkers and slab caches are not.

Besides, once you start grouping shrinkers, why should we treat the
page LRU list scanning any differently from any other cache that has
a shrinker? 

> But one thing we seem to agree is that we have quite a long road ahead

Definitely.

Cheers,

Dave.
-- 
Dave Chinner
david@fromorbit.com

^ permalink raw reply	[flat|nested] 57+ messages in thread

* Re: [PATCH 17/19] drivers: convert shrinkers to new count/scan API
  2012-11-28 21:28         ` Dave Chinner
@ 2012-11-29 10:29           ` Glauber Costa
  2012-11-29 22:02             ` Dave Chinner
  0 siblings, 1 reply; 57+ messages in thread
From: Glauber Costa @ 2012-11-29 10:29 UTC (permalink / raw)
  To: Dave Chinner; +Cc: Chris Wilson, linux-kernel, linux-fsdevel, linux-mm, xfs

On 11/29/2012 01:28 AM, Dave Chinner wrote:
> On Wed, Nov 28, 2012 at 12:21:54PM +0400, Glauber Costa wrote:
>> On 11/28/2012 07:17 AM, Dave Chinner wrote:
>>> On Wed, Nov 28, 2012 at 01:13:11AM +0000, Chris Wilson wrote:
>>>> On Wed, 28 Nov 2012 10:14:44 +1100, Dave Chinner <david@fromorbit.com> wrote:
>>>>> +/*
>>>>> + * XXX: (dchinner) This is one of the worst cases of shrinker abuse I've seen.
>>>>> + *
>>>>> + * i915_gem_purge() expects a byte count to be passed, and the minimum object
>>>>> + * size is PAGE_SIZE.
>>>>
>>>> No, purge() expects a count of pages to be freed. Each pass of the
>>>> shrinker therefore tries to free a minimum of 128 pages.
>>>
>>> Ah, I got the shifts mixed up. I'd been looking at way too much crap
>>> already when I saw this. But the fact this can be misunderstood says
>>> something about the level of documentation that the code has (i.e.
>>> none).
>>>
>>>>> The shrinker doesn't work on bytes - it works on
>>>>> + * *objects*.
>>>>
>>>> And I thought you were reviewing the shrinker API to be useful where a
>>>> single object may range between 4K and 4G.
>>>
>>> Which requires rewriting all the algorithms to not be dependent on
>>> the subsystems using a fixed size object. The shrinker control
>>> function is called shrink_slab() for a reason - it was expected to
>>> be used to shrink caches of fixed sized objects allocated from slab
>>> memory.
>>>
>>> It has no concept of the amount of memory that each object consumes,
>>> just an idea of how much *IO* it takes to replace the object in
>>> memory once it's been reclaimed. The DEFAULT_SEEKS is design to
>>> encode the fact it generally takes 2 IOs to replace either a LRU
>>> page or a filesystem slab object, and so balances the scanning based
>>> on that value. i.e. the shrinker algorithms are solidly based around
>>> fixed sized objects that have some relationship to the cost of
>>> physical IO operations to replace them in the cache.
>>
>> One nit: It shouldn't take 2IOs to replace a slab object, right?
>> objects.
> 
> A random dentry in a small directory will take on IO to read the
> inode, then another to read the block the dirent sits in. TO read an
> inode froma cached dentry will generally take one IO to read the
> inode, and another to read related, out of inode information (e.g.
> attributes or extent/block maps). Sometimes it will only take on IO,
> sometimes it might take 3 or, in the case of dirents, coult take
> hundreds of IOs if the directory structure is large enough.
> 
> So a default of 2 seeks to replace any single dentry/inode in the
> cache is a pretty good default to use.
> 
>> This
>> should be the cost of allocating a new page, that can contain, multiple
>> Once the page is in, a new object should be quite cheap to come up with.
> 

Indeed. More on this in the next paragraph...

> It's not the cost of allocating the page (a couple of microseconds)
> that is being considered - it the 3-4 orders of magnitude worse cost
> of reading the object from disk (could be 20ms). The slab/page
> allocation is lost in the noise compared to the time it takes to
> fill the page cache page with data or a single slab object.
> Essentially, slab pages with multiple objects in them are much more
> expensive to replace in the cache than a page cache page....
> 
>> This is a very wild thought, but now that I am diving deep in the
>> shrinker API, and seeing things like this:
>>
>> if (reclaim_state) {
>>     sc->nr_reclaimed += reclaim_state->reclaimed_slab;
>>     reclaim_state->reclaimed_slab = 0;
>> }
> 
> That's not part of the shrinker - that's part of the vmscan
> code, external to the shrinker infrastructure. It's getting
> information back from the slab caches behind the shrinkers, and it's
> not the full picture because many shrinkers are not backed by slab
> caches. It's a work around for not not having accurate feedback from
> the shrink_slab() code about how many pages were freed.
> 
I know it is not part of the shrinkers, and that is precisely my point.
vmscan needs to go through these kinds of hacks because our API is not
strong enough to just give it back the answer that matters to the caller.

> Essentially, the problem is an impedance mismatch between the way
> the LRUs are scanned/balanced (in pages) and slab caches are managed
> (by objects). That's what needs unifying...
> 
So read my statement again, Dave: this is precisely what I am advocating!

The fact that you are so much more concerned with bringing the dentries
back from disk is just an obvious consequence of your FS background. The
problem I was more concerned with is when a user needs to allocate a page
for whatever reason. We're short on pages, and then we shrink. But the
shrink gives us nothing. If this is a user-page driven workload, it
would be better to do this than to get rid of user pages - which we
may end up doing if the shrinkers do not release enough pages. This is
in contrast with a dcache-driven workload, where what you are saying
makes total sense.

In fact, those goals are so orthogonal that I wonder if it wouldn't be
worth it to introduce some kind of pressure measurement to determine if
the workload is kernel-driven or userpages-driven.

I am right now experimenting with something (dcache only for starters)
like this (simplified version):

static void dentry_lru_add(struct dentry *dentry)
{
        if (!dentry->score) {
                struct page *page;
                page = virt_to_head_page(dentry);
                dentry->score = score();
                page->slab_score_page += dentry->score;
        }
}

static void __dentry_lru_del(struct dentry *dentry)
{
        struct page *page;
        page = virt_to_head_page(dentry);

        dentry->d_flags &= ~DCACHE_SHRINK_LIST;
        page->slab_score_page += dentry->score;
}


score decreases as time passes. So if a page has a very large score,
it means that it is likely to have a lot of objects that are somewhat
old. When we scan, we start from them. If this is a purely kernel driven
workload, we only delete the objects that are, themselves, old. If this is a
user-driven workload, we try to take down the rest as well. With a
grey-area in the middle, maybe...

In fact, this may (I am not bold enough to state anything as certain at
this point!!) free a lot *fewer* dentries than the normal shrinkers
would, while still being able to give userspace what it really
cares about: pages!

The big advantage of scanning through pages is that it becomes trivial
to either walk it per-zone, or per-memcg, since you can trivially derive
the set of pages that belong to each of them.

The determination of user vs kernel driven workloads can also be done
per-{memcg,zone}. So we can conceivably have a {memcg,zone} that is
receiving a lot of kernel-objects pressure, and another that is
receiving a lot of user-driven pressure, due to the different nature of
their workloads. We would shrink them differently, in this case.

>> I am becoming more convinced that we should have a page-based mechanism,
>> like the rest of vmscan.
> 
> Been thought about and consiered before. Would you like to rewrite
> the slab code?
> 
It would be fun!

>> Also, if we are seeing pressure from someone requesting user pages, what
>> good does it make to free, say, 35 Mb of memory, if this means we are
>> freeing objects across 5k different pages, without actually releasing
>> any of them? (still is TBD if this is a theoretical problem or a
>> practical one). It would maybe be better to free objects that are
>> moderately hot, but are on pages dominated by cold objects...
> 
> Yup, that's a problem, but now you're asking shrinker
> implementations to know  in great detail the physical locality of
> object and not just the temporal locality.  the node-aware LRU list
> does this at a coarse level, but to do page based reclaim you need
> ot track pages in SL*B that contain unreferenced objects as those
> are the only ones that can be reclaimed.
> 
Yes, this is the part that I am still struggling with. What I am
currently trying is to come up with a stable SL*B api that will allow us
to know which objects are alive in a given page, and how to walk them.

Combined with the score mechanism above, we can define minimum scores
for an object to be freed - which relates to how long ago it was marked
- and then free the objects in the pages that have a high enough score. If
pressure keeps up, we lower the threshold. Do this n times.
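
In pseudo-code, with entirely made-up helpers for walking the live
objects in a slab page (no such API exists yet - that is exactly what
I am trying to work out), the scan would be something like:

	static long shrink_dcache_by_page_score(long nr_to_free)
	{
		long freed = 0;
		int threshold = SCORE_START_THRESHOLD;	/* made up */

		while (freed < nr_to_free && threshold > 0) {
			struct page *page;
			struct dentry *dentry;

			/* hypothetical: walk slab pages, highest score first */
			for_each_slab_page_by_score(page, threshold) {
				/* hypothetical: walk live objects in the page */
				for_each_live_dentry(page, dentry) {
					if (!dentry_is_cold(dentry, threshold))
						continue;
					evict_one_dentry(dentry);
					freed++;
				}
			}
			/* pressure is still there: be less picky next pass */
			threshold--;
		}
		return freed;
	}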

> If you have no pages with unreferenced objects, then you don't make
> progress and you have to fall back to freeing unreferenced objects
> from random pages. ANd under most workloads that aren't benchmarks,
> slab object population ends up with little correlation between
> physical and temporal locality. Hence this is the norm rather than
> the exception..
> 

More or less.

Last time I counted, a dentry had 248 bytes. Let's say 256 for
simplicity. So each page will have around 16 dentries (it is actually
usually a bit less than that, due to slab metadata).

In a normal shrink, how many objects do you expect to free? At some
point, as your scan rate grows, the likelihood of having a bunch of them
in the same page increases, even if not all objects in that page are
cold. But still, we can easily scan the objects in a page with a high
score, and find a bunch of objects that are cold. If it is a
kernel-driven workload, we move on to the next page when they are out.
If it is user driven, it may be worth it to free the others, and give
the user what it wants - a page. Leaving all other thousands of dentries
untouched.


> Also, handling physical locality in this way means we'd need to tie
> the shrinker deep into the SLAB/SLUB/SLOB implementation that is
> being used to allocate the objects..
> 
> There have been various attempts at this sort of thing in the past.
> e.g:
> 
> http://marc.info/?l=linux-mm&m=112810938004047
> 
> or for slab defragmentation:
> 
> https://lkml.org/lkml/2010/1/29/332
> 
> and more on LRUs in slab caches and general shrinker design in that
> thread (definitely worth reading this, at least):
> 
> https://lkml.org/lkml/2010/2/2/499
> 

Thanks for the pointers. I will read them right now. A lot of good ideas
end up being killed by the unseen details, so if I radically change my
mind, I will let you know =)

> And it's made far more complex by the fact that some shrinkers don't
> necessarily free the objects they are working on. e.g. the VFS inode
> cache shrinker basically hands objects to XFS, and the XFS inode
> cache takes over from there (via the XFS inode cache shrinker) to
> free the objects. i.e. two shrinkers act on the same structure...
>
I will dig into this as well. Right now, I don't really understand it.

>>> The API change is the first step in the path to removing these built
>>> in assumptions. The current API is just insane and any attempt to
>>> build on it is going to be futile. 
>>
>> Amen, brother!
>>
>>> The way I see this developing is
>>> this:
>>>
>>> 	- make the shrink_slab count -> scan algorithm per node
>>>
>> pages are per-node.
>>
>>> 	- add information about size of objects in the cache for
>>> 	  fixed size object caches.
>>> 		- the shrinker now has some idea of how many objects
>>> 		  need to be freed to be able to free a page of
>>> 		  memory, as well as the relative penalty for
>>> 		  replacing them.
>> this is still guesswork; telling it how many pages it should free could
>> be a better idea.
>>
>>> 		- tells the shrinker the size of the cache
>>> 		  in bytes so overall memory footprint of the caches
>>> 		  can be taken into account
>>
>>> 	- add new count and scan operations for caches that are
>>> 	  based on memory used, not object counts
>>> 		- allows us to use the same count/scan algorithm for
>>> 		  calculating how much pressure to put on caches
>>> 		  with variable size objects.
>>
>> IOW, pages.
> 
> Not necessarily - if we are going to deal with multiple objects in a
> page as well as multi-page objects, then pages are unable to express
> the full range of possibilities. We may as well use byte counts at
> this point, expecially when you consider page sizes differ on
> different platforms...
> 

Byte counts by themselves don't seem that bad to me. It also makes
sense. But I would still note that for that memory to be useful, it must
be in consumable units. What is a consumable unit depends on where the
pressure comes from.

An object is a consumable unit, so if the pressure is coming from the
dentry cache, it is great to free 4000 consumable units - objects. If it
is coming from the page cache, it would be better to free as many pages
as we can, because that is what can be used.

>>> My care factor mostly ends here, as it will allow XFS to correctly
>>> balance the metadata buffer cache (variable size objects) against the
>>> inode, dentry and dquot caches which are object based. The next
>>> steps that I'm about to give you are based on some discussions with
>>> some MM people over bottles of red wine, so take it with a grain of
>>> salt...
>>>
>>> 	- calculate a "pressure" value for each cache controlled by a
>>> 	  shrinker so that the relative memory pressure between
>>> 	  caches can be compared. This allows the shrinkers to bias
>>> 	  reclaim based on where the memory pressure is being
>>> 	  generated
>>>
>>
>> Ok, if a cache is using a lot of memory, this would indicate it has the
>> dominant workload, right?
> 
> Not necessarily. Someone might just have run a find across their
> filesystem, and that is where all the pressure is coming from. In
> this case, you don't want that memory pressure to toss out all the
> other caches. I suspect that the "pressure" measure is going to need
> to take into account cache hit rates to work properly...
> 
Indeed you are right, memory usage is a very bad estimate. Let me stand
corrected: memory usage over a sliding time window, or something like that.

>> Should we free from it, or should we free from
>> the others, so this ones gets the pages it needs?
> 
> That's the million dollar question...
> 

For god's sake, it's time we stop this bullshit of telling people
"that's the million dollar question". Those fucktards are printing money
like crazy, so we need to correct it for inflation, at the very least.

This is the USD 1,289,648.82 question, using Y2K as a baseline.


>>> 	- start grouping shrinkers into a hierarchy, allowing
>>> 	  related shrinkers (e.g. all the caches in a memcg) to be
>>> 	  shrunk according resource limits that can be placed on the
>>> 	  group. i.e. memory pressure is proportioned across
>>> 	  groups rather than many individual shrinkers.
>>>
>> pages are already grouped like that!
> 
> But shrinkers and slab caches are not.
> 
> Besides, once you start grouping shrinkers, why should we treat the
> page LRU list scanning any differently from any other cache that has
> a shrinker? 
> 

Dunno. Should we?


^ permalink raw reply	[flat|nested] 57+ messages in thread

* Re: [RFC, PATCH 00/19] Numa aware LRU lists and shrinkers
  2012-11-27 23:14 [RFC, PATCH 00/19] Numa aware LRU lists and shrinkers Dave Chinner
                   ` (18 preceding siblings ...)
  2012-11-27 23:14 ` [PATCH 19/19] shrinker: Kill old ->shrink API Dave Chinner
@ 2012-11-29 19:02 ` Andi Kleen
  2012-11-29 22:09   ` Dave Chinner
  2012-12-20 11:45 ` Glauber Costa
  2013-01-21 16:08 ` Glauber Costa
  21 siblings, 1 reply; 57+ messages in thread
From: Andi Kleen @ 2012-11-29 19:02 UTC (permalink / raw)
  To: Dave Chinner; +Cc: glommer, linux-kernel, linux-fsdevel, linux-mm, xfs

Dave Chinner <david@fromorbit.com> writes:
>
> Comments, thoughts and flames all welcome.

Doing the reclaim per CPU sounds like a big change in the VM balance. 
Doesn't this invalidate some zone reclaim mode settings?
How did you validate all this?

-Andi

-- 
ak@linux.intel.com -- Speaking for myself only


^ permalink raw reply	[flat|nested] 57+ messages in thread

* Re: [PATCH 17/19] drivers: convert shrinkers to new count/scan API
  2012-11-29 10:29           ` Glauber Costa
@ 2012-11-29 22:02             ` Dave Chinner
  0 siblings, 0 replies; 57+ messages in thread
From: Dave Chinner @ 2012-11-29 22:02 UTC (permalink / raw)
  To: Glauber Costa; +Cc: Chris Wilson, linux-kernel, linux-fsdevel, linux-mm, xfs

On Thu, Nov 29, 2012 at 02:29:33PM +0400, Glauber Costa wrote:
> On 11/29/2012 01:28 AM, Dave Chinner wrote:
> > On Wed, Nov 28, 2012 at 12:21:54PM +0400, Glauber Costa wrote:
> >> On 11/28/2012 07:17 AM, Dave Chinner wrote:
> >>> On Wed, Nov 28, 2012 at 01:13:11AM +0000, Chris Wilson wrote:
> >>>> On Wed, 28 Nov 2012 10:14:44 +1100, Dave Chinner <david@fromorbit.com> wrote:
> >>>>> The shrinker doesn't work on bytes - it works on
> >>>>> + * *objects*.
> >>>>
> >>>> And I thought you were reviewing the shrinker API to be useful where a
> >>>> single object may range between 4K and 4G.
> >>>
> >>> Which requires rewriting all the algorithms to not be dependent on
> >>> the subsystems using a fixed size object. The shrinker control
> >>> function is called shrink_slab() for a reason - it was expected to
> >>> be used to shrink caches of fixed sized objects allocated from slab
> >>> memory.
> >>>
> >>> It has no concept of the amount of memory that each object consumes,
> >>> just an idea of how much *IO* it takes to replace the object in
> >>> memory once it's been reclaimed. The DEFAULT_SEEKS is designed to
> >>> encode the fact it generally takes 2 IOs to replace either a LRU
> >>> page or a filesystem slab object, and so balances the scanning based
> >>> on that value. i.e. the shrinker algorithms are solidly based around
> >>> fixed sized objects that have some relationship to the cost of
> >>> physical IO operations to replace them in the cache.
> >>
> >> One nit: It shouldn't take 2 IOs to replace a slab object, right?
> > 
> > A random dentry in a small directory will take one IO to read the
> > inode, then another to read the block the dirent sits in. To read an
> > inode from a cached dentry will generally take one IO to read the
> > inode, and another to read related, out of inode information (e.g.
> > attributes or extent/block maps). Sometimes it will only take one IO,
> > sometimes it might take 3 or, in the case of dirents, could take
> > hundreds of IOs if the directory structure is large enough.
> > 
> > So a default of 2 seeks to replace any single dentry/inode in the
> > cache is a pretty good default to use.
> > 
> >> This should be the cost of allocating a new page, that can contain
> >> multiple objects. Once the page is in, a new object should be quite
> >> cheap to come up with.
> > 
> 
> Indeed. More on this in the next paragraph...

I'm not sure what you are trying to say here. Are you saying that
you think that the IO cost for replacing a slab cache object doesn't
matter?

> > It's not the cost of allocating the page (a couple of microseconds)
> > that is being considered - it the 3-4 orders of magnitude worse cost
> > of reading the object from disk (could be 20ms). The slab/page
> > allocation is lost in the noise compared to the time it takes to
> > fill the page cache page with data or a single slab object.
> > Essentially, slab pages with multiple objects in them are much more
> > expensive to replace in the cache than a page cache page....
> > 
> >> This is a very wild thought, but now that I am diving deep in the
> >> shrinker API, and seeing things like this:
> >>
> >> if (reclaim_state) {
> >>     sc->nr_reclaimed += reclaim_state->reclaimed_slab;
> >>     reclaim_state->reclaimed_slab = 0;
> >> }
> > 
> > That's not part of the shrinker - that's part of the vmscan
> > code, external to the shrinker infrastructure. It's getting
> > information back from the slab caches behind the shrinkers, and it's
> > not the full picture because many shrinkers are not backed by slab
> > caches. It's a work around for not having accurate feedback from
> > the shrink_slab() code about how many pages were freed.
> > 
> I know it is not part of the shrinkers, and that is precisely my point.
> vmscan needs to go through this kinds of hacks because our API is not
> strong enough to just give it back the answer that matters to the caller.

What matters is that the slab caches are shrunk in proportion to the
page cache. i.e. balanced reclaim. For dentry and inode caches, what
matters is the number of objects reclaimed because the shrinker
algorithm balances based on the relative cost of object replacement
in the cache.

e.g. if you have 1000 pages in the page LRUs, and 1000 objects in
the dentry cache, each takes 1 IO to replace, then if you reclaim 2
pages from the page LRUs and 2 pages from the dentry cache, it will
take 2 IOs to replace the pages in the LRU, but 36 IOs to replace
the objects in the dentry cache that were reclaimed.

This is why the shrinker balances "objects scanned" vs "LRU pages
scanned" - it treats each page as an object and the shrinker relates
that to the relative cost of objects in the slab cache being
reclaimed. i.e. the focus is on keeping a balance between caches,
not reclaiming an absolute number of pages.
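
To make that concrete, the scan target is currently derived roughly like
this - paraphrased from memory, so treat the exact constants as
illustrative rather than gospel:

/*
 * Roughly what shrink_slab() does today to balance a cache against
 * the page LRUs.  The scan target is proportional to the fraction of
 * the page LRUs we just scanned, scaled by the cache size and the
 * replacement cost encoded in ->seeks.
 */
static unsigned long shrinker_scan_target(unsigned long nr_pages_scanned,
					  unsigned long lru_pages,
					  unsigned long cache_objects,
					  int seeks,
					  unsigned long nr_deferred)
{
	unsigned long delta;

	delta = (4 * nr_pages_scanned) / seeks;	/* page LRU pressure */
	delta *= cache_objects;			/* scale by cache size */
	delta /= lru_pages + 1;			/* fraction of LRUs scanned */

	return nr_deferred + delta;		/* plus work deferred earlier */
}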

Note: I'm not saying this is perfect, what I'm trying to do is let
you know the "why" behind the current algorithm. i.e. why it mostly
works and why ignoring the principles behind why it works is
fraught with danger...

> > Essentially, the problem is an impedance mismatch between the way
> > the LRUs are scanned/balanced (in pages) and slab caches are managed
> > (by objects). That's what needs unifying...
> > 
> So read my statement again, Dave: this is precisely what I am advocating!

Not clearly enough, obviously :/

> The fact that you are so more concerned with bringing the dentries back
> from disk is just an obvious consequence of your FS background.

No, it's a solution to a systemic problem that Andrew Morton
identified way back in the 2.5 days, and that resulted in him
implementing the shrinker infrastructure to directly control the
sizes of the inode and dentry caches. IOWs, the historic reason for
having the shrinkers is to balance the dentry/inode caches against
the page cache size to solve performance problems caused by cache
imbalance.

> The
> problem I was more concerned with is when a user needs to allocate a page
> for whatever reason. We're short on pages, and then we shrink. But the
> shrink gives us nothing. If this is a user-page driven workload, it
> would be better to do this than to get rid of user pages - which we
> may end up doing if the shrinkers do not release enough pages. This is
> in contrast with a dcache-driven workload, where what you are saying
> makes total sense.

What you are ignoring is that the dcache is a global resource.
Trashing the dcache because a user wants anonymous memory causes
system/memcg-wide performance degradation, instead of just the
single user application being slowed down. It is preferable that the
user application demanding memory is penalised, not the whole
system.

> In fact, those goals are so orthogonal, that I wonder if it wouldn't be
> worth it to introduce some kind of pressure measurement to determine if
> the workload is kernel-driven or userpages-driven.

Dentry cache growth is user driven - the kernel doesn't walk
directories or open files itself - and is a direct representation of
the recent set of the userspace directory/file accesses. So the
whole concept of "treat the dcache as a kernel driven cache" is
fundamentally flawed.

> I am right now experimenting with something (dcache only for starters)
> like this (simplified version):
> 
> static void dentry_lru_add(struct dentry *dentry)
> {
>         if (!dentry->score) {
>                 struct page *page;
>                 page = virt_to_head_page(dentry);
>                 page->slab_score_page += score;
>                 dentry->score = score();
>         }
> }
> 
> static void __dentry_lru_del(struct dentry *dentry)
> {
>         struct page *page;
>         page = virt_to_head_page(dentry);
> 
>         dentry->d_flags &= ~DCACHE_SHRINK_LIST;
>         page->slab_score_page += dentry->score();
> }

slab_score_page only ever increases according to this code.... :/

> score decreases as time passes.  So if a page has a very large
> score, it means that it is likely to have a lot of objects that
> are somewhat old.

And if score decreases over time, then a small score means the
entries on the page are older than a page with a large score, right?

> When we scan, we start from them. If this is a purely kernel driven
> workload, we only delete the objects that are, itself, old. If this is a
> user-driven workload, we try to take down the rest as well. With a
> grey-area in the middle, maybe...
> 
> In fact, this may (I am not bold enough to state anything as certain at
> this point!!)  free a lot *fewer* dentries than the normal shrinkers

The question is this: does it free the right ones? If you have a
single hot dentry on a page full of cold ones, freeing that hot
dentry is the wrong thing to do as it will immediately trigger IO to
bring it back into cache. This is the real problem with page based
slab reclaim....

> would, while still being able to give userspace users what it really
> cares about: pages!

Userspace doesn't care about how the kernel reclaims, tracks or
accounts for memory. All it cares about is having a memory
allocation succeed or fail quickly.

> The big advantage of scanning through pages, is that it becomes trivial
> to either walk it per-zone, or per-memcg, since you can trivially derive
> the set of pages that belong to each of them.

But you really don't have a viable LRU ordering because pages with
mixed hot/cold objects don't look that much different from pages
with all mildly cold objects...

> The determination of user vs kernel driven workloads can also be done
> per-{memcg,zone}. So we can conceivably have a {memcg,zone} that is
> receiving a lot of kernel-objects pressure, and another that is
> receiving a lot of user-driven pressure, due to the different nature of
> their workloads. We would shrink them differently, in this case.

I see that whole concept as fundamentally flawed. Maybe I
misunderstand what you mean by "kernel vs user driven workloads" as
you haven't really defined it, but I still see the idea of treating
kernel vs userspace driven memory allocation differently as a vector
for insanity. And, in most cases, ensuring kernel memory
allocation succeeds is far, far more important than userspace memory
allocation, so I think that even on this basis this is a
non-starter....

> >> Also, if we are seeing pressure from someone requesting user pages, what
> >> good does it make to free, say, 35 Mb of memory, if this means we are
> >> freeing objects across 5k different pages, without actually releasing
> >> any of them? (still is TBD if this is a theoretical problem or a
> >> practical one). It would maybe be better to free objects that are
> >> moderately hot, but are on pages dominated by cold objects...
> > 
> > Yup, that's a problem, but now you're asking shrinker
> > implementations to know in great detail the physical locality of
> > objects and not just the temporal locality. The node-aware LRU list
> > does this at a coarse level, but to do page based reclaim you need
> > to track pages in SL*B that contain unreferenced objects as those
> > are the only ones that can be reclaimed.
> > 
> Yes, this is the part that I am still struggling with. What I am
> currently trying is to come up with a stable SL*B api that will allow us
> to know which objects are alive in a given page, and how to walk them.
> 
> Combined with the score mechanism above, we can define minimum scores
> for an object to be freed - which relates to how long ago it was marked
> - and then free the objects in pages that have a high enough score. If
> pressure persists, we lower the threshold. Do this n times.
> 
> > If you have no pages with unreferenced objects, then you don't make
> > progress and you have to fall back to freeing unreferenced objects
> > from random pages. And under most workloads that aren't benchmarks,
> > slab object population ends up with little correlation between
> > physical and temporal locality. Hence this is the norm rather than
> > the exception..
> > 
> 
> More or less.
> 
> Last time I counted, a dentry had 248 bytes. Let's say 256 for
> simplicity. So each page will have around 16 dentries (it is actually
> usually a bit less than that, due to slab metadata).

IIRC, it's 192 bytes. It was carefully constructed to be exactly
3 cachelines in size for x86_64 when you don't have any lock
debugging turned on.


> In a normal shrink, how many objects do you expect to free ? At some
> point, as your scan rate grows, the likelihood of having a bunch of them
> in the same page increases, even if not all objects in that page are
> cold.  But still, we can easily scan the objects in a page with a high
> score, and find a bunch of objects that are cold. If it is a

How are you scanning those pages?

> >>> 	- add new count and scan operations for caches that are
> >>> 	  based on memory used, not object counts
> >>> 		- allows us to use the same count/scan algorithm for
> >>> 		  calculating how much pressure to put on caches
> >>> 		  with variable size objects.
> >>
> >> IOW, pages.
> > 
> > Not necessarily - if we are going to deal with multiple objects in a
> > page as well as multi-page objects, then pages are unable to express
> > the full range of possibilities. We may as well use byte counts at
> > this point, expecially when you consider page sizes differ on
> > different platforms...
> > 
> 
> Byte counts by themselves don't seem that bad to me. It also makes
> sense. But I would still note that for that memory to be useful, it must
> be in consumable units. What is a consumable unit depends on where the
> pressure comes from.
> 
> An object is a consumable unit, so if the pressure is coming from the
> dentry cache, it is great to free 4000 consumable units - objects. If it
> is coming from the page cache, it would be better to free as many pages
> as we can, because that is what can be used.

For the page cache, the consumable unit is a page. This is what I've
been trying to say - you're looking at the page cache as pages, not
as consumable units. There's nothing special about the page cache,
it's just full of objects that are a single page in size, and so
generic shrinker algorithms can be applied to them just as easily as
the dentry cache...



> >>> My care factor mostly ends here, as it will allow XFS to correctly
> >>> balance the metadata buffer cache (variable size objects) against the
> >>> inode, dentry and dquot caches which are object based. The next
> >>> steps that I'm about to give you are based on some discussions with
> >>> some MM people over bottles of red wine, so take it with a grain of
> >>> salt...
> >>>
> >>> 	- calculate a "pressure" value for each cache controlled by a
> >>> 	  shrinker so that the relative memory pressure between
> >>> 	  caches can be compared. This allows the shrinkers to bias
> >>> 	  reclaim based on where the memory pressure is being
> >>> 	  generated
> >>>
> >>
> >> Ok, if a cache is using a lot of memory, this would indicate it has the
> >> dominant workload, right?
> > 
> > Not necessarily. Someone might just have run a find across their
> > filesystem, and that is where all the pressure is coming from. In
> > this case, you don't want that memory pressure to toss out all the
> > other caches. I suspect that the "pressure" measure is going to need
> > to take into account cache hit rates to work properly...
> > 
> Indeed you are right, memory usage is a very bad estimate. Let me stand
> corrected: memory usage over a sliding time window, or something like that.

/me handwaves about how it might work.

This will need a lot of active research, so coming up with something
that works without heuristics is something that is way beyond what
I'm considering right now.

> >> Should we free from it, or should we free from
> >> the others, so this ones gets the pages it needs?
> > 
> > That's the million dollar question...
> 
> For god's sake, it's time we stop this bullshit of telling people
> "that's the million dollar question". Those fucktards are printing money
> like crazy, so we need to correct it for inflation, at the very least.
> 
> This is the USD 1,289,648.82 question, using Y2K as a baseline.

:)

> 
> 
> >>> 	- start grouping shrinkers into a hierarchy, allowing
> >>> 	  related shrinkers (e.g. all the caches in a memcg) to be
> >>> 	  shrunk according resource limits that can be placed on the
> >>> 	  group. i.e. memory pressure is proportioned across
> >>> 	  groups rather than many individual shrinkers.
> >>>
> >> pages are already grouped like that!
> > 
> > But shrinkers and slab caches are not.
> > 
> > Besides, once you start grouping shrinkers, why should we treat the
> > page LRU list scanning any differently from any other cache that has
> > a shrinker? 
> > 
> 
> Dunno. Should we?

Another USD 1,289,648.82 question ;)

We currently base all our memory reclaim decisions on contents of
the LRU lists without taking into account the various other caches
in the system. I think we really should be taking into account the
usages of all the caches in the system before deciding exactly what
we should be trying to reclaim and where we reclaim from.

That kind of implies a level playing field for all caches, rather
than the primary (page LRUs) vs secondary (shrinkers) reclaim
architecture we have now. Given the observation we could treat the
page LRUs as an object based shrinker implementation....
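
As a thought experiment - nothing like this exists and the names below
are invented - the page LRUs would then just be another count/scan pair:

static long page_lru_count(struct shrinker *shrink,
			   struct shrink_control *sc)
{
	/* how many single-page "objects" this node holds */
	return node_page_state(sc->nid, NR_ACTIVE_FILE) +
	       node_page_state(sc->nid, NR_INACTIVE_FILE);
}

static long page_lru_scan(struct shrinker *shrink,
			  struct shrink_control *sc)
{
	/* shrink_node_pages() is a stand-in for the existing page
	 * reclaim path, driven by the same scan target as any other
	 * shrinker */
	return shrink_node_pages(sc->nid, sc->nr_to_scan, sc->gfp_mask);
}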

Cheers,

Dave.
-- 
Dave Chinner
david@fromorbit.com


^ permalink raw reply	[flat|nested] 57+ messages in thread

* Re: [RFC, PATCH 00/19] Numa aware LRU lists and shrinkers
  2012-11-29 19:02 ` [RFC, PATCH 00/19] Numa aware LRU lists and shrinkers Andi Kleen
@ 2012-11-29 22:09   ` Dave Chinner
  0 siblings, 0 replies; 57+ messages in thread
From: Dave Chinner @ 2012-11-29 22:09 UTC (permalink / raw)
  To: Andi Kleen; +Cc: glommer, linux-kernel, linux-fsdevel, linux-mm, xfs

On Thu, Nov 29, 2012 at 11:02:24AM -0800, Andi Kleen wrote:
> Dave Chinner <david@fromorbit.com> writes:
> >
> > Comments, thoughts and flames all welcome.
> 
> Doing the reclaim per CPU sounds like a big change in the VM balance. 

It's per node, not per CPU. And AFAICT, it hasn't changed the
balance of page cache vs inode/dentry caches under general, global
workloads at all.

> Doesn't this invalidate some zone reclaim mode settings?

No, because zone reclaim is per-node and the shrinkers now can
reclaim just from a single node. i.e. the behaviour is now better
suited to the aims of zone reclaim which is to free memory from a
single, targetted node. Indeed, I removed a hack in the zone reclaim
code that sprayed slab reclaim across the entire machine until
sufficient objects had been freed from the target node....

> How did you validate all this?

fakenuma setups, various workloads that generate even dentry/slab
cache loadings across all nodes, adding page cache pressure on a
single node, and watching slab reclaim from a single node. That sort of
thing.

I haven't really done any performance testing other than "not
obviously slower". There's no point optimising anything before
there's any sort of agreement as to whether this is the right
approach to take or not....

Cheers,

Dave.
-- 
Dave Chinner
david@fromorbit.com


^ permalink raw reply	[flat|nested] 57+ messages in thread

* Re: [PATCH 05/19] shrinker: convert superblock shrinkers to new API
  2012-11-27 23:14 ` [PATCH 05/19] shrinker: convert superblock shrinkers to new API Dave Chinner
@ 2012-12-20 11:06   ` Glauber Costa
  2012-12-21  1:46     ` Dave Chinner
  0 siblings, 1 reply; 57+ messages in thread
From: Glauber Costa @ 2012-12-20 11:06 UTC (permalink / raw)
  To: Dave Chinner; +Cc: linux-kernel, linux-fsdevel, linux-mm, xfs

On 11/28/2012 03:14 AM, Dave Chinner wrote:
> +static long super_cache_count(struct shrinker *shrink, struct shrink_control *sc)
> +{
> +	struct super_block *sb;
> +	long	total_objects = 0;
> +
> +	sb = container_of(shrink, struct super_block, s_shrink);
> +
> +	if (!grab_super_passive(sb))
> +		return -1;
> +


You're missing the GFP_FS check here. This leads to us doing all the
counting only to find out later, in the scanner, that we won't be able
to free it. Better exit early.
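
i.e. something along the lines of the sketch below - hand-waving the
actual counting with a made-up count_sb_cached_objects() helper, just to
show where I think the check belongs:

static long super_cache_count(struct shrinker *shrink,
			      struct shrink_control *sc)
{
	struct super_block *sb;
	long total_objects;

	/* no point counting if the scanner can't free anything anyway */
	if (!(sc->gfp_mask & __GFP_FS))
		return -1;

	sb = container_of(shrink, struct super_block, s_shrink);
	if (!grab_super_passive(sb))
		return -1;

	total_objects = count_sb_cached_objects(sb, sc);	/* stand-in */
	drop_super(sb);
	return total_objects;
}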


^ permalink raw reply	[flat|nested] 57+ messages in thread

* Re: [PATCH 09/19] list_lru: per-node list infrastructure
  2012-11-27 23:14 ` [PATCH 09/19] list_lru: per-node " Dave Chinner
@ 2012-12-20 11:21   ` Glauber Costa
  2012-12-21  1:54     ` Dave Chinner
  2013-01-16 19:21   ` Glauber Costa
  1 sibling, 1 reply; 57+ messages in thread
From: Glauber Costa @ 2012-12-20 11:21 UTC (permalink / raw)
  To: Dave Chinner; +Cc: linux-kernel, linux-fsdevel, linux-mm, xfs

On 11/28/2012 03:14 AM, Dave Chinner wrote:
> From: Dave Chinner <dchinner@redhat.com>
> 
> Now that we have an LRU list API, we can start to enhance the
> implementation.  This splits the single LRU list into per-node lists
> and locks to enhance scalability. Items are placed on lists
> according to the node the memory belongs to. To make scanning the
> lists efficient, also track whether the per-node lists have entries
in them in an active nodemask.
> 

I think it is safe to assume that these functions could benefit from
having more metadata available for them when they run.

Let's say for instance that a hypothetical person, for some unknown
reasons, comes with the idea of replicating those lists transparently
per memcg.

In this case, it is very useful to know which memcg drives the current
call. In general, the struct shrink_control already contains a lot of
data that we use to drive the process. Wouldn't it make sense to also
pass shrink_control as data to those lists as well?

The only drawback of this, is that it would tie it to the shrinking
process. I am not sure if this is a concern, but if it is, maybe we
could replace things like:

+static long
+list_lru_walk_node(
+	struct list_lru		*lru,
+	int			nid,
+	list_lru_walk_cb	isolate,
+	void			*cb_arg,
+	long			*nr_to_walk)
+{

with

+static long
+list_lru_walk_node(
+	struct list_lru		*lru,
+       struct something_like_shrink_control_not_shrink_control *a)
+{

This way we can augment the data available for the interface, for
instance, passing the memcg context, without having to patch all the callers.
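
Something along these lines - the struct name and the memcg field are
just for illustration:

/*
 * Illustrative only: a walk-control structure, so that new context
 * (e.g. a memcg pointer) can be added without touching every caller.
 */
struct list_lru_walk_control {
	int			nid;		/* which per-node list to walk */
	list_lru_walk_cb	isolate;	/* item isolation callback */
	void			*cb_arg;	/* opaque argument for isolate */
	long			nr_to_walk;	/* walk budget, updated as we go */
	struct mem_cgroup	*memcg;		/* NULL for the global lists */
};

static long list_lru_walk_node(struct list_lru *lru,
			       struct list_lru_walk_control *lwc);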


^ permalink raw reply	[flat|nested] 57+ messages in thread

* Re: [RFC, PATCH 00/19] Numa aware LRU lists and shrinkers
  2012-11-27 23:14 [RFC, PATCH 00/19] Numa aware LRU lists and shrinkers Dave Chinner
                   ` (19 preceding siblings ...)
  2012-11-29 19:02 ` [RFC, PATCH 00/19] Numa aware LRU lists and shrinkers Andi Kleen
@ 2012-12-20 11:45 ` Glauber Costa
  2012-12-21  2:50   ` Dave Chinner
  2013-01-21 16:08 ` Glauber Costa
  21 siblings, 1 reply; 57+ messages in thread
From: Glauber Costa @ 2012-12-20 11:45 UTC (permalink / raw)
  To: Dave Chinner; +Cc: linux-kernel, linux-fsdevel, linux-mm, xfs

On 11/28/2012 03:14 AM, Dave Chinner wrote:
> Hi Glauber,
> 
> Here's a working version of my patchset for generic LRU lists and
> NUMA-aware shrinkers.
> 
> There are several parts to this patch set. The NUMA aware shrinkers
> are based on having a generic node-based LRU list implementation,
> and there are subsystems that need to be converted to use these
> lists as part of the process. There is also a long overdue change to
> the shrinker API to give it separate object count and object scan
> callbacks, getting rid of the magic and confusing "nr_to_scan = 0"
> semantics.
> 
> First of all, the patch set is based on a current 3.7-rc7 tree with
> the current xfs dev tree merged into it [can be found at
> git://oss.sgi.com/xfs/xfs]. That's because there are lots of XFS
> changes in the patch set, and theres no way I'm going to write them
> a second time in a couple of weeks when the current dev tree is
> merged into 3.8-rc1....
> 
> So, where's what the patches do:
> 
> [PATCH 01/19] dcache: convert dentry_stat.nr_unused to per-cpu
> [PATCH 02/19] dentry: move to per-sb LRU locks
> [PATCH 03/19] dcache: remove dentries from LRU before putting on
> 
> These three patches are preparation of the dcache for moving to the
> generic LRU list API. it basically gets rid of the global dentry LRU
> lock, and in doing so has to avoid several creative abuses of the
> lru list detection to allow dentries on shrink lists to be still
> magically be on the LRU list. The main change here is that now
> dentries on the shrink lists *must* have the DCACHE_SHRINK_LIST flag
> set and be entirely removed from the LRU before being disposed of.
> 
> This is probably a good cleanup to do regardless of the rest of the
> patch set because it removes a couple of landmines in
> shrink_dentry_list() that took me a while to work out...
> 
> [PATCH 04/19] mm: new shrinker API
> [PATCH 05/19] shrinker: convert superblock shrinkers to new API
> 
> These introduce the count/scan shrinker API, and for testing
> purposes convert the superblock shrinker to use it before any other
> changes are made. This gives a clean separation of counting the
> number of objects in a cache for pressure calculations, and the act
> of scanning objects in an attempt to free memory. Indeed, the
> scan_objects() callback now returns the number of objects freed by
> the scan instead of having to try to work out whether any progress
> was made by comparing absolute counts.
> 
> This is also more efficient as we don't have to count all the
> objects in a cache on every scan pass. It is now done once per
> shrink_slab() invocation to calculate how much to scan, and we get
> direct feedback on how much gets reclaimed in that pass. i.e. we get
> reliable, accurate feedback on shrinker progress.
> 
> [PATCH 06/19] list: add a new LRU list type
> [PATCH 07/19] inode: convert inode lru list to generic lru list
> [PATCH 08/19] dcache: convert to use new lru list infrastructure
> 
> These add the generic LRU list API and infrastructure and convert
> the inode and dentry caches to use it. This is still just a single
> global list per LRU at this point, so it's really only changing the
> where the LRU implemenation is rather than the fundamental
> algorithm. It does, however, introduce a new method of walking the
> LRU lists and building the dispose list of items for shrinkers, but
> because we are still dealing with a global list the algorithmic
> changes are minimal.
> 
> [PATCH 09/19] list_lru: per-node list infrastructure
> 
> This makes the generic LRU list much more scalable by changing it to
> a {list,lock,count} tuple per node. There are no external API
> changes to this changeover, so is transparent to current users.
> 
> [PATCH 10/19] shrinker: add node awareness
> [PATCH 11/19] fs: convert inode and dentry shrinking to be node
> 
> Adds a nodemask to the struct shrink_control for callers of
> shrink_slab to set appropriately for their reclaim context. This
> nodemask is then passed by the inode and dentry cache reclaim code
> to the generic LRU list code to implement node aware shrinking.
> 
> What this doesn't do is convert the internal shrink_slab() algorithm
> to be node aware. I'm not sure what the best approach is here, but
> it strikes me that it should really be calculating and keeping track
> of scan counts and pressure on a per-node basis. The current code
> seems to work OK at the moment, though.
> 
> [PATCH 12/19] xfs: convert buftarg LRU to generic code
> [PATCH 13/19] xfs: Node aware direct inode reclaim
> [PATCH 14/19] xfs: use generic AG walk for background inode reclaim
> [PATCH 15/19] xfs: convert dquot cache lru to list_lru
> 
> These patches convert all the XFS LRUs and shrinkers to be node
> aware. This gets rid of a lot of hairy, special case code in the
> inode cache shrinker for avoiding concurrent shrinker contention and
> to throttle direct reclaim to prevent premature OOM conditions.
> Removing this code greatly simplifies inode cache reclaim whilst
> reducing overhead and improving performance. In all, it converts
> three separate caches and shrinkers to use the generic LRU lists and
> pass nodemasks around appropriately.
> 
> This is how I've really tested the code - lots of interesting
> filesystem workloads that generate simultaneous slab and page cache
> pressure on VM's with and without fake_numa configs....
> 
> [PATCH 16/19] fs: convert fs shrinkers to new scan/count API
> [PATCH 17/19] drivers: convert shrinkers to new count/scan API
> [PATCH 18/19] shrinker: convert remaining shrinkers to count/scan
> [PATCH 19/19] shrinker: Kill old ->shrink API.
> 
> These last three patches convert all the other shrinker
> implementations to the new count/scan API.  The fs, android and dm
> shrinkers are pretty well behaved and are implemented as is expected
> for there intended purposes. The driver and staging code, however,
> is basically a bunch of hacks to try to do something resembling
> reclaim when a shrinker tells it there is memory pressure. Looking
> at all the driver and staging code is not an exercise I recommend if
> you value your eyes and/or your sanity.
> 
> I haven't even tried to compile this on a CONFIG_SMP=n
> configuration, nor have I done extensive allmod style build tests
> and it's only been built and tested on x86-64. That said, apart from
> CONFIG_SMP=n, I don't see there being any major problems here.
> 
> There's still a bunch of cleanup work needed. e.g. the LRU list
> walk/isolation code needs to use enums for the isolate callback
> return code, there needs to be a generic list_lru_for_each() style
> function for walking all the objects in the cache (which will allow
> the list_lru structures to be used for things like the per-sb inode
> list). Indeed, even the name "list_lru" is probably something that
> should be changed - I think the list has become more of a general
> per-node list than it's initial functionality as a scalable LRU list
> implementation and I can see uses for it outside of LRUs...
> 
> Comments, thoughts and flames all welcome.
> 

I like the general idea, and after a small PoC on my side, I can say it
can at least provide us with a good and sound route to solve the
targetted memcg shrinking problem.

I've already provided you some small feedback about the interface in the
specific patches.

But on a broader sense: The only thing that still bothers me personally
(meaning: it created particular pain points), is the very loose coupling
between all the elements involved in the shrinking process:

1) the shrinker, always present
2) the lru, usually present
3) the cache, usually present, specially when there is an LRU.

I of course understand that they are not always present, and when they
are, they are not in a 1:1 relation.

But still, it would be nice to be able to register them to one another,
so that we can easily answer things like:

"Given a set of caches, what is the set of shrinkers that will shrink them?"

"What are the lrus that are driven by this shrinker?"

This would allow me to do things like this:

* When a per-memcg cache is created (not all of the caches are
replicated), find the shrinkers that can shrink them.

* For each shrinker, also replicate the LRUs that are driven by them.

Does that make any sense to you ?






^ permalink raw reply	[flat|nested] 57+ messages in thread

* Re: [PATCH 05/19] shrinker: convert superblock shrinkers to new API
  2012-12-20 11:06   ` Glauber Costa
@ 2012-12-21  1:46     ` Dave Chinner
  2012-12-21 10:17       ` Glauber Costa
  0 siblings, 1 reply; 57+ messages in thread
From: Dave Chinner @ 2012-12-21  1:46 UTC (permalink / raw)
  To: Glauber Costa; +Cc: linux-kernel, linux-fsdevel, linux-mm, xfs

On Thu, Dec 20, 2012 at 03:06:42PM +0400, Glauber Costa wrote:
> On 11/28/2012 03:14 AM, Dave Chinner wrote:
> > +static long super_cache_count(struct shrinker *shrink, struct shrink_control *sc)
> > +{
> > +	struct super_block *sb;
> > +	long	total_objects = 0;
> > +
> > +	sb = container_of(shrink, struct super_block, s_shrink);
> > +
> > +	if (!grab_super_passive(sb))
> > +		return -1;
> > +
> 
> 
> You're missing the GFP_FS check here. This leads to us doing all the
> counting only to find out later, in the scanner, that we won't be able
> to free it. Better exit early.

No, I did that intentionally.

The shrinker has a method of deferring work from one invocation to
another - the shrinker->nr_in_batch variable. This is intended to be
used to ensure that a shrinker does the work it is supposed to, even
if it can't do the work immediately due to something like a GFP
context mismatch.

The problem with that mechanism right now is that it is not applied
consistently across the shrinkers. Some shrinkers will return a
count whenever nr_to_scan == 0, regardless of the gfp_mask, while
others will immediately return -1.

What this patch set does is make the shrinkers *always* return the
count of objects so the scan count can be calculated, and then let
the actual scanner determine whether progress can be made. The
result of doing this is that if the scanner cannot make progress,
the work is correctly deferred to the next shrinker invocation that
may be made under a different GFP context.

This is important because when you have a workload that involves a
lot of filesystem modifications, the number of GFP_NOFS allocations
greatly outweighs GFP_KERNEL allocations. Hence the majority of the
time we try to shrink the filesystem caches, they cannot do any
work. Hence we need the work to be deferred to the next GFP_KERNEL
shrinker invocation so the reclaim of the caches remains in balance.

This is also the reason for "We need to avoid excessive windup on
filesystem shrinkers" limiting of total_scan, so that we don't allow
this deferral to completely trash the caches when so much deferral
happens that the scan count grows to exceed the size of the cache
and we get a GFP_KERNEL reclaim context...

IOWs, for this deferral mechanism to work consistently, we always
need to calculate the amount of work we are supposed to do when the
shrinker is invoked. That means we always need to return the current
count of objects and calculate the amount of scanning we need to
do. The check in the scan context determines if the work then gets
deferred or not....
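
i.e. the scan side ends up looking something like this sketch, where
prune_super_caches() is just a stand-in for the real pruning code:

static long super_cache_scan(struct shrinker *shrink,
			     struct shrink_control *sc)
{
	/*
	 * GFP context mismatch: return -1 so the work calculated from
	 * the count is deferred via nr_in_batch and picked up by a
	 * later GFP_KERNEL invocation rather than being dropped.
	 */
	if (!(sc->gfp_mask & __GFP_FS))
		return -1;

	/* stand-in for the real dentry/inode/fs cache pruning */
	return prune_super_caches(shrink, sc);
}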

Cheers,

Dave.
-- 
Dave Chinner
david@fromorbit.com


^ permalink raw reply	[flat|nested] 57+ messages in thread

* Re: [PATCH 09/19] list_lru: per-node list infrastructure
  2012-12-20 11:21   ` Glauber Costa
@ 2012-12-21  1:54     ` Dave Chinner
  0 siblings, 0 replies; 57+ messages in thread
From: Dave Chinner @ 2012-12-21  1:54 UTC (permalink / raw)
  To: Glauber Costa; +Cc: linux-kernel, linux-fsdevel, linux-mm, xfs

On Thu, Dec 20, 2012 at 03:21:26PM +0400, Glauber Costa wrote:
> On 11/28/2012 03:14 AM, Dave Chinner wrote:
> > From: Dave Chinner <dchinner@redhat.com>
> > 
> > Now that we have an LRU list API, we can start to enhance the
> > implementation.  This splits the single LRU list into per-node lists
> > and locks to enhance scalability. Items are placed on lists
> > according to the node the memory belongs to. To make scanning the
> > lists efficient, also track whether the per-node lists have entries
> > in them in an active nodemask.
> > 
> 
> I think it is safe to assume that these functions could benefit from
> having more metadata available for them when they run.
> 
> Let's say for instance that a hypothetical person, for some unknown
> reasons, comes with the idea of replicating those lists transparently
> per memcg.
> 
> In this case, it is very useful to know which memcg drives the current
> call. In general, the struct shrink_control already contains a lot of
> data that we use to drive the process. Wouldn't it make sense to also
> pass shrink_control as data to those lists as well?

I considered it, but:

> The only drawback of this, is that it would tie it to the shrinking
> process.

and that's exactly what I didn't want to do. Yes, the shrinkers need
to walk the list, but they will not be/are not the only reason we
need to walk lists and isolate items....

> I am not sure if this is a concern, but it if is, maybe we
> could replace things like :
> 
> +static long
> +list_lru_walk_node(
> +	struct list_lru		*lru,
> +	int			nid,
> +	list_lru_walk_cb	isolate,
> +	void			*cb_arg,
> +	long			*nr_to_walk)
> +{
> 
> with
> 
> +static long
> +list_lru_walk_node(
> +	struct list_lru		*lru,
> +       struct something_like_shrink_control_not_shrink_control *a)
> +{
> 
> This way we can augment the data available for the interface, for
> instance, passing the memcg context, without having to patch all the callers.

Yes, that is also something I considered. I just never got around to
doing it as I wasn't sure whether the walking interface would be
acceptable in the first place. If we do use the list walk interface
like this, then we should definitely encapsulate all the parameters
in a struct something_like_shrink_control_not_shrink_control. :)

Cheers,

Dave.
-- 
Dave Chinner
david@fromorbit.com


^ permalink raw reply	[flat|nested] 57+ messages in thread

* Re: [RFC, PATCH 00/19] Numa aware LRU lists and shrinkers
  2012-12-20 11:45 ` Glauber Costa
@ 2012-12-21  2:50   ` Dave Chinner
  2012-12-21 10:41     ` Glauber Costa
  0 siblings, 1 reply; 57+ messages in thread
From: Dave Chinner @ 2012-12-21  2:50 UTC (permalink / raw)
  To: Glauber Costa; +Cc: linux-kernel, linux-fsdevel, linux-mm, xfs

On Thu, Dec 20, 2012 at 03:45:28PM +0400, Glauber Costa wrote:
> On 11/28/2012 03:14 AM, Dave Chinner wrote:
> > Hi Glauber,
> > 
> > Here's a working version of my patchset for generic LRU lists and
> > NUMA-aware shrinkers.
.....
> > There's still a bunch of cleanup work needed. e.g. the LRU list
> > walk/isolation code needs to use enums for the isolate callback
> > return code, there needs to be a generic list_lru_for_each() style
> > function for walking all the objects in the cache (which will allow
> > the list_lru structures to be used for things like the per-sb inode
> > list). Indeed, even the name "list_lru" is probably something that
> > should be changed - I think the list has become more of a general
> > per-node list than it's initial functionality as a scalable LRU list
> > implementation and I can see uses for it outside of LRUs...
> > 
> > Comments, thoughts and flames all welcome.
> > 
> 
> I like the general idea, and after a small PoC on my side, I can say it
> can at least provide us with a good and sound route to solve the
> targetted memcg shrinking problem.
> 
> I've already provided you some small feedback about the interface in the
> specific patches.

*nod*

> But on a broader sense: The only thing that still bothers me personally
> (meaning: it created particular pain points), is the very loose coupling
> between all the elements involved in the shrinking process:
> 
> 1) the shrinker, always present
> 2) the lru, usually present
> 3) the cache, usually present, specially when there is an LRU.
> 
> I of course understand that they are not always present, and when they
> are, they are not in a 1:1 relation.
> 
> But still, it would be nice to be able to register them to one another,
> so that we can easily answer things like:
> 
> "Given a set of caches, what is the set of shrinkers that will shrink them?"
> 
> "What are the lrus that are driven by this shrinker?"
> 
> This would allow me to do things like this:
> 
> * When a per-memcg cache is created (not all of the caches are
> replicated), find the shrinkers that can shrink them.
> 
> * For each shrinker, also replicate the LRUs that are driven by them.
> 
> Does that make any sense to you ?

It certainly does, though I see that as a separate problem to the
one that this patch set solves. i.e. this is an issue related to the
scope and context of a shrinker/LRU couplet, rather than the
implementation of a shrinker/LRU couplet. This patchset addresses
the latter of the two, and I'm pretty sure that I mentioned that the
former was not a problem I am trying to solve yet....

As it is, right now we embed the struct shrinker into the owner
context, and that's how we find the LRU/cache instances that the
shrinker operates on. In the case of per-memcg shrinker
instantiation that fixed relationship does not work.

Further, the struct shrinker has context specific information in it,
like the deferred scan count that it carries from one invocation to
the next, so what we end up with is a tightly coupled owner/shrinker
relationship.  That is, a shrinker is really made up of four things:

	- a shrinker definition (set of methods and configuration
	  data)
	- a non-volatile set of data
	- the owner context
	- the LRU/cache to be shrunk

I suspect that a shrinker instance would look something like this:

shrinker_instance {
	non-volatile set of data
	LRU/cache to be shrunk
	pointer to the owner context
	pointer to the shrinker definition
}
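
In C that might look something like the following - purely illustrative,
not a proposal:

struct shrinker_instance {
	long			nr_deferred;	/* non-volatile scan debt */
	struct list_lru		*lru;		/* LRU/cache to be shrunk */
	void			*owner;		/* owner context (sb, memcg, ...) */
	const struct shrinker	*ops;		/* shrinker definition */
	struct list_head	list;		/* registration with a group */
};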

But I'm not really sure how to group them sanely, how to know what
shrinkers would need multiple instantiation and when you'd do that
instantiation, or even how an owner context would then do global
operations (e.g empty caches prior to unmount).

I simply don't know what the requirements for such infrastructure
are, so I can't really say much more than this. Hence I think the
first thing to do here is work out and document what such
instantiation, tracking and grouping needs to be able to do before
anything else...

Cheers,

Dave.
-- 
Dave Chinner
david@fromorbit.com


^ permalink raw reply	[flat|nested] 57+ messages in thread

* Re: [PATCH 05/19] shrinker: convert superblock shrinkers to new API
  2012-12-21  1:46     ` Dave Chinner
@ 2012-12-21 10:17       ` Glauber Costa
  0 siblings, 0 replies; 57+ messages in thread
From: Glauber Costa @ 2012-12-21 10:17 UTC (permalink / raw)
  To: Dave Chinner; +Cc: linux-kernel, linux-fsdevel, linux-mm, xfs

On 12/21/2012 05:46 AM, Dave Chinner wrote:
> On Thu, Dec 20, 2012 at 03:06:42PM +0400, Glauber Costa wrote:
>> On 11/28/2012 03:14 AM, Dave Chinner wrote:
>>> +static long super_cache_count(struct shrinker *shrink, struct shrink_control *sc)
>>> +{
>>> +	struct super_block *sb;
>>> +	long	total_objects = 0;
>>> +
>>> +	sb = container_of(shrink, struct super_block, s_shrink);
>>> +
>>> +	if (!grab_super_passive(sb))
>>> +		return -1;
>>> +
>>
>>
>> You're missing the GFP_FS check here. This leads to us doing all the
>> counting only to find out later, in the scanner, that we won't be able
>> to free it. Better exit early.
> 
> No, I did that intentionally.
> 
> The shrinker has a method of deferring work from one invocation to
> another - the shrinker->nr_in_batch variable. This is intended to be
> used to ensure that a shrinker does the work it is supposed to, even
> if it can't do the work immediately due to something like a GFP
> context mismatch.
> 
> The problem with that mechanism right now is that it is not applied
> consistently across the shrinkers. Some shrinkers will return a
> count whenever nr_to_scan == 0, regardless of the gfp_mask, while
> others will immediately return -1.
> 
> What this patch set does is make the shrinkers *always* return the
> count of objects so the scan count can be calculated, and then let
> the actual scanner determine whether progress can be made. The
> result of doing this is that if the scanner cannot make progress,
> the work is correctly deferred to the next shrinker invocation that
> may be made under a different GFP context.
> 
> This is important because when you have a workload that involves a
> lot of filesystem modifications, the number of GFP_NOFS allocations
> greatly outweighs GFP_KERNEL allocations. Hence the majority of the
> time we try to shrink the filesystem caches, they cannot do any
> work. Hence we need the work to be deferred to the next GFP_KERNEL
> shrinker invocation so the reclaim of the caches remains in balance.
> 
> This is also the reason for "We need to avoid excessive windup on
> filesystem shrinkers" limiting of total_scan, so that we don't allow
> this deferral to completely trash the caches when so much deferral
> happens that the scan count grows to exceed the size of the cache
> and we get a GFP_KERNEL reclaim context...
> 
> IOWs, for this deferral mechanism to work consistently, we always
> need to calculate the amount of work we are supposed to do when the
> shrinker is invoked. That means we always need to return the current
> count of objects and calculate the amount of scanning we need to
> do. The check in the scan context determines if the work then gets
> deferred or not....
> 


While it makes sense, and it is clear to me now, could you document that
for the benefit of future, less gifted readers such as myself?

Otherwise, should you be bitten by a giant poisonous spider and turn
into a spider-zombie, this knowledge disappears. And you live in
Australia, mate, this is totally possible.



^ permalink raw reply	[flat|nested] 57+ messages in thread

* Re: [RFC, PATCH 00/19] Numa aware LRU lists and shrinkers
  2012-12-21  2:50   ` Dave Chinner
@ 2012-12-21 10:41     ` Glauber Costa
  0 siblings, 0 replies; 57+ messages in thread
From: Glauber Costa @ 2012-12-21 10:41 UTC (permalink / raw)
  To: Dave Chinner; +Cc: linux-kernel, linux-fsdevel, linux-mm, xfs

Hey Dave,

On 12/21/2012 06:50 AM, Dave Chinner wrote:
>> But on a broader sense: The only thing that still bothers me personally
>> > (meaning: it created particular pain points), is the very loose coupling
>> > between all the elements involved in the shrinking process:
>> > 
>> > 1) the shrinker, always present
>> > 2) the lru, usually present
>> > 3) the cache, usually present, specially when there is an LRU.
>> > 
>> > I of course understand that they are not always present, and when they
>> > are, they are not in a 1:1 relation.
>> > 
>> > But still, it would be nice to be able to register them to one another,
>> > so that we can easily answer things like:
>> > 
>> > "Given a set of caches, what is the set of shrinkers that will shrink them?"
>> > 
>> > "What are the lrus that are driven by this shrinker?"
>> > 
>> > This would allow me to do things like this:
>> > 
>> > * When a per-memcg cache is created (not all of the caches are
>> > replicated), find the shrinkers that can shrink them.
>> > 
>> > * For each shrinker, also replicate the LRUs that are driven by them.
>> > 
>> > Does that make any sense to you ?
> It certainly does, though I see that as a separate problem to the
> one that this patch set solves. i.e. this is an issue related to the
> scope and context of a shrinker/LRU couplet, rather than the
> implementation of a shrinker/LRU couplet. This patchset addresses
> the latter of the two, and I'm pretty sure that I mentioned that the
> former was not a problem I am trying to solve yet....

Yes. And now I am pretty sure that I mentioned that this is actually a
problem!

Really, the only major change I would like to see made in this patchset
is the provision of more context to the lru functions. It is fine if it
is a lru-specific context.

All the rest is details.

> 
> As it is, right now we embed the struct shrinker into the owner
> context, and that's how we find the LRU/cache instances that the
> shrinker operates on. In the case of per-memcg shrinker
> instantiation that fixed relationship does not work.

Yes, and there is also another problem that arises from it:

the superblock is the owner, but superblocks are no memcg's business. It
is caches that we track, so it would be good to determine which
shrinkers are responsible for which caches -> an n:m relationship.

Those are the ones we need to run per-memcg. Otherwise, we need to go
mess with them all.

> 
> Further, the struct shrinker has context specific information in it,
> like the deferred scan count that it carries from one invocation to
> the next, so what we end up with is a tightly coupled owner/shrinker
> relationship.  That is, a shrinker is really made up of four things:
> 
> 	- a shrinker definition (set of methods and configuration
> 	  data)
> 	- a non-volatile set of data
> 	- the owner context
> 	- the LRU/cache to be shrunk
> 
> I suspect that a shrinker instance would look something like this:
> 
> shrinker_instance {
> 	non-volatile set of data
> 	LRU/cache to be shrunk
> 	pointer to the owner context
> 	pointer to the shrinker definition
> }
> 
> But I'm not really sure how to group them sanely, how to know what
> shrinkers would need multiple instantiation and when you'd do that
> instantiation, or even how an owner context would then do global
> operations (e.g empty caches prior to unmount).

Ok, so right now, what my PoC does, is to leave the shrinkers mostly
alone (I just have a static flag to indicate that this is a possible
per-memcg candidate).

The LRUs are all copied over (a bit of a waste, but this is still a
PoC), and whenever the shrinker is called, we trust that the LRUs will
be available, and then using the memcg pointer, derive the
memcg-specific LRUs from the global, original LRUs.
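
Roughly like this - everything here is PoC-level and the names are
invented:

/* given the global lru and the memcg this invocation is acting on
 * behalf of, find the per-memcg replica */
static struct list_lru *memcg_list_lru(struct list_lru *global_lru,
				       struct mem_cgroup *memcg)
{
	if (!memcg)
		return global_lru;	/* global reclaim: use the original */

	/* memcg_lrus[] is a hypothetical per-memcg array, indexed by an
	 * id assigned to each replicated global lru */
	return memcg->memcg_lrus[global_lru->lru_id];
}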

What to do with the non-volatile set of data belonging to the shrinker
is still unsolved.

So the full solution, would be to create a isolated copy of all that.

"Which shrinkers to copy" is just an item in the wishlist. We don't
replicate all caches because some of them will truly never be used
in a typical container, but most of the shrinkers are fs-related anyway,
and we'll always touch the filesystems. However, we will likely not
touch *all* the filesystems, so a lot of them will just be there,
hanging around and wasting memory.


> 
> I simply don't know what the requirements for such infrastructure
> is, so I can't really say much more than this. Hence I think the
> first thing to do here is work out and document what such
> instantiation, tracking and grouping needs to be able to do before
> anything else...
> 
Yes, let's do it.



^ permalink raw reply	[flat|nested] 57+ messages in thread

* Re: [PATCH 09/19] list_lru: per-node list infrastructure
  2012-11-27 23:14 ` [PATCH 09/19] list_lru: per-node " Dave Chinner
  2012-12-20 11:21   ` Glauber Costa
@ 2013-01-16 19:21   ` Glauber Costa
  2013-01-16 22:55     ` Dave Chinner
  1 sibling, 1 reply; 57+ messages in thread
From: Glauber Costa @ 2013-01-16 19:21 UTC (permalink / raw)
  To: Dave Chinner
  Cc: linux-kernel, linux-fsdevel, linux-mm, xfs, Greg Thelen, Ying Han,
	Suleiman Souhlal

On 11/27/2012 03:14 PM, Dave Chinner wrote:
> From: Dave Chinner <dchinner@redhat.com>
> 
> Now that we have an LRU list API, we can start to enhance the
> implementation.  This splits the single LRU list into per-node lists
> and locks to enhance scalability. Items are placed on lists
> according to the node the memory belongs to. To make scanning the
> lists efficient, also track whether the per-node lists have entries
> in them in a active nodemask.
> 
> Signed-off-by: Dave Chinner <dchinner@redhat.com>
> ---
>  include/linux/list_lru.h |   14 ++--
>  lib/list_lru.c           |  160 +++++++++++++++++++++++++++++++++++-----------
>  2 files changed, 129 insertions(+), 45 deletions(-)
> 
> diff --git a/include/linux/list_lru.h b/include/linux/list_lru.h
> index 3423949..b0e3ba2 100644
> --- a/include/linux/list_lru.h
> +++ b/include/linux/list_lru.h
> @@ -8,21 +8,23 @@
>  #define _LRU_LIST_H 0
>  
>  #include <linux/list.h>
> +#include <linux/nodemask.h>
>  
> -struct list_lru {
> +struct list_lru_node {
>  	spinlock_t		lock;
>  	struct list_head	list;
>  	long			nr_items;
> +} ____cacheline_aligned_in_smp;
> +
> +struct list_lru {
> +	struct list_lru_node	node[MAX_NUMNODES];
> +	nodemask_t		active_nodes;
>  };
>  
MAX_NUMNODES will default to 1 << 9, if I'm not mistaken. Your
list_lru_node seems to be around 32 bytes on 64-bit systems (128 with
debug). So we're talking about 16k per lru.
Superblocks alone are present by the dozens even in a small system,
and I believe the whole goal of this API is to get more users to switch
to it. This can easily use up a respectable bunch of megs.

Isn't it a bit too much ?

I am wondering if we can't do better here and at least allocate+grow
according to the actual number of nodes.
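
Just to make the suggestion concrete, a rough (untested) sketch of what
I mean - sizing the array by nr_node_ids at init time instead of
embedding MAX_NUMNODES entries:

	struct list_lru {
		struct list_lru_node	*node;	/* nr_node_ids entries */
		nodemask_t		active_nodes;
	};

	int list_lru_init(struct list_lru *lru)
	{
		int i;

		/* size the array by the nodes this boot actually has */
		lru->node = kcalloc(nr_node_ids, sizeof(*lru->node),
				    GFP_KERNEL);
		if (!lru->node)
			return -ENOMEM;

		nodes_clear(lru->active_nodes);
		for (i = 0; i < nr_node_ids; i++) {
			spin_lock_init(&lru->node[i].lock);
			INIT_LIST_HEAD(&lru->node[i].list);
			lru->node[i].nr_items = 0;
		}
		return 0;
	}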


^ permalink raw reply	[flat|nested] 57+ messages in thread

* Re: [PATCH 09/19] list_lru: per-node list infrastructure
  2013-01-16 19:21   ` Glauber Costa
@ 2013-01-16 22:55     ` Dave Chinner
  2013-01-17  0:35       ` Glauber Costa
  0 siblings, 1 reply; 57+ messages in thread
From: Dave Chinner @ 2013-01-16 22:55 UTC (permalink / raw)
  To: Glauber Costa
  Cc: linux-kernel, linux-fsdevel, linux-mm, xfs, Greg Thelen, Ying Han,
	Suleiman Souhlal

On Wed, Jan 16, 2013 at 11:21:44AM -0800, Glauber Costa wrote:
> On 11/27/2012 03:14 PM, Dave Chinner wrote:
> > From: Dave Chinner <dchinner@redhat.com>
> > 
> > Now that we have an LRU list API, we can start to enhance the
> > implementation.  This splits the single LRU list into per-node lists
> > and locks to enhance scalability. Items are placed on lists
> > according to the node the memory belongs to. To make scanning the
> > lists efficient, also track whether the per-node lists have entries
> > in them in a active nodemask.
> > 
> > Signed-off-by: Dave Chinner <dchinner@redhat.com>
> > ---
> >  include/linux/list_lru.h |   14 ++--
> >  lib/list_lru.c           |  160 +++++++++++++++++++++++++++++++++++-----------
> >  2 files changed, 129 insertions(+), 45 deletions(-)
> > 
> > diff --git a/include/linux/list_lru.h b/include/linux/list_lru.h
> > index 3423949..b0e3ba2 100644
> > --- a/include/linux/list_lru.h
> > +++ b/include/linux/list_lru.h
> > @@ -8,21 +8,23 @@
> >  #define _LRU_LIST_H 0
> >  
> >  #include <linux/list.h>
> > +#include <linux/nodemask.h>
> >  
> > -struct list_lru {
> > +struct list_lru_node {
> >  	spinlock_t		lock;
> >  	struct list_head	list;
> >  	long			nr_items;
> > +} ____cacheline_aligned_in_smp;
> > +
> > +struct list_lru {
> > +	struct list_lru_node	node[MAX_NUMNODES];
> > +	nodemask_t		active_nodes;
> >  };
> >  
> MAX_NUMNODES will default to 1 << 9, if I'm not mistaken. Your
> list_lru_node seems to be around 32 bytes on 64-bit systems (128 with
> debug). So we're talking about 16k per lru.

*nod*

It is relatively little compared to the number of inodes typically
on an LRU.

> The superblocks only, are present by the dozens even in a small system,
> and I believe the whole goal of this API is to get more users to switch
> to it. This can easily use up a respectable bunch of megs.
> 
> Isn't it a bit too much ?

Maybe, but for active superblocks it only takes a handful of cached
inodes to make this 16k look like noise, so I didn't care. Indeed, a
typical active filesystem could be consuming gigabytes of memory in
the slab, so 16k is a tiny amount of overhead to track this amount
of memory more efficiently.

Most other LRU/shrinkers are tracking large objects and only have a
single LRU instance machine-wide. Hence the numbers argument doesn't
play out well in favour of a more complex, dynamic solution for
them, either. Sometimes dumb and simple is the best approach ;)

> I am wondering if we can't do better in here and at least allocate+grow
> according to the actual number of nodes.

We could add hotplug notifiers and grow/shrink the node array as
nodes get hot plugged, but that seems unnecessarily complex given
how rare such operations are.

If superblock proliferation is the main concern here, then doing
something as simple as allowing filesystems to specify that they want
NUMA-aware LRU lists via a mount_bdev() flag would solve this
problem. If the flag is set, then full NUMA lists are created.
Otherwise the LRU list simply has a "single node" and collapses all node
IDs down to 0 and ignores all NUMA optimisations...

That way the low-item-count virtual filesystems like proc, sysfs,
hugetlbfs, etc. won't use up memory, but filesystems that actually
make use of NUMA awareness still get the more expensive, scalable
implementation. Indeed, any subsystem that is not performance or
location sensitive can use the simple single-list version, so we can
avoid overhead in that manner system-wide...
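
e.g. something like this - a totally untested sketch, with a made-up
flag name and a numnodes field that the posted structure doesn't have,
just to show the "collapse node IDs to zero" idea:

	int list_lru_init_flags(struct list_lru *lru, unsigned int flags)
	{
		/* full per-node lists only if the caller asks for them */
		lru->numnodes = (flags & LIST_LRU_NUMA_AWARE) ? nr_node_ids : 1;
		....
	}

	static int lru_node_index(struct list_lru *lru, void *obj)
	{
		/* non-NUMA users collapse every node ID down to 0 */
		if (lru->numnodes == 1)
			return 0;
		return page_to_nid(virt_to_page(obj));
	}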

Cheers,

Dave.
-- 
Dave Chinner
david@fromorbit.com


^ permalink raw reply	[flat|nested] 57+ messages in thread

* Re: [PATCH 09/19] list_lru: per-node list infrastructure
  2013-01-16 22:55     ` Dave Chinner
@ 2013-01-17  0:35       ` Glauber Costa
  2013-01-17  4:22         ` Dave Chinner
  0 siblings, 1 reply; 57+ messages in thread
From: Glauber Costa @ 2013-01-17  0:35 UTC (permalink / raw)
  To: Dave Chinner
  Cc: linux-kernel, linux-fsdevel, linux-mm, xfs, Greg Thelen, Ying Han,
	Suleiman Souhlal


>> The superblocks only, are present by the dozens even in a small system,
>> and I believe the whole goal of this API is to get more users to switch
>> to it. This can easily use up a respectable bunch of megs.
>>
>> Isn't it a bit too much ?
> 
> Maybe, but for active superblocks it only takes a handful of cached
> inodes to make this 16k look like noise, so I didn't care. Indeed, a
> typical active filesystem could be consuming gigabytes of memory in
> the slab, so 16k is a tiny amount of overhead to track this amount
> of memory more efficiently.
> 
> Most other LRU/shrinkers are tracking large objects and only have a
> single LRU instance machine wide. Hence the numbers arguments don't
> play out well in favour of a more complex, dynamic solution for
> them, either. Sometimes dumb and simple is the best approach ;)
> 

Being dumb and simple myself, I'm of course forced to agree.
Let me give you more context so you can understand my deepest fears better:

>> I am wondering if we can't do better in here and at least allocate+grow
>> according to the actual number of nodes.
> 
> We could add hotplug notifiers and grow/shrink the node array as
> they get hot plugged, but that seems unnecessarily complex given
> how rare such operations are.
> 
> If superblock proliferation is the main concern here, then doing
> somethign as simple as allowing filesystems to specify they want
> numa aware LRU lists via a mount_bdev() flag would solve this
> problem. If the flag is set, then full numa lists are created.
> Otherwise the LRU list simply has a "single node" and collapses all node
> IDs down to 0 and ignores all NUMA optimisations...
> 
> That way the low item count virtual filesystems like proc, sys,
> hugetlbfs, etc won't use up memory, but filesytems that actually
> make use of NUMA awareness still get the more expensive, scalable
> implementation. Indeed, any subsystem that is not performance or
> location sensitive can use the simple single list version, so we can
> avoid overhead in that manner system wide...
> 

Deepest fears:

1) snakes.

2) It won't surprise you to know that I am adapting your work, which
provides a very sane and helpful API, to memcg shrinking.

The dumb and simple approach in there is to copy all lrus that are
marked memcg-aware at memcg creation time. The API is kept the same,
but when you do something like list_lru_add(lru, obj), for instance, we
derive the memcg context from obj and relay it to the right list.

Statically allocating those arrays for all possible combinations is
basically our way of guaranteeing that the lookups will be cheap.
Otherwise, we always need to ask the question "are dentries from this
superblock currently billed to this memcg?", and if so, allocate memory
in list_lru_add() or other steady-state operations.
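
Roughly like this - hypothetical helper names only: memcg_from_obj()
and memcg_list_lru() are made up, and __list_lru_add() just stands for
today's add path:

	void list_lru_add(struct list_lru *lru, struct list_head *item)
	{
		struct mem_cgroup *memcg = memcg_from_obj(item);

		/* relay to the replicated, per-memcg copy of the lru */
		if (memcg)
			lru = memcg_list_lru(lru, memcg);

		__list_lru_add(lru, item);	/* existing add path */
	}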

But of course, if we copy all the per-node stuff as well, we are
basically talking about that amount of memory per memcg, which means
each memcg will now carry an extra overhead of some megs, which is way
too big.

Using online nodes+grow was one of the approaches we considered.
Making the memcg list single-node while keeping the global lists
node-aware also works for us.
Your current suggestion of going per-node only in the
performance-critical filesystems could also possibly work, provided
this count is expected to be small.


^ permalink raw reply	[flat|nested] 57+ messages in thread

* Re: [PATCH 09/19] list_lru: per-node list infrastructure
  2013-01-17  0:35       ` Glauber Costa
@ 2013-01-17  4:22         ` Dave Chinner
  2013-01-17 18:21           ` Glauber Costa
  0 siblings, 1 reply; 57+ messages in thread
From: Dave Chinner @ 2013-01-17  4:22 UTC (permalink / raw)
  To: Glauber Costa
  Cc: linux-kernel, linux-fsdevel, linux-mm, xfs, Greg Thelen, Ying Han,
	Suleiman Souhlal

On Wed, Jan 16, 2013 at 04:35:43PM -0800, Glauber Costa wrote:
> 
> >> The superblocks only, are present by the dozens even in a small system,
> >> and I believe the whole goal of this API is to get more users to switch
> >> to it. This can easily use up a respectable bunch of megs.
> >>
> >> Isn't it a bit too much ?
> > 
> > Maybe, but for active superblocks it only takes a handful of cached
> > inodes to make this 16k look like noise, so I didn't care. Indeed, a
> > typical active filesystem could be consuming gigabytes of memory in
> > the slab, so 16k is a tiny amount of overhead to track this amount
> > of memory more efficiently.
> > 
> > Most other LRU/shrinkers are tracking large objects and only have a
> > single LRU instance machine wide. Hence the numbers arguments don't
> > play out well in favour of a more complex, dynamic solution for
> > them, either. Sometimes dumb and simple is the best approach ;)
> > 
> 
> Being dumb and simple myself, I'm of course forced to agree.

:)

> Let me give you more context so you can understand my deepest fears better:
> 
> >> I am wondering if we can't do better in here and at least allocate+grow
> >> according to the actual number of nodes.
> > 
> > We could add hotplug notifiers and grow/shrink the node array as
> > they get hot plugged, but that seems unnecessarily complex given
> > how rare such operations are.
> > 
> > If superblock proliferation is the main concern here, then doing
> > somethign as simple as allowing filesystems to specify they want
> > numa aware LRU lists via a mount_bdev() flag would solve this
> > problem. If the flag is set, then full numa lists are created.
> > Otherwise the LRU list simply has a "single node" and collapses all node
> > IDs down to 0 and ignores all NUMA optimisations...
> > 
> > That way the low item count virtual filesystems like proc, sys,
> > hugetlbfs, etc won't use up memory, but filesytems that actually
> > make use of NUMA awareness still get the more expensive, scalable
> > implementation. Indeed, any subsystem that is not performance or
> > location sensitive can use the simple single list version, so we can
> > avoid overhead in that manner system wide...
> > 
> 
> Deepest fears:
> 
> 1) snakes.

Snakes are merely poisonous. Drop Bears are far more dangerous :P

> 2) It won't surprise you to know that I am adapting your work, which
> provides a very sane and helpful API, to memcg shrinking.
> 
> The dumb and simple approach in there is to copy all lrus that are
> marked memcg aware at memcg creation time. The API is kept the same,
> but when you do something like list_lru_add(lru, obj), for instance, we
> derive the memcg context from obj and relay it to the right list.

At which point, you don't want the overhead of per-node lists.

The problem I see here is that someone might still need the
scalability of the per-node lists. If someone runs a large memcg in terms
of CPU and memory, then we most definitely are going to need to
retain per-node lists regardless of the fact that the workload is
running in a constrained environment. And if you are running a mix
of large and small containers, then one static solution is not going
to cut it for some workload.

This is a problem that superblock contexts don't care about - they
are global by their very nature. Hence I'm wondering if trying to
fit these two very different behaviours into the one LRU list is
the wrong approach.

Consider this: these patches give us a generic LRU list structure.
It currently uses a list_head in each object for indexing, and we
are talking about single LRU lists because of this limitation and
trying to build infrastructure that can support this indexing
mechanism.

I think that all of these problems go away if we replace the
list_head index in the object with a "struct lru_item" index. To
start with, it's just a s/list_head/lru_item/ changeover, but from
there we can expand.

What I'm getting at is that we want to have multiple axes of
tracking and reclaim, but we only have a single axis for tracking.
If the lru_item grew a second list_head called "memcg_lru", then
suddenly the memcg LRUs can be maintained separately to the global
(per-superblock) LRU. i.e.:

struct lru_item {
	struct list_head global_list;
	struct list_head memcg_list;
}

And then you can use whatever tracking structure you want for a
memcg LRU list. Indeed, this would allow per-node lists for the
global LRU, and whatever LRU type is appropriate for the memcg using
the object (e.g. single list for small memcgs, per-node for large
memcgs)....

i.e. rather than trying to make the infrastructure jump through hoops
to only have one LRU index per object, have a second index that
allows memcgs to have a separate LRU index and a method for the
global LRU structure to find them. This would allow memcg-specific
shrinker callouts to switch to the memcg LRU rather than the global LRU
and operate on that. That way we still only instantiate a single
LRU/shrinker pair per cache context, but the memcg code doesn't need
to duplicate the entire LRU infrastructure into every memcg that
contains that type of object for that cache context....
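
IOWs (hand-waving, not real code, locking omitted), reclaim from either
axis just unhooks the object from both indexes:

	static void lru_item_del(struct lru_item *item)
	{
		/* drop the object from the global (per-node) index ... */
		if (!list_empty(&item->global_list))
			list_del_init(&item->global_list);
		/* ... and from whatever memcg LRU it sits on, if any */
		if (!list_empty(&item->memcg_list))
			list_del_init(&item->memcg_list);
	}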

/me stops rambling....

> Your current suggestion of going per-node only in the performance
> critical filesystems could also possibly work, provided this count is
> expected to be small.

The problem is deciding on a per-filesystem basis. I was thinking
that all filesystems of a specific type would use a particular type
of structure, not that specific instances of a filesystem could use
different types....

Cheers,

Dave.
-- 
Dave Chinner
david@fromorbit.com


^ permalink raw reply	[flat|nested] 57+ messages in thread

* Re: [PATCH 09/19] list_lru: per-node list infrastructure
  2013-01-17  4:22         ` Dave Chinner
@ 2013-01-17 18:21           ` Glauber Costa
  2013-01-18  0:10             ` Dave Chinner
  0 siblings, 1 reply; 57+ messages in thread
From: Glauber Costa @ 2013-01-17 18:21 UTC (permalink / raw)
  To: Dave Chinner
  Cc: linux-kernel, xfs, linux-mm, Ying Han, Suleiman Souhlal,
	linux-fsdevel, Greg Thelen


>> Deepest fears:
>>
>> 1) snakes.
> 
> Snakes are merely poisonous. Drop Bears are far more dangerous :P
> 

fears are irrational anyway...

>> 2) It won't surprise you to know that I am adapting your work, which
>> provides a very sane and helpful API, to memcg shrinking.
>>
>> The dumb and simple approach in there is to copy all lrus that are
>> marked memcg aware at memcg creation time. The API is kept the same,
>> but when you do something like list_lru_add(lru, obj), for instance, we
>> derive the memcg context from obj and relay it to the right list.
> 
> At which point, you don't want the overhead of per-node lists.
> 

This is one of the assumptions we may have to end up making here.

> The problem I see here is that someone might still need the
> scalability of the per-node lists. If someone runs a large memcg in terms
> of CPU and memory, then we most definitely are going to need to
> retain per-node lists regardless of the fact that the workload is
> running in a constrained environment. And if you are running a mix
> of large and small containers, then one static solution is not going
> to cut it for some workload.
> 

Yes, this is spot on. As a first approach, I think single-node lists
will do.

> This is a problem that superblock contexts don't care about - they
> are global by their very nature. Hence I'm wondering if trying to
> fit these two very different behaviours into the one LRU list is
> the wrong approach.
> 

I am not that much concerned about that, honestly. I like the API, and I
like the fact that it allows me to have the subsystems using it
transparently, just by referring to the "master" lru (the dentry, inode,
etc). Reusing the data structures reduces complexity, but that is
not paramount.

However, a more flexible data structure, in which we could select at
least at creation time whether we want per-node lists or not, would be
quite helpful.

And it seems it would be at least moderately helpful to you as well for
usually-small filesystems, so I think it would be a good addition to the
next version.

> Consider this: these patches give us a generic LRU list structure.
> It currently uses a list_head in each object for indexing, and we
> are talking about single LRU lists because of this limitation and
> trying to build infrastructure that can support this indexing
> mechanism.
> 
> I think that all of thses problems go away if we replace the
> list_head index in the object with a "struct lru_item" index. To
> start with, it's just a s/list_head/lru_item/ changeover, but from
> there we can expand.
> 
> What I'm getting at is that we want to have multiple axis of
> tracking and reclaim, but we only have a single axis for tracking.
> If the lru_item grew a second list_head called "memcg_lru", then
> suddenly the memcg LRUs can be maintained separately to the global
> (per-superblock) LRU. i.e.:
> 
> struct lru_item {
> 	struct list_head global_list;
> 	struct list_head memcg_list;
> }
> 

I may be misunderstanding you, but that is not how I see it. Your global
list, AFAIU, is more like a hook to keep the lists together. The actual
accesses to it are controlled by a parent structure, like the
super-block, which in turn embeds a shrinker.

So we get (in the sb case) from shrinker to sb, and from sb to the dentry
list (or inode list). We never care about the global list head.

From this point on, we "entered" the LRU, but we still don't know which
list to reclaim from: there is one list per node, and we need to figure
out which is our target, based on the flags.

This list selection mechanism is where I usually hook memcg in: in
the same way that you are using an array - given a node, you want fast
access to the underlying list - so am I. Given the memcg context, I want
to get to the corresponding memcg list.

Now, in my earliest implementations, the memcg would still take me to a
node-wide array, and an extra level would be required. We seem to agree
that (at least as a starting point) getting rid of this extra level, so
that the memcg collapses all objects into the same list, would provide
decent behavior in most cases, while still keeping the footprint
manageable. So that is what I am pursuing at the moment.


> And then you can use whatever tracking structure you want for a
> memcg LRU list. Indeed, this would allow per-node lists for the
> global LRU, and whatever LRU type is appropriate for the memcg using
> the object (e.g. single list for small memcgs, per-node for large
> memcgs)....
> 

Yes. You can have either a small number of big memcgs or a big number of
small memcgs. So if we adopt per-node scalability selectively, our
memory usage is always bounded by #memcgs x avg_size. It works perfectly.

> i.e. rather than trying to make the infrastructure jump through hoops
> to only have one LRU index per object, have a second index that
> allows memcg's to have a separate LRU index and a method for the
> global LRU structure to find them. This woul dallow memcg specific
> shrinker callouts switch to the memcg LRU rather than the global LRU
> and operate on that. That way we still only instantiate a single
> LRU/shrinker pair per cache context, but the memcg code doesn't need
> to duplicate the entire LRU infrastructure into every memcg that
> contains that type of object for that cache context....
> 
> /me stops rambling....
> 

>> Your current suggestion of going per-node only in the performance
>> critical filesystems could also possibly work, provided this count is
>> expected to be small.
> 
> The problem is deciding on a per filesystem basis. I was thinking
> that all filesytsems of a specific type would use a particular type
> of structure, not that specific instances of a filesystem could use
> different types....
> 

Yes, I referred to "critical" as an attribute of the fs class as well.
Most of the disk filesystems (if not all) would use per-node lists, and
I would guess most of the pseudo filesystems would do fine with a single
node.




^ permalink raw reply	[flat|nested] 57+ messages in thread

* Re: [PATCH 09/19] list_lru: per-node list infrastructure
  2013-01-17 18:21           ` Glauber Costa
@ 2013-01-18  0:10             ` Dave Chinner
  2013-01-18  0:14               ` Glauber Costa
  2013-01-18  0:51               ` Glauber Costa
  0 siblings, 2 replies; 57+ messages in thread
From: Dave Chinner @ 2013-01-18  0:10 UTC (permalink / raw)
  To: Glauber Costa
  Cc: linux-kernel, linux-fsdevel, linux-mm, xfs, Greg Thelen, Ying Han,
	Suleiman Souhlal

On Thu, Jan 17, 2013 at 10:21:12AM -0800, Glauber Costa wrote:
> >> Deepest fears:
> >>
> >> 1) snakes.
> > 
> > Snakes are merely poisonous. Drop Bears are far more dangerous :P
> 
> fears are irrational anyway...
> 
> >> 2) It won't surprise you to know that I am adapting your work, which
> >> provides a very sane and helpful API, to memcg shrinking.
> >>
> >> The dumb and simple approach in there is to copy all lrus that are
> >> marked memcg aware at memcg creation time. The API is kept the same,
> >> but when you do something like list_lru_add(lru, obj), for instance, we
> >> derive the memcg context from obj and relay it to the right list.
> > 
> > At which point, you don't want the overhead of per-node lists.
> 
> This is one of the assumptions we may have to end up doing here.

*nod*. Good to get it out in the open to see if we can work around
it....

> > This is a problem that superblock contexts don't care about - they
> > are global by their very nature. Hence I'm wondering if trying to
> > fit these two very different behaviours into the one LRU list is
> > the wrong approach.
> > 
> 
> I am not that much concerned about that, honestly. I like the API, and I
> like the fact that it allow me to have the subsystems using it
> transparently, just by referring to the "master" lru (the dentry, inode,
> etc). It reduces complexity to reuse the data structures, but that is
> not paramount.
> 
> However, a more flexible data structure in which we could select at
> least at creation time if we want per-node lists or not, would be quite
> helpful.

*nod*

> > Consider this: these patches give us a generic LRU list structure.
> > It currently uses a list_head in each object for indexing, and we
> > are talking about single LRU lists because of this limitation and
> > trying to build infrastructure that can support this indexing
> > mechanism.
> > 
> > I think that all of thses problems go away if we replace the
> > list_head index in the object with a "struct lru_item" index. To
> > start with, it's just a s/list_head/lru_item/ changeover, but from
> > there we can expand.
> > 
> > What I'm getting at is that we want to have multiple axis of
> > tracking and reclaim, but we only have a single axis for tracking.
> > If the lru_item grew a second list_head called "memcg_lru", then
> > suddenly the memcg LRUs can be maintained separately to the global
> > (per-superblock) LRU. i.e.:
> > 
> > struct lru_item {
> > 	struct list_head global_list;
> > 	struct list_head memcg_list;
> > }
> > 
> 
> I may be misunderstanding you, but that is not how I see it. Your global
> list AFAIU, is more like a hook to keep the lists together. The actual
> accesses to it are controlled by a parent structure, like the
> super-block, which in turns, embeds a shrinker.
> 
> So we get (in the sb case), from shrinker to sb, and from sb to dentry
> list (or inode). We never care about the global list head.
> 
> From this point on, we "entered" the LRU, but we still don't know which
> list to reclaim from: there is one list per node, and we need to figure
> out which is our target, based on the flags.
> 
> This list selection mechanism is where I am usually hooking memcg: and
> for the same way you are using an array - given a node, you want fast
> access to the underlying list - so am I. Given the memcg context, I want
> to get to the corresponding memcg list.
> 
> Now, in my earliest implementations, the memcg would still take me to a
> node-wide array, and an extra level would be required. We seem to agree
> that (at least as a starting point) getting rid of this extra level, so
> the memcg colapses all objects in the same list would provide decent
> behavior in most cases, while still keeping the footprint manageable. So
> that is what I am pursuing at the moment.

Ah, I think that maybe you misunderstood. There are two main
triggers for reclaim: global memory is short, or a memcg is short on
memory.

To find appropriate objects quickly for reclaim, we need objects on
appropriate lists. E.g. if we are doing global reclaim (e.g. from
kswapd) it means a node is short of memory and needs more. Hence
just walking a per-node list is the most efficient method of doing
this. Having to walk all the memcg LRU lists to find objects on a
specific node is not feasible. OTOH, if we are doing memcg specific
reclaim, the opposite is true.

So, if we have:

struct lru_list_head {
	struct list_head	head;
	spinlock_t		lock;
	u64			nr_items;
}

struct lru_list {
	struct lru_list_head	*per_node;
	int			numnodes;
	nodemask_t		active_nodes;
	void			*memcg_lists;	/* managed by memcg code */
	....
}

lru_list_init(struct lru_list *lru, bool per_node)
{
	lru->numnodes = 1;
	if (per_node)
		lru->numnodes = NRNODES;
	lru->per_node = alloc(lru->numnodes * sizeof(struct lru_list_head));
	....
}

And then each object uses:

struct lru_item {
	struct list_head global_list;
	struct list_head memcg_list;
}

and we end up with:

lru_add(struct lru_list *lru, struct lru_item *item)
{
	node_id = min(object_to_nid(item), lru->numnodes - 1);

	__lru_add(lru, node_id, &item->global_list);
	if (memcg) {
		memcg_lru = find_memcg_lru(lru->memcg_lists, memcg_id);
		__lru_add(memcg_lru, node_id, &item->memcg_list);
	}
}

Then when it comes to reclaim, the reclaim information passed to the
shrinker needs to indicate either the node to reclaim from or the
memcg_id (or both), so that reclaim can walk the appropriate list to
find objects to reclaim. Then we delete them from both lists and
reclaim the object....

And the memcg lists can instantiate new struct lru_list and index
them appropriately according to some <handwave> criteria. Then all
the generic LRU code cares about is that the memcg lookup returns
the correct struct lru_list for it to operate on...
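
IOWs, the shrinker callout ends up doing something like this - again
just a sketch, reusing the made-up helpers and fields from above
(find_memcg_lru(), per_node[], numnodes):

	static long lru_count(struct lru_list *lru, int nid, int memcg_id)
	{
		/* memcg reclaim walks the memcg index, global reclaim
		 * walks the per-node global index */
		if (memcg_id >= 0)
			lru = find_memcg_lru(lru->memcg_lists, memcg_id);

		nid = min(nid, lru->numnodes - 1);
		return lru->per_node[nid].nr_items;
	}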

Cheers,

Dave.
-- 
Dave Chinner
david@fromorbit.com


^ permalink raw reply	[flat|nested] 57+ messages in thread

* Re: [PATCH 09/19] list_lru: per-node list infrastructure
  2013-01-18  0:10             ` Dave Chinner
@ 2013-01-18  0:14               ` Glauber Costa
  2013-01-18  8:11                 ` Dave Chinner
  2013-01-18  0:51               ` Glauber Costa
  1 sibling, 1 reply; 57+ messages in thread
From: Glauber Costa @ 2013-01-18  0:14 UTC (permalink / raw)
  To: Dave Chinner
  Cc: linux-kernel, linux-fsdevel, linux-mm, xfs, Greg Thelen, Ying Han,
	Suleiman Souhlal

On 01/17/2013 04:10 PM, Dave Chinner wrote:
> And then each object uses:
> 
> struct lru_item {
> 	struct list_head global_list;
> 	struct list_head memcg_list;
> }
by objects you mean dentries, inodes, and such, right?
Would it be acceptable to you?

We've been of course doing our best to avoid increasing the size of the
objects, therefore this is something we've never mentioned. However, if
it would be acceptable from the fs POV, this would undoubtedly make our
lives much easier.


^ permalink raw reply	[flat|nested] 57+ messages in thread

* Re: [PATCH 09/19] list_lru: per-node list infrastructure
  2013-01-18  0:10             ` Dave Chinner
  2013-01-18  0:14               ` Glauber Costa
@ 2013-01-18  0:51               ` Glauber Costa
  2013-01-18  8:08                 ` Dave Chinner
  1 sibling, 1 reply; 57+ messages in thread
From: Glauber Costa @ 2013-01-18  0:51 UTC (permalink / raw)
  To: Dave Chinner
  Cc: linux-kernel, xfs, linux-mm, Ying Han, Suleiman Souhlal,
	linux-fsdevel, Greg Thelen

On 01/17/2013 04:10 PM, Dave Chinner wrote:
> and we end up with:
> 
> lru_add(struct lru_list *lru, struct lru_item *item)
> {
> 	node_id = min(object_to_nid(item), lru->numnodes);
> 	
> 	__lru_add(lru, node_id, &item->global_list);
> 	if (memcg) {
> 		memcg_lru = find_memcg_lru(lru->memcg_lists, memcg_id)
> 		__lru_add_(memcg_lru, node_id, &item->memcg_list);
> 	}
> }

A follow-up thought: if we have multiple memcgs and global pressure
kicks in (meaning none of them is particularly under pressure),
shouldn't we try to maintain fairness among them and reclaim equal
proportions from them all, the same way we do with sb's these days, for
instance?

I would argue that if your memcg is small, the list of dentries is
small: scanning it all for the nodes you want shouldn't hurt.

If the memcg is big, it will have per-node lists anyway.

Given that, do we really want to pay the price of two list_heads in the
objects?



^ permalink raw reply	[flat|nested] 57+ messages in thread

* Re: [PATCH 09/19] list_lru: per-node list infrastructure
  2013-01-18  0:51               ` Glauber Costa
@ 2013-01-18  8:08                 ` Dave Chinner
  2013-01-18 19:01                   ` Glauber Costa
  0 siblings, 1 reply; 57+ messages in thread
From: Dave Chinner @ 2013-01-18  8:08 UTC (permalink / raw)
  To: Glauber Costa
  Cc: linux-kernel, linux-fsdevel, linux-mm, xfs, Greg Thelen, Ying Han,
	Suleiman Souhlal

On Thu, Jan 17, 2013 at 04:51:03PM -0800, Glauber Costa wrote:
> On 01/17/2013 04:10 PM, Dave Chinner wrote:
> > and we end up with:
> > 
> > lru_add(struct lru_list *lru, struct lru_item *item)
> > {
> > 	node_id = min(object_to_nid(item), lru->numnodes);
> > 	
> > 	__lru_add(lru, node_id, &item->global_list);
> > 	if (memcg) {
> > 		memcg_lru = find_memcg_lru(lru->memcg_lists, memcg_id)
> > 		__lru_add_(memcg_lru, node_id, &item->memcg_list);
> > 	}
> > }
> 
> A follow up thought: If we have multiple memcgs, and global pressure
> kicks in (meaning none of them are particularly under pressure),
> shouldn't we try to maintain fairness among them and reclaim equal
> proportions from them all the same way we do with sb's these days, for
> instance?

I don't like the complexity. The global lists will be reclaimed in
LRU order, so it's going to be as fair as can be. If there's a memcg
that has older unused objects than the others, then from a global
perspective they should be reclaimed first because the memcg is not
using them...

> I would argue that if your memcg is small, the list of dentries is
> small: scan it all for the nodes you want shouldn't hurt.

On the contrary - the memcg might be small, but what happens if
someone ran a find across all the filesystems on the system in it?
Then the LRU will be huge, and scanning it expensive...

We can't make static decisions about small and large, and we can't
trust heuristics to get it right, either. If we have a single list,
we don't/can't do node-aware reclaim efficiently and so shouldn't
even try.

> if the memcg is big, it will have per-node lists anyway.

But it may have no need for them due to the workload. ;)

> Given that, do we really want to pay the price of two list_heads
> in the objects?

I'm just looking at ways of making the infrastructure sane. If the
cost is an extra 16 bytes per object on an LRU, then that's a small
price to pay for having robust memory reclaim infrastructure....

Cheers,

Dave.
-- 
Dave Chinner
david@fromorbit.com


^ permalink raw reply	[flat|nested] 57+ messages in thread

* Re: [PATCH 09/19] list_lru: per-node list infrastructure
  2013-01-18  0:14               ` Glauber Costa
@ 2013-01-18  8:11                 ` Dave Chinner
  2013-01-18 19:10                   ` Glauber Costa
  0 siblings, 1 reply; 57+ messages in thread
From: Dave Chinner @ 2013-01-18  8:11 UTC (permalink / raw)
  To: Glauber Costa
  Cc: linux-kernel, linux-fsdevel, linux-mm, xfs, Greg Thelen, Ying Han,
	Suleiman Souhlal

On Thu, Jan 17, 2013 at 04:14:10PM -0800, Glauber Costa wrote:
> On 01/17/2013 04:10 PM, Dave Chinner wrote:
> > And then each object uses:
> > 
> > struct lru_item {
> > 	struct list_head global_list;
> > 	struct list_head memcg_list;
> > }
> by objects you mean dentries, inodes, and the such, right?

Yup.

> Would it be acceptable to you?

If it works the way I think it should, then yes.

> We've been of course doing our best to avoid increasing the size of the
> objects, therefore this is something we've never mentioned. However, if
> it would be acceptable from the fs POV, this would undoubtedly make our
> life extremely easier.

I've been trying hard to work out how to avoid increasing the size
of structures as well. But if we can't work out how to implement
something sanely with only a single list head per object to work
from, then increasing the size of objects is something that we need
to consider if it solves all the problems we are trying to solve.

i.e. if adding a second list head makes the code dumb, simple,
obviously correct and hard to break then IMO it's a no-brainer.
But we have to tick all the right boxes first...

Cheers,

Dave.
-- 
Dave Chinner
david@fromorbit.com


^ permalink raw reply	[flat|nested] 57+ messages in thread

* Re: [PATCH 09/19] list_lru: per-node list infrastructure
  2013-01-18  8:08                 ` Dave Chinner
@ 2013-01-18 19:01                   ` Glauber Costa
  0 siblings, 0 replies; 57+ messages in thread
From: Glauber Costa @ 2013-01-18 19:01 UTC (permalink / raw)
  To: Dave Chinner
  Cc: linux-kernel, linux-fsdevel, linux-mm, xfs, Greg Thelen, Ying Han,
	Suleiman Souhlal

On 01/18/2013 12:08 AM, Dave Chinner wrote:
> On Thu, Jan 17, 2013 at 04:51:03PM -0800, Glauber Costa wrote:
>> On 01/17/2013 04:10 PM, Dave Chinner wrote:
>>> and we end up with:
>>>
>>> lru_add(struct lru_list *lru, struct lru_item *item)
>>> {
>>> 	node_id = min(object_to_nid(item), lru->numnodes);
>>> 	
>>> 	__lru_add(lru, node_id, &item->global_list);
>>> 	if (memcg) {
>>> 		memcg_lru = find_memcg_lru(lru->memcg_lists, memcg_id)
>>> 		__lru_add_(memcg_lru, node_id, &item->memcg_list);
>>> 	}
>>> }
>>
>> A follow up thought: If we have multiple memcgs, and global pressure
>> kicks in (meaning none of them are particularly under pressure),
>> shouldn't we try to maintain fairness among them and reclaim equal
>> proportions from them all the same way we do with sb's these days, for
>> instance?
> 
> I don't like the complexity. The global lists will be reclaimed in
> LRU order, so it's going to be as fair as can be. If there's a memcg
> that has older unused objectsi than the others, then froma global
> perspective they should be reclaimed first because the memcg is not
> using them...
> 

Disclaimer: I don't necessarily disagree with you, but let's explore the
playing field...

How do we know ?

The whole point of memcg is maintaining a best-effort isolation between
different logically separated entities.

If we need to reclaim 1k dentries, and your particular memcg has the
first 1k dentries in the LRU, it is not obvious that you should be hurt
more than others just because you have been idle for longer.

You could of course argue that under global reclaim we should lift our
fairness attempts in this case, especially given cost and complexity
trade-offs, but still.


>> I would argue that if your memcg is small, the list of dentries is
>> small: scan it all for the nodes you want shouldn't hurt.
> 
> on the contrary - the memcg might be small, but what happens if
> someone ran a find across all the filesytsems on the system in it?
> Then the LRU will be huge, and scanning expensive...
> 
The memcg being small means that the size of the list is limited, no?
It can never use more memory than X bytes, which translates to Y
entries. If you go over that, you will trigger reclaim against yourself.


> We can't make static decisions about small and large, and we can't
> trust heuristics to get it right, either. If we have a single list,
> we don't/can't do node-aware reclaim efficiently and so shouldn't
> even try.
> 
>> if the memcg is big, it will have per-node lists anyway.
> 
> But may have no need for them due to the workload. ;)
> 
sure.

>> Given that, do we really want to pay the price of two list_heads
>> in the objects?
> 
> I'm just looking at ways at making the infrastructure sane. If the
> cost is an extra 16 bytes per object on a an LRU, then that a small
> price to pay for having robust memory reclaim infrastructure....
> 

We still need to evaluate that against the option of using some sort of
dynamic node-list allocation, which would allow us to address online
nodes instead of possible nodes. That way, memory overhead may very well
be bounded tightly enough that everybody gets to be per-node.


^ permalink raw reply	[flat|nested] 57+ messages in thread

* Re: [PATCH 09/19] list_lru: per-node list infrastructure
  2013-01-18  8:11                 ` Dave Chinner
@ 2013-01-18 19:10                   ` Glauber Costa
  2013-01-19  0:10                     ` Dave Chinner
  0 siblings, 1 reply; 57+ messages in thread
From: Glauber Costa @ 2013-01-18 19:10 UTC (permalink / raw)
  To: Dave Chinner
  Cc: linux-kernel, linux-fsdevel, linux-mm, xfs, Greg Thelen, Ying Han,
	Suleiman Souhlal

On 01/18/2013 12:11 AM, Dave Chinner wrote:
> On Thu, Jan 17, 2013 at 04:14:10PM -0800, Glauber Costa wrote:
>> On 01/17/2013 04:10 PM, Dave Chinner wrote:
>>> And then each object uses:
>>>
>>> struct lru_item {
>>> 	struct list_head global_list;
>>> 	struct list_head memcg_list;
>>> }
>> by objects you mean dentries, inodes, and the such, right?
> 
> Yup.
> 
>> Would it be acceptable to you?
> 
> If it works the way I think it should, then yes.
> 
>> We've been of course doing our best to avoid increasing the size of the
>> objects, therefore this is something we've never mentioned. However, if
>> it would be acceptable from the fs POV, this would undoubtedly make our
>> life extremely easier.
> 
> I've been trying hard to work out how to avoid increasing the size
> of structures as well. But if we can't work out how to implement
> something sanely with only a single list head per object to work
> from, then increasing the size of objects is something that we need
> to consider if it solves all the problems we are trying to solve.
> 
> i.e. if adding a second list head makes the code dumb, simple,
> obviously correct and hard to break then IMO it's a no-brainer.
> But we have to tick all the right boxes first...
> 

One of our main efforts recently has been trying to reduce memcg impact
when it is not in use, even if it is compiled in. So what really bothers
me here is the fact that we are increasing the size of dentries and
inodes no matter what.

Still within the idea of exploring the playing field, would an
indirection be worth it ?
We would increase the total per-object memory usage by 8 bytes instead
of 16: the dentry gets a pointer, and a separate allocation for the
list_lru. When memcg is not being actively used, we can allocate a
smaller list_lru, with only the global list. When it is, we allocate the
whole thing.
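
Something like this is what I have in mind (sketch only, made-up names;
which variant to allocate is picked once, at object creation time):

	/* separate allocation, done at dentry/inode creation time */
	struct lru_links {
		struct list_head global_list;
		struct list_head memcg_list;	/* we would only allocate up
						 * to here when memcg is not
						 * in active use */
	};

	struct dentry {
		...
		struct lru_links *d_lru;	/* +8 bytes, instead of +16 */
		...
	};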

Embedding the lru is certainly easier from my PoV; the main question is:
how much do you, as a fs representative, value those 8 bytes?


^ permalink raw reply	[flat|nested] 57+ messages in thread

* Re: [PATCH 09/19] list_lru: per-node list infrastructure
  2013-01-18 19:10                   ` Glauber Costa
@ 2013-01-19  0:10                     ` Dave Chinner
  2013-01-19  0:13                       ` Glauber Costa
  0 siblings, 1 reply; 57+ messages in thread
From: Dave Chinner @ 2013-01-19  0:10 UTC (permalink / raw)
  To: Glauber Costa
  Cc: linux-kernel, linux-fsdevel, linux-mm, xfs, Greg Thelen, Ying Han,
	Suleiman Souhlal

On Fri, Jan 18, 2013 at 11:10:00AM -0800, Glauber Costa wrote:
> On 01/18/2013 12:11 AM, Dave Chinner wrote:
> > On Thu, Jan 17, 2013 at 04:14:10PM -0800, Glauber Costa wrote:
> >> On 01/17/2013 04:10 PM, Dave Chinner wrote:
> >>> And then each object uses:
> >>>
> >>> struct lru_item {
> >>> 	struct list_head global_list;
> >>> 	struct list_head memcg_list;
> >>> }
> >> by objects you mean dentries, inodes, and the such, right?
> > 
> > Yup.
> > 
> >> Would it be acceptable to you?
> > 
> > If it works the way I think it should, then yes.
> > 
> >> We've been of course doing our best to avoid increasing the size of the
> >> objects, therefore this is something we've never mentioned. However, if
> >> it would be acceptable from the fs POV, this would undoubtedly make our
> >> life extremely easier.
> > 
> > I've been trying hard to work out how to avoid increasing the size
> > of structures as well. But if we can't work out how to implement
> > something sanely with only a single list head per object to work
> > from, then increasing the size of objects is something that we need
> > to consider if it solves all the problems we are trying to solve.
> > 
> > i.e. if adding a second list head makes the code dumb, simple,
> > obviously correct and hard to break then IMO it's a no-brainer.
> > But we have to tick all the right boxes first...
> > 
> 
> One of our main efforts recently has been trying to reduce memcg impact
> when it is not in use, even if its compiled in. So what really bothers
> me here is the fact that we are increasing the size of dentries and
> inodes no matter what.
> 
> Still within the idea of exploring the playing field, would an
> indirection be worth it ?
> We would increase the total per-object memory usage by 8 bytes instead
> of 16: the dentry gets a pointer, and a separate allocation for the
> list_lru.

A separate allocation is really not an option. We can't do an
allocation where dentries/inodes/other objects are added to the
LRU because they are under object state spinlocks, and adding a
potential memory allocation failure to the "add to lru" case is
pretty nasty, IMO.

> Embedding the lru is certainly easier from my PoV, the main question is
> how much do you, as a fs representative, value those 8 bytes?

Some things are worth using a few extra bytes for, and that's my point.
The question of whether memcg-enabled kernels need to grow objects
for sane memory management behaviour is what I'm trying to explore
here. We might end up deciding that a single list head is all that
is needed, but let's not constrain our thinking to "we must only use
elements existing in the current structures" and hence rule out
solutions that may work far better than any single-list algorithm
possibly can, without having first explored them....

Cheers,

Dave.
-- 
Dave Chinner
david@fromorbit.com


^ permalink raw reply	[flat|nested] 57+ messages in thread

* Re: [PATCH 09/19] list_lru: per-node list infrastructure
  2013-01-19  0:10                     ` Dave Chinner
@ 2013-01-19  0:13                       ` Glauber Costa
  0 siblings, 0 replies; 57+ messages in thread
From: Glauber Costa @ 2013-01-19  0:13 UTC (permalink / raw)
  To: Dave Chinner
  Cc: linux-kernel, linux-fsdevel, linux-mm, xfs, Greg Thelen, Ying Han,
	Suleiman Souhlal

On 01/18/2013 04:10 PM, Dave Chinner wrote:
> On Fri, Jan 18, 2013 at 11:10:00AM -0800, Glauber Costa wrote:
>> On 01/18/2013 12:11 AM, Dave Chinner wrote:
>>> On Thu, Jan 17, 2013 at 04:14:10PM -0800, Glauber Costa wrote:
>>>> On 01/17/2013 04:10 PM, Dave Chinner wrote:
>>>>> And then each object uses:
>>>>>
>>>>> struct lru_item {
>>>>> 	struct list_head global_list;
>>>>> 	struct list_head memcg_list;
>>>>> }
>>>> by objects you mean dentries, inodes, and the such, right?
>>>
>>> Yup.
>>>
>>>> Would it be acceptable to you?
>>>
>>> If it works the way I think it should, then yes.
>>>
>>>> We've been of course doing our best to avoid increasing the size of the
>>>> objects, therefore this is something we've never mentioned. However, if
>>>> it would be acceptable from the fs POV, this would undoubtedly make our
>>>> life extremely easier.
>>>
>>> I've been trying hard to work out how to avoid increasing the size
>>> of structures as well. But if we can't work out how to implement
>>> something sanely with only a single list head per object to work
>>> from, then increasing the size of objects is something that we need
>>> to consider if it solves all the problems we are trying to solve.
>>>
>>> i.e. if adding a second list head makes the code dumb, simple,
>>> obviously correct and hard to break then IMO it's a no-brainer.
>>> But we have to tick all the right boxes first...
>>>
>>
>> One of our main efforts recently has been trying to reduce memcg impact
>> when it is not in use, even if its compiled in. So what really bothers
>> me here is the fact that we are increasing the size of dentries and
>> inodes no matter what.
>>
>> Still within the idea of exploring the playing field, would an
>> indirection be worth it ?
>> We would increase the total per-object memory usage by 8 bytes instead
>> of 16: the dentry gets a pointer, and a separate allocation for the
>> list_lru.
> 
> A separate allocation is really not an option. We can't do an
> allocation in where dentries/inodes/other objects are added to the
> LRU because they are under object state spinlocks, and adding a
> potential memory allocation failure to the "add to lru" case is
> pretty nasty, IMO.
> 

That would of course happen at dentry creation time, not LRU add time.
It is totally possible since, at creation time, we already know whether
memcg is enabled or not.


^ permalink raw reply	[flat|nested] 57+ messages in thread

* Re: [RFC, PATCH 00/19] Numa aware LRU lists and shrinkers
  2012-11-27 23:14 [RFC, PATCH 00/19] Numa aware LRU lists and shrinkers Dave Chinner
                   ` (20 preceding siblings ...)
  2012-12-20 11:45 ` Glauber Costa
@ 2013-01-21 16:08 ` Glauber Costa
  2013-01-21 23:21   ` Dave Chinner
  21 siblings, 1 reply; 57+ messages in thread
From: Glauber Costa @ 2013-01-21 16:08 UTC (permalink / raw)
  To: Dave Chinner; +Cc: linux-fsdevel, linux-mm, Johannes Weiner, linux-kernel, xfs

On 11/28/2012 03:14 AM, Dave Chinner wrote:
> [PATCH 09/19] list_lru: per-node list infrastructure
> 
> This makes the generic LRU list much more scalable by changing it to
> a {list,lock,count} tuple per node. There are no external API
> changes to this changeover, so is transparent to current users.
> 
> [PATCH 10/19] shrinker: add node awareness
> [PATCH 11/19] fs: convert inode and dentry shrinking to be node
> 
> Adds a nodemask to the struct shrink_control for callers of
> shrink_slab to set appropriately for their reclaim context. This
> nodemask is then passed by the inode and dentry cache reclaim code
> to the generic LRU list code to implement node aware shrinking.

I have a follow-up question that popped up from a discussion between me
and my very American friend Johnny Wheeler, also known as Johannes
Weiner (CC'd). I actually remember us discussing this, but I don't fully
remember the outcome. And since I can't find it anywhere, it must have
been in a medium other than e-mail. So I thought it would do no harm to
at least document it...

Why are we doing this per-node, instead of per-zone?

It seems to me that the goal is to collapse all zones of a node into a
single list, but since the number of zones is not terribly larger than
the number of nodes, and zones are where the pressure comes from, what do
we really gain from this?


^ permalink raw reply	[flat|nested] 57+ messages in thread

* Re: [RFC, PATCH 00/19] Numa aware LRU lists and shrinkers
  2013-01-21 16:08 ` Glauber Costa
@ 2013-01-21 23:21   ` Dave Chinner
  2013-01-23 14:36     ` Glauber Costa
  0 siblings, 1 reply; 57+ messages in thread
From: Dave Chinner @ 2013-01-21 23:21 UTC (permalink / raw)
  To: Glauber Costa; +Cc: linux-fsdevel, linux-mm, Johannes Weiner, linux-kernel, xfs

On Mon, Jan 21, 2013 at 08:08:53PM +0400, Glauber Costa wrote:
> On 11/28/2012 03:14 AM, Dave Chinner wrote:
> > [PATCH 09/19] list_lru: per-node list infrastructure
> > 
> > This makes the generic LRU list much more scalable by changing it to
> > a {list,lock,count} tuple per node. There are no external API
> > changes to this changeover, so is transparent to current users.
> > 
> > [PATCH 10/19] shrinker: add node awareness
> > [PATCH 11/19] fs: convert inode and dentry shrinking to be node
> > 
> > Adds a nodemask to the struct shrink_control for callers of
> > shrink_slab to set appropriately for their reclaim context. This
> > nodemask is then passed by the inode and dentry cache reclaim code
> > to the generic LRU list code to implement node aware shrinking.
> 
> I have a follow up question that popped up from a discussion between me
> and my very American friend Johnny Wheeler, also known as Johannes
> Weiner (CC'd). I actually remember we discussing this, but don't fully
> remember the outcome. And since I can't find it anywhere, it must have
> been in a media other than e-mail. So I thought it would do no harm in
> at least documenting it...
> 
> Why are we doing this per-node, instead of per-zone?
> 
> It seems to me that the goal is to collapse all zones of a node into a
> single list, but since the number of zones is not terribly larger than
> the number of nodes, and zones is where the pressure comes from, what do
> we really gain from this?

The number is quite a bit higher - there are platforms with 5 zones
to a node. The reality is, though, that for most platforms slab
allocations come from a single zone - they never come from ZONE_DMA,
ZONE_HIGHMEM or ZONE_MOVABLE, so there is no good reason
for having cache LRUs for these zones. So, two zones at most.

And then there's the complexity issue - it's simple/trivial to use
per-node lists, node masks, etc. It's an obvious abstraction that
everyone understands, is simple to understand, achieves exactly the
purpose that is needed and is not tied to the /current/
implementation of the VM memory management code.

I don't see any good reason for tying LRUs to MM zones. The
original implementation of the per-node shrinkers by Nick Piggin did
this: the LRUs for the dentry and inode caches were embedded in the
struct zone, and it wasn't generically extensible because of that.
i.e. node-aware shrinkers were directly influenced by the zone
infrastructure, and so the internal implementation of the mm
subsystem started leaking out and determining how completely
unrelated subsystems need to implement their own cache
management.....

Cheers,

Dave.
-- 
Dave Chinner
david@fromorbit.com

^ permalink raw reply	[flat|nested] 57+ messages in thread

* Re: [RFC, PATCH 00/19] Numa aware LRU lists and shrinkers
  2013-01-21 23:21   ` Dave Chinner
@ 2013-01-23 14:36     ` Glauber Costa
  2013-01-23 23:46       ` Dave Chinner
  0 siblings, 1 reply; 57+ messages in thread
From: Glauber Costa @ 2013-01-23 14:36 UTC (permalink / raw)
  To: Dave Chinner; +Cc: linux-kernel, linux-fsdevel, linux-mm, xfs, Johannes Weiner

On 01/22/2013 03:21 AM, Dave Chinner wrote:
> On Mon, Jan 21, 2013 at 08:08:53PM +0400, Glauber Costa wrote:
>> On 11/28/2012 03:14 AM, Dave Chinner wrote:
>>> [PATCH 09/19] list_lru: per-node list infrastructure
>>>
>>> This makes the generic LRU list much more scalable by changing it to
>>> a {list,lock,count} tuple per node. There are no external API
>>> changes to this changeover, so is transparent to current users.
>>>
>>> [PATCH 10/19] shrinker: add node awareness
>>> [PATCH 11/19] fs: convert inode and dentry shrinking to be node
>>>
>>> Adds a nodemask to the struct shrink_control for callers of
>>> shrink_slab to set appropriately for their reclaim context. This
>>> nodemask is then passed by the inode and dentry cache reclaim code
>>> to the generic LRU list code to implement node aware shrinking.
>>
>> I have a follow up question that popped up from a discussion between me
>> and my very American friend Johnny Wheeler, also known as Johannes
>> Weiner (CC'd). I actually remember we discussing this, but don't fully
>> remember the outcome. And since I can't find it anywhere, it must have
>> been in a media other than e-mail. So I thought it would do no harm in
>> at least documenting it...
>>
>> Why are we doing this per-node, instead of per-zone?
>>
>> It seems to me that the goal is to collapse all zones of a node into a
>> single list, but since the number of zones is not terribly larger than
>> the number of nodes, and zones is where the pressure comes from, what do
>> we really gain from this?
> 
> The number is quite a bit higher - there are platforms with 5 zones
> to a node. The reality is, though, for most platforms slab
> allocations come from a single zone - they never come from ZONE_DMA,
> ZONE_HIGHMEM or ZONE_MOVEABLE, so there is there is no good reason
> for having cache LRUs for these zones. So, two zones at most.
> 
Yes, but one would expect that most of those special zones would be
present only in the first node, no? (correct me if I am wrong here).

Beyond that, things should be pretty much addressable.

^ permalink raw reply	[flat|nested] 57+ messages in thread

* Re: [RFC, PATCH 00/19] Numa aware LRU lists and shrinkers
  2013-01-23 14:36     ` Glauber Costa
@ 2013-01-23 23:46       ` Dave Chinner
  0 siblings, 0 replies; 57+ messages in thread
From: Dave Chinner @ 2013-01-23 23:46 UTC (permalink / raw)
  To: Glauber Costa; +Cc: linux-kernel, linux-fsdevel, linux-mm, xfs, Johannes Weiner

On Wed, Jan 23, 2013 at 06:36:33PM +0400, Glauber Costa wrote:
> On 01/22/2013 03:21 AM, Dave Chinner wrote:
> > On Mon, Jan 21, 2013 at 08:08:53PM +0400, Glauber Costa wrote:
> >> On 11/28/2012 03:14 AM, Dave Chinner wrote:
> >>> [PATCH 09/19] list_lru: per-node list infrastructure
> >>>
> >>> This makes the generic LRU list much more scalable by changing it to
> >>> a {list,lock,count} tuple per node. There are no external API
> >>> changes to this changeover, so is transparent to current users.
> >>>
> >>> [PATCH 10/19] shrinker: add node awareness
> >>> [PATCH 11/19] fs: convert inode and dentry shrinking to be node
> >>>
> >>> Adds a nodemask to the struct shrink_control for callers of
> >>> shrink_slab to set appropriately for their reclaim context. This
> >>> nodemask is then passed by the inode and dentry cache reclaim code
> >>> to the generic LRU list code to implement node aware shrinking.
> >>
> >> I have a follow up question that popped up from a discussion between me
> >> and my very American friend Johnny Wheeler, also known as Johannes
> >> Weiner (CC'd). I actually remember we discussing this, but don't fully
> >> remember the outcome. And since I can't find it anywhere, it must have
> >> been in a media other than e-mail. So I thought it would do no harm in
> >> at least documenting it...
> >>
> >> Why are we doing this per-node, instead of per-zone?
> >>
> >> It seems to me that the goal is to collapse all zones of a node into a
> >> single list, but since the number of zones is not terribly larger than
> >> the number of nodes, and zones is where the pressure comes from, what do
> >> we really gain from this?
> > 
> > The number is quite a bit higher - there are platforms with 5 zones
> > to a node. The reality is, though, for most platforms slab
> > allocations come from a single zone - they never come from ZONE_DMA,
> > ZONE_HIGHMEM or ZONE_MOVEABLE, so there is there is no good reason
> > for having cache LRUs for these zones. So, two zones at most.
> > 
> Yes, but one would expect that most of those special zones would be
> present only in the first node, no? (correct me if I am wrong here).

As I understand it, every node has an identical zone setup (i.e. a
flat array of MAX_NR_ZONES zones in the struct pglist_data), and
pages are simply placed in the appropriate zones on each node...

Also, IIUC, the behaviour of the zones on each node is architecture
dependent, so we can't make assumptions that certain zones are only
ever used on the first node...
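
i.e. roughly what include/linux/mmzone.h declares (abridged here) -
every pg_data_t carries the full zone array whether or not those
zones actually have any pages on that node:

typedef struct pglist_data {
        struct zone node_zones[MAX_NR_ZONES];   /* full array on every node */
        struct zonelist node_zonelists[MAX_ZONELISTS];
        int nr_zones;                           /* zones actually populated */
        /* ... */
} pg_data_t;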

Cheers,

Dave.
-- 
Dave Chinner
david@fromorbit.com

^ permalink raw reply	[flat|nested] 57+ messages in thread

* Re: [PATCH 17/19] drivers: convert shrinkers to new count/scan API
  2012-11-27 23:14 ` [PATCH 17/19] drivers: convert shrinkers to new count/scan API Dave Chinner
  2012-11-28  1:13   ` Chris Wilson
@ 2013-06-07 13:37   ` Konrad Rzeszutek Wilk
  1 sibling, 0 replies; 57+ messages in thread
From: Konrad Rzeszutek Wilk @ 2013-06-07 13:37 UTC (permalink / raw)
  To: Dave Chinner; +Cc: glommer, linux-kernel, linux-fsdevel, linux-mm, xfs

On Wed, Nov 28, 2012 at 10:14:44AM +1100, Dave Chinner wrote:
> From: Dave Chinner <dchinner@redhat.com>
> 
> Convert the driver shrinkers to the new API. Most changes are
> compile tested only because I either don't have the hardware or it's
> staging stuff.

I presume that the i915, ttm_page_alloc and ttm_page_alloc_dma
conversions were tested by you? They cover the most common graphics
drivers.

> 
> FWIW, the md and android code is pretty good, but the rest of it
> makes me want to claw my eyes out.  The amount of broken code I just
> encountered is mind boggling.  I've added comments explaining what
> is broken, but I fear that some of the code would be best dealt with
> by being dragged behind the bike shed, burying in mud up to it's
> neck and then run over repeatedly with a blunt lawn mower.
> 
> Special mention goes to the zcache/zcache2 drivers. They can't
> co-exist in the build at the same time, they are under different
> menu options in menuconfig, they only show up when you've got the
> right set of mm subsystem options configured and so even compile
> testing is an exercise in pulling teeth.  And that doesn't even take
> into account the horrible, broken code...

Hm, I was under the impression that there is only one zcache
codebase? Are you referring to ramster, perhaps?

> 
> Signed-off-by: Dave Chinner <dchinner@redhat.com>
> ---
>  drivers/gpu/drm/i915/i915_dma.c           |    4 +-
>  drivers/gpu/drm/i915/i915_gem.c           |   64 +++++++++++++++++++++-------
>  drivers/gpu/drm/ttm/ttm_page_alloc.c      |   48 ++++++++++++++-------
>  drivers/gpu/drm/ttm/ttm_page_alloc_dma.c  |   55 +++++++++++++++---------
>  drivers/md/dm-bufio.c                     |   65 +++++++++++++++++++----------
>  drivers/staging/android/ashmem.c          |   44 ++++++++++++-------
>  drivers/staging/android/lowmemorykiller.c |   60 +++++++++++++++++---------
>  drivers/staging/ramster/zcache-main.c     |   58 ++++++++++++++++++-------
>  drivers/staging/zcache/zcache-main.c      |   40 ++++++++++--------
>  9 files changed, 297 insertions(+), 141 deletions(-)
> 
> diff --git a/drivers/gpu/drm/i915/i915_dma.c b/drivers/gpu/drm/i915/i915_dma.c
> index 61ae104..0ddec32 100644
> --- a/drivers/gpu/drm/i915/i915_dma.c
> +++ b/drivers/gpu/drm/i915/i915_dma.c
> @@ -1658,7 +1658,7 @@ int i915_driver_load(struct drm_device *dev, unsigned long flags)
>  	return 0;
>  
>  out_gem_unload:
> -	if (dev_priv->mm.inactive_shrinker.shrink)
> +	if (dev_priv->mm.inactive_shrinker.scan_objects)
>  		unregister_shrinker(&dev_priv->mm.inactive_shrinker);
>  
>  	if (dev->pdev->msi_enabled)
> @@ -1695,7 +1695,7 @@ int i915_driver_unload(struct drm_device *dev)
>  
>  	i915_teardown_sysfs(dev);
>  
> -	if (dev_priv->mm.inactive_shrinker.shrink)
> +	if (dev_priv->mm.inactive_shrinker.scan_objects)
>  		unregister_shrinker(&dev_priv->mm.inactive_shrinker);
>  
>  	mutex_lock(&dev->struct_mutex);
> diff --git a/drivers/gpu/drm/i915/i915_gem.c b/drivers/gpu/drm/i915/i915_gem.c
> index 107f09b..ceab752 100644
> --- a/drivers/gpu/drm/i915/i915_gem.c
> +++ b/drivers/gpu/drm/i915/i915_gem.c
> @@ -53,8 +53,10 @@ static void i915_gem_object_update_fence(struct drm_i915_gem_object *obj,
>  					 struct drm_i915_fence_reg *fence,
>  					 bool enable);
>  
> -static int i915_gem_inactive_shrink(struct shrinker *shrinker,
> +static long i915_gem_inactive_count(struct shrinker *shrinker,
>  				    struct shrink_control *sc);
> +static long i915_gem_inactive_scan(struct shrinker *shrinker,
> +				   struct shrink_control *sc);
>  static long i915_gem_purge(struct drm_i915_private *dev_priv, long target);
>  static void i915_gem_shrink_all(struct drm_i915_private *dev_priv);
>  static void i915_gem_object_truncate(struct drm_i915_gem_object *obj);
> @@ -4197,7 +4199,8 @@ i915_gem_load(struct drm_device *dev)
>  
>  	dev_priv->mm.interruptible = true;
>  
> -	dev_priv->mm.inactive_shrinker.shrink = i915_gem_inactive_shrink;
> +	dev_priv->mm.inactive_shrinker.count_objects = i915_gem_inactive_count;
> +	dev_priv->mm.inactive_shrinker.scan_objects = i915_gem_inactive_scan;
>  	dev_priv->mm.inactive_shrinker.seeks = DEFAULT_SEEKS;
>  	register_shrinker(&dev_priv->mm.inactive_shrinker);
>  }
> @@ -4407,35 +4410,64 @@ void i915_gem_release(struct drm_device *dev, struct drm_file *file)
>  	spin_unlock(&file_priv->mm.lock);
>  }
>  
> -static int
> -i915_gem_inactive_shrink(struct shrinker *shrinker, struct shrink_control *sc)
> +/*
> + * XXX: (dchinner) This is one of the worst cases of shrinker abuse I've seen.
> + *
> + * i915_gem_purge() expects a byte count to be passed, and the minimum object
> + * size is PAGE_SIZE. The shrinker doesn't work on bytes - it works on
> + * *objects*. So it passes a nr_to_scan of 128 objects, which is interpreted
> + * here to mean "free 128 bytes". That means a single object will be freed, as
> + * the minimum object size is a page.
> + *
> + * But the craziest part comes when i915_gem_purge() has walked all the objects
> + * and can't free any memory. That results in i915_gem_shrink_all() being
> + * called, which idles the GPU and frees everything the driver has in it's
> + * active and inactive lists. It's basically hitting the driver with a great big
> + * hammer because it was busy doing stuff when something else generated memory
> + * pressure. This doesn't seem particularly wise...
> + */
> +static long
> +i915_gem_inactive_scan(struct shrinker *shrinker, struct shrink_control *sc)
>  {
>  	struct drm_i915_private *dev_priv =
>  		container_of(shrinker,
>  			     struct drm_i915_private,
>  			     mm.inactive_shrinker);
>  	struct drm_device *dev = dev_priv->dev;
> -	struct drm_i915_gem_object *obj;
> -	int nr_to_scan = sc->nr_to_scan;
> -	int cnt;
> +	long freed = 0;
>  
>  	if (!mutex_trylock(&dev->struct_mutex))
>  		return 0;
>  
> -	if (nr_to_scan) {
> -		nr_to_scan -= i915_gem_purge(dev_priv, nr_to_scan);
> -		if (nr_to_scan > 0)
> -			i915_gem_shrink_all(dev_priv);
> -	}
> +	freed = i915_gem_purge(dev_priv, sc->nr_to_scan);
> +	if (freed < sc->nr_to_scan)
> +		i915_gem_shrink_all(dev_priv);
> +
> +	mutex_unlock(&dev->struct_mutex);
> +	return freed;
> +}
> +
> +static long
> +i915_gem_inactive_count(struct shrinker *shrinker, struct shrink_control *sc)
> +{
> +	struct drm_i915_private *dev_priv =
> +		container_of(shrinker,
> +			     struct drm_i915_private,
> +			     mm.inactive_shrinker);
> +	struct drm_device *dev = dev_priv->dev;
> +	struct drm_i915_gem_object *obj;
> +	long count = 0;
> +
> +	if (!mutex_trylock(&dev->struct_mutex))
> +		return 0;
>  
> -	cnt = 0;
>  	list_for_each_entry(obj, &dev_priv->mm.unbound_list, gtt_list)
>  		if (obj->pages_pin_count == 0)
> -			cnt += obj->base.size >> PAGE_SHIFT;
> +			count += obj->base.size >> PAGE_SHIFT;
>  	list_for_each_entry(obj, &dev_priv->mm.bound_list, gtt_list)
>  		if (obj->pin_count == 0 && obj->pages_pin_count == 0)
> -			cnt += obj->base.size >> PAGE_SHIFT;
> +			count += obj->base.size >> PAGE_SHIFT;
>  
>  	mutex_unlock(&dev->struct_mutex);
> -	return cnt;
> +	return count;
>  }
> diff --git a/drivers/gpu/drm/ttm/ttm_page_alloc.c b/drivers/gpu/drm/ttm/ttm_page_alloc.c
> index bd2a3b4..83058a2 100644
> --- a/drivers/gpu/drm/ttm/ttm_page_alloc.c
> +++ b/drivers/gpu/drm/ttm/ttm_page_alloc.c
> @@ -377,28 +377,28 @@ out:
>  	return nr_free;
>  }
>  
> -/* Get good estimation how many pages are free in pools */
> -static int ttm_pool_get_num_unused_pages(void)
> -{
> -	unsigned i;
> -	int total = 0;
> -	for (i = 0; i < NUM_POOLS; ++i)
> -		total += _manager->pools[i].npages;
> -
> -	return total;
> -}
> -
>  /**
>   * Callback for mm to request pool to reduce number of page held.
> + *
> + * XXX: (dchinner) Deadlock warning!
> + *
> + * ttm_page_pool_free() does memory allocation using GFP_KERNEL.  that means
> + * this can deadlock when called a sc->gfp_mask that is not equal to
> + * GFP_KERNEL.
> + *
> + * This code is crying out for a shrinker per pool....
>   */
> -static int ttm_pool_mm_shrink(struct shrinker *shrink,
> -			      struct shrink_control *sc)
> +static long
> +ttm_pool_shrink_scan(
> +	struct shrinker		*shrink,
> +	struct shrink_control	*sc)

Please don't change the style.
>  {
>  	static atomic_t start_pool = ATOMIC_INIT(0);
>  	unsigned i;
>  	unsigned pool_offset = atomic_add_return(1, &start_pool);
>  	struct ttm_page_pool *pool;
>  	int shrink_pages = sc->nr_to_scan;
> +	long freed = 0;
>  
>  	pool_offset = pool_offset % NUM_POOLS;
>  	/* select start pool in round robin fashion */
> @@ -408,14 +408,30 @@ static int ttm_pool_mm_shrink(struct shrinker *shrink,
>  			break;
>  		pool = &_manager->pools[(i + pool_offset)%NUM_POOLS];
>  		shrink_pages = ttm_page_pool_free(pool, nr_free);
> +		freed += nr_free - shrink_pages;
>  	}
> -	/* return estimated number of unused pages in pool */
> -	return ttm_pool_get_num_unused_pages();
> +	return freed;
> +}
> +
> +
> +static long
> +ttm_pool_shrink_count(
> +	struct shrinker		*shrink,
> +	struct shrink_control	*sc)

Please don't change the style. Why did you move it down?

> +{
> +	unsigned i;
> +	long count = 0;
> +
> +	for (i = 0; i < NUM_POOLS; ++i)
> +		count += _manager->pools[i].npages;
> +
> +	return count;
>  }
>  
>  static void ttm_pool_mm_shrink_init(struct ttm_pool_manager *manager)
>  {
> -	manager->mm_shrink.shrink = &ttm_pool_mm_shrink;
> +	manager->mm_shrink.count_objects = &ttm_pool_shrink_count;
> +	manager->mm_shrink.scan_objects = &ttm_pool_shrink_scan;
>  	manager->mm_shrink.seeks = 1;
>  	register_shrinker(&manager->mm_shrink);
>  }
> diff --git a/drivers/gpu/drm/ttm/ttm_page_alloc_dma.c b/drivers/gpu/drm/ttm/ttm_page_alloc_dma.c
> index b8b3943..b3b4f99 100644
> --- a/drivers/gpu/drm/ttm/ttm_page_alloc_dma.c
> +++ b/drivers/gpu/drm/ttm/ttm_page_alloc_dma.c
> @@ -918,19 +918,6 @@ int ttm_dma_populate(struct ttm_dma_tt *ttm_dma, struct device *dev)
>  }
>  EXPORT_SYMBOL_GPL(ttm_dma_populate);
>  
> -/* Get good estimation how many pages are free in pools */
> -static int ttm_dma_pool_get_num_unused_pages(void)
> -{
> -	struct device_pools *p;
> -	unsigned total = 0;
> -
> -	mutex_lock(&_manager->lock);
> -	list_for_each_entry(p, &_manager->pools, pools)
> -		total += p->pool->npages_free;
> -	mutex_unlock(&_manager->lock);
> -	return total;
> -}
> -
>  /* Put all pages in pages list to correct pool to wait for reuse */
>  void ttm_dma_unpopulate(struct ttm_dma_tt *ttm_dma, struct device *dev)
>  {
> @@ -1002,18 +989,31 @@ EXPORT_SYMBOL_GPL(ttm_dma_unpopulate);
>  
>  /**
>   * Callback for mm to request pool to reduce number of page held.
> + *
> + * XXX: (dchinner) Deadlock warning!
> + *
> + * ttm_dma_page_pool_free() does GFP_KERNEL memory allocation, and so attention
> + * needs to be paid to sc->gfp_mask to determine if this can be done or not.
> + * GFP_KERNEL memory allocation in a GFP_ATOMIC reclaim context woul dbe really

would be
> + * bad.

It (ttm_dma_page_pool_free) could use GFP_ATOMIC, as the allocation is
just for an array of pages (so that we can iterate over all of them
in the list and find the candidates). At the end of
ttm_dma_page_pool_free it ends up freeing that array.

The same treatment can be applied to ttm_page_pool_free.
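
i.e. roughly this (untested sketch - the array is only scratch space
that gets freed again at the end of the pass):

        struct page **pages_to_free;

        /* scratch array of candidate pages; freed again below */
        pages_to_free = kmalloc(npages_to_free * sizeof(struct page *),
                                GFP_ATOMIC);
        if (!pages_to_free)
                return 0;

        /* ... collect the candidate pages and free them ... */

        kfree(pages_to_free);
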
> + *
> + * I'm getting sadder as I hear more pathetical whimpers about needing per-pool
> + * shrinkers

I am not entirely clear on how this comment is useful in the code.
>   */
> -static int ttm_dma_pool_mm_shrink(struct shrinker *shrink,
> -				  struct shrink_control *sc)
> +static long
> +ttm_dma_pool_shrink_scan(
> +	struct shrinker		*shrink,
> +	struct shrink_control	*sc)
>  {
>  	static atomic_t start_pool = ATOMIC_INIT(0);
>  	unsigned idx = 0;
>  	unsigned pool_offset = atomic_add_return(1, &start_pool);
>  	unsigned shrink_pages = sc->nr_to_scan;
>  	struct device_pools *p;
> +	long freed = 0;
>  
>  	if (list_empty(&_manager->pools))
> -		return 0;
> +		return -1;

Could there be a set of #defines for these values?
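
Something along these lines, perhaps (the name is made up, purely to
illustrate):

/* ->scan_objects() return value: no progress possible, don't retry */
#define SHRINK_SCAN_ABORT       (-1L)

        if (list_empty(&_manager->pools))
                return SHRINK_SCAN_ABORT;
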
>  
>  	mutex_lock(&_manager->lock);
>  	pool_offset = pool_offset % _manager->npools;
> @@ -1029,18 +1029,35 @@ static int ttm_dma_pool_mm_shrink(struct shrinker *shrink,
>  			continue;
>  		nr_free = shrink_pages;
>  		shrink_pages = ttm_dma_page_pool_free(p->pool, nr_free);
> +		freed += nr_free - shrink_pages;
> +
>  		pr_debug("%s: (%s:%d) Asked to shrink %d, have %d more to go\n",
>  			 p->pool->dev_name, p->pool->name, current->pid,
>  			 nr_free, shrink_pages);
>  	}
>  	mutex_unlock(&_manager->lock);
> -	/* return estimated number of unused pages in pool */
> -	return ttm_dma_pool_get_num_unused_pages();
> +	return freed;
> +}
> +
> +static long
> +ttm_dma_pool_shrink_count(
> +	struct shrinker		*shrink,
> +	struct shrink_control	*sc)

Again, please don't change the style, and also why move the code?

> +{
> +	struct device_pools *p;
> +	long count = 0;
> +
> +	mutex_lock(&_manager->lock);
> +	list_for_each_entry(p, &_manager->pools, pools)
> +		count += p->pool->npages_free;
> +	mutex_unlock(&_manager->lock);
> +	return count;
>  }
>  
>  static void ttm_dma_pool_mm_shrink_init(struct ttm_pool_manager *manager)
>  {
> -	manager->mm_shrink.shrink = &ttm_dma_pool_mm_shrink;
> +	manager->mm_shrink.count_objects = &ttm_dma_pool_shrink_count;
> +	manager->mm_shrink.scan_objects = &ttm_dma_pool_shrink_scan;
>  	manager->mm_shrink.seeks = 1;
>  	register_shrinker(&manager->mm_shrink);
>  }
> diff --git a/drivers/md/dm-bufio.c b/drivers/md/dm-bufio.c
> index 651ca79..0898bf5 100644
> --- a/drivers/md/dm-bufio.c
> +++ b/drivers/md/dm-bufio.c
> @@ -1359,62 +1359,80 @@ static int __cleanup_old_buffer(struct dm_buffer *b, gfp_t gfp,
>  				unsigned long max_jiffies)
>  {
>  	if (jiffies - b->last_accessed < max_jiffies)
> -		return 1;
> +		return 0;
>  
>  	if (!(gfp & __GFP_IO)) {
>  		if (test_bit(B_READING, &b->state) ||
>  		    test_bit(B_WRITING, &b->state) ||
>  		    test_bit(B_DIRTY, &b->state))
> -			return 1;
> +			return 0;
>  	}
>  
>  	if (b->hold_count)
> -		return 1;
> +		return 0;
>  
>  	__make_buffer_clean(b);
>  	__unlink_buffer(b);
>  	__free_buffer_wake(b);
>  
> -	return 0;
> +	return 1;
>  }
>  
> -static void __scan(struct dm_bufio_client *c, unsigned long nr_to_scan,
> -		   struct shrink_control *sc)
> +static long __scan(struct dm_bufio_client *c, unsigned long nr_to_scan,
> +		   gfp_t gfp_mask)
>  {
>  	int l;
>  	struct dm_buffer *b, *tmp;
> +	long freed = 0;
>  
>  	for (l = 0; l < LIST_SIZE; l++) {
> -		list_for_each_entry_safe_reverse(b, tmp, &c->lru[l], lru_list)
> -			if (!__cleanup_old_buffer(b, sc->gfp_mask, 0) &&
> -			    !--nr_to_scan)
> -				return;
> +		list_for_each_entry_safe_reverse(b, tmp, &c->lru[l], lru_list) {
> +			freed += __cleanup_old_buffer(b, gfp_mask, 0);
> +			if (!--nr_to_scan)
> +				break;
> +		}
>  		dm_bufio_cond_resched();
>  	}
> +	return freed;
>  }
>  
> -static int shrink(struct shrinker *shrinker, struct shrink_control *sc)
> +static long
> +dm_bufio_shrink_scan(
> +	struct shrinker		*shrink,
> +	struct shrink_control	*sc)
>  {
>  	struct dm_bufio_client *c =
> -	    container_of(shrinker, struct dm_bufio_client, shrinker);
> -	unsigned long r;
> -	unsigned long nr_to_scan = sc->nr_to_scan;
> +	    container_of(shrink, struct dm_bufio_client, shrinker);
> +	long freed;
>  
>  	if (sc->gfp_mask & __GFP_IO)
>  		dm_bufio_lock(c);
>  	else if (!dm_bufio_trylock(c))
> -		return !nr_to_scan ? 0 : -1;
> +		return -1;
>  
> -	if (nr_to_scan)
> -		__scan(c, nr_to_scan, sc);
> +	freed  = __scan(c, sc->nr_to_scan, sc->gfp_mask);
> +	dm_bufio_unlock(c);
> +	return freed;
> +}
>  
> -	r = c->n_buffers[LIST_CLEAN] + c->n_buffers[LIST_DIRTY];
> -	if (r > INT_MAX)
> -		r = INT_MAX;
> +static long
> +dm_bufio_shrink_count(
> +	struct shrinker		*shrink,
> +	struct shrink_control	*sc)
> +{
> +	struct dm_bufio_client *c =
> +	    container_of(shrink, struct dm_bufio_client, shrinker);
> +	long count;
> +
> +	if (sc->gfp_mask & __GFP_IO)
> +		dm_bufio_lock(c);
> +	else if (!dm_bufio_trylock(c))
> +		return 0;
>  
> +	count = c->n_buffers[LIST_CLEAN] + c->n_buffers[LIST_DIRTY];
>  	dm_bufio_unlock(c);
> +	return count;
>  
> -	return r;
>  }
>  
>  /*
> @@ -1516,7 +1534,8 @@ struct dm_bufio_client *dm_bufio_client_create(struct block_device *bdev, unsign
>  	__cache_size_refresh();
>  	mutex_unlock(&dm_bufio_clients_lock);
>  
> -	c->shrinker.shrink = shrink;
> +	c->shrinker.count_objects = dm_bufio_shrink_count;
> +	c->shrinker.scan_objects = dm_bufio_shrink_scan;
>  	c->shrinker.seeks = 1;
>  	c->shrinker.batch = 0;
>  	register_shrinker(&c->shrinker);
> @@ -1603,7 +1622,7 @@ static void cleanup_old_buffers(void)
>  			struct dm_buffer *b;
>  			b = list_entry(c->lru[LIST_CLEAN].prev,
>  				       struct dm_buffer, lru_list);
> -			if (__cleanup_old_buffer(b, 0, max_age * HZ))
> +			if (!__cleanup_old_buffer(b, 0, max_age * HZ))
>  				break;
>  			dm_bufio_cond_resched();
>  		}
> diff --git a/drivers/staging/android/ashmem.c b/drivers/staging/android/ashmem.c
> index 634b9ae..30f9f8e 100644
> --- a/drivers/staging/android/ashmem.c
> +++ b/drivers/staging/android/ashmem.c
> @@ -341,27 +341,28 @@ out:
>  /*
>   * ashmem_shrink - our cache shrinker, called from mm/vmscan.c :: shrink_slab
>   *
> - * 'nr_to_scan' is the number of objects (pages) to prune, or 0 to query how
> - * many objects (pages) we have in total.
> + * 'nr_to_scan' is the number of objects to scan for freeing.
>   *
>   * 'gfp_mask' is the mask of the allocation that got us into this mess.
>   *
> - * Return value is the number of objects (pages) remaining, or -1 if we cannot
> + * Return value is the number of objects freed or -1 if we cannot
>   * proceed without risk of deadlock (due to gfp_mask).
>   *
>   * We approximate LRU via least-recently-unpinned, jettisoning unpinned partial
>   * chunks of ashmem regions LRU-wise one-at-a-time until we hit 'nr_to_scan'
>   * pages freed.
>   */
> -static int ashmem_shrink(struct shrinker *s, struct shrink_control *sc)
> +static long
> +ashmem_shrink_scan(
> +	struct shrinker		*shrink,
> +	struct shrink_control	*sc)
>  {
>  	struct ashmem_range *range, *next;
> +	long freed = 0;
>  
>  	/* We might recurse into filesystem code, so bail out if necessary */
> -	if (sc->nr_to_scan && !(sc->gfp_mask & __GFP_FS))
> +	if (!(sc->gfp_mask & __GFP_FS))
>  		return -1;
> -	if (!sc->nr_to_scan)
> -		return lru_count;
>  
>  	mutex_lock(&ashmem_mutex);
>  	list_for_each_entry_safe(range, next, &ashmem_lru_list, lru) {
> @@ -374,17 +375,34 @@ static int ashmem_shrink(struct shrinker *s, struct shrink_control *sc)
>  		range->purged = ASHMEM_WAS_PURGED;
>  		lru_del(range);
>  
> -		sc->nr_to_scan -= range_size(range);
> -		if (sc->nr_to_scan <= 0)
> +		freed += range_size(range);
> +		if (--sc->nr_to_scan <= 0)
>  			break;
>  	}
>  	mutex_unlock(&ashmem_mutex);
> +	return freed;
> +}
>  
> +static long
> +ashmem_shrink_count(
> +	struct shrinker		*shrink,
> +	struct shrink_control	*sc)
> +{
> +	/*
> +	 * note that lru_count is count of pages on the lru, not a count of
> +	 * objects on the list. This means the scan function needs to return the
> +	 * number of pages freed, not the number of objects scanned.
> +	 */
>  	return lru_count;
>  }
>  
>  static struct shrinker ashmem_shrinker = {
> -	.shrink = ashmem_shrink,
> +	.count_objects = ashmem_shrink_count,
> +	.scan_objects = ashmem_shrink_scan,
> +	/*
> +	 * XXX (dchinner): I wish people would comment on why they need on
> +	 * significant changes to the default value here
> +	 */
>  	.seeks = DEFAULT_SEEKS * 4,
>  };
>  
> @@ -671,11 +689,9 @@ static long ashmem_ioctl(struct file *file, unsigned int cmd, unsigned long arg)
>  		if (capable(CAP_SYS_ADMIN)) {
>  			struct shrink_control sc = {
>  				.gfp_mask = GFP_KERNEL,
> -				.nr_to_scan = 0,
> +				.nr_to_scan = LONG_MAX,
>  			};
> -			ret = ashmem_shrink(&ashmem_shrinker, &sc);
> -			sc.nr_to_scan = ret;
> -			ashmem_shrink(&ashmem_shrinker, &sc);
> +			ashmem_shrink_scan(&ashmem_shrinker, &sc);
>  		}
>  		break;
>  	}
> diff --git a/drivers/staging/android/lowmemorykiller.c b/drivers/staging/android/lowmemorykiller.c
> index b91e4bc..2bf2c2f 100644
> --- a/drivers/staging/android/lowmemorykiller.c
> +++ b/drivers/staging/android/lowmemorykiller.c
> @@ -63,11 +63,19 @@ static unsigned long lowmem_deathpending_timeout;
>  			printk(x);			\
>  	} while (0)
>  
> -static int lowmem_shrink(struct shrinker *s, struct shrink_control *sc)
> +/*
> + * XXX (dchinner): this should all be using longs, not ints, as
> + * functions like global_page_state, get_mm_rss, etc all return longs or
> + * unsigned longs. Even the shrinker now uses longs....
> + */
> +static long
> +lowmem_shrink_scan(
> +	struct shrinker		*shrink,
> +	struct shrink_control	*sc)
>  {
>  	struct task_struct *tsk;
>  	struct task_struct *selected = NULL;
> -	int rem = 0;
> +	long freed = 0;
>  	int tasksize;
>  	int i;
>  	int min_score_adj = OOM_SCORE_ADJ_MAX + 1;
> @@ -89,19 +97,17 @@ static int lowmem_shrink(struct shrinker *s, struct shrink_control *sc)
>  			break;
>  		}
>  	}
> -	if (sc->nr_to_scan > 0)
> -		lowmem_print(3, "lowmem_shrink %lu, %x, ofree %d %d, ma %d\n",
> -				sc->nr_to_scan, sc->gfp_mask, other_free,
> -				other_file, min_score_adj);
> -	rem = global_page_state(NR_ACTIVE_ANON) +
> -		global_page_state(NR_ACTIVE_FILE) +
> -		global_page_state(NR_INACTIVE_ANON) +
> -		global_page_state(NR_INACTIVE_FILE);
> -	if (sc->nr_to_scan <= 0 || min_score_adj == OOM_SCORE_ADJ_MAX + 1) {
> -		lowmem_print(5, "lowmem_shrink %lu, %x, return %d\n",
> -			     sc->nr_to_scan, sc->gfp_mask, rem);
> -		return rem;
> +	lowmem_print(3, "lowmem_shrink %lu, %x, ofree %d %d, ma %d\n",
> +			sc->nr_to_scan, sc->gfp_mask, other_free,
> +			other_file, min_score_adj);
> +
> +	if (min_score_adj == OOM_SCORE_ADJ_MAX + 1) {
> +		/* nothing to do, no point in calling again */
> +		lowmem_print(5, "lowmem_shrink %lu, %x, return -1\n",
> +			     sc->nr_to_scan, sc->gfp_mask);
> +		return -1;
>  	}
> +
>  	selected_oom_score_adj = min_score_adj;
>  
>  	rcu_read_lock();
> @@ -151,16 +157,32 @@ static int lowmem_shrink(struct shrinker *s, struct shrink_control *sc)
>  		lowmem_deathpending_timeout = jiffies + HZ;
>  		send_sig(SIGKILL, selected, 0);
>  		set_tsk_thread_flag(selected, TIF_MEMDIE);
> -		rem -= selected_tasksize;
> +		freed += selected_tasksize;
>  	}
> -	lowmem_print(4, "lowmem_shrink %lu, %x, return %d\n",
> -		     sc->nr_to_scan, sc->gfp_mask, rem);
> +	lowmem_print(4, "lowmem_shrink %lu, %x, return %ld\n",
> +		     sc->nr_to_scan, sc->gfp_mask, freed);
>  	rcu_read_unlock();
> -	return rem;
> +	return freed;
> +}
> +
> +static long
> +lowmem_shrink_count(
> +	struct shrinker		*shrink,
> +	struct shrink_control	*sc)
> +{
> +	long count;
> +
> +	count = global_page_state(NR_ACTIVE_ANON) +
> +		global_page_state(NR_ACTIVE_FILE) +
> +		global_page_state(NR_INACTIVE_ANON) +
> +		global_page_state(NR_INACTIVE_FILE);
> +	return count;
>  }
>  
>  static struct shrinker lowmem_shrinker = {
> -	.shrink = lowmem_shrink,
> +	.count_objects = lowmem_shrink_count,
> +	.scan_objects = lowmem_shrink_scan,
> +	/* why can't we document magic numbers? */
>  	.seeks = DEFAULT_SEEKS * 16
>  };
>  
> diff --git a/drivers/staging/ramster/zcache-main.c b/drivers/staging/ramster/zcache-main.c
> index a09dd5c..7d50688 100644
> --- a/drivers/staging/ramster/zcache-main.c
> +++ b/drivers/staging/ramster/zcache-main.c
> @@ -1054,12 +1054,13 @@ static bool zcache_freeze;
>   * used by zcache to approximately the same as the total number of LRU_FILE
>   * pageframes in use.
>   */
> -static int shrink_zcache_memory(struct shrinker *shrink,
> -				struct shrink_control *sc)
> +static long
> +zcache_shrink_scan(
> +	struct shrinker		*shrink,
> +	struct shrink_control	*sc)

Why the style changes? If the code has a certain style, there is no
need to alter just one function in it to be at odds with the rest.
>  {
>  	static bool in_progress;
> -	int ret = -1;
> -	int nr = sc->nr_to_scan;
> +	long freed = 0;
>  	int nr_evict = 0;
>  	int nr_unuse = 0;
>  	struct page *page;
> @@ -1067,15 +1068,23 @@ static int shrink_zcache_memory(struct shrinker *shrink,
>  	int unuse_ret;
>  #endif
>  
> -	if (nr <= 0)
> -		goto skip_evict;
> +	/*
> +	 * XXX (dchinner): My kingdom for a mutex! There's no way this should
> +	 * ever be allowed to move out of staging until it supports concurrent
> +	 * shrinkers correctly.
> +	 *
> +	 * This whole shrinker is making me want to claw my eyes out. It has no
> +	 * redeeming values whatsoever and I can't undo the damage it has
> +	 * already done to my brain.

I am sad to hear that your brain has been damaged, but I fear that part
of working for open source is that there is no warranty and it can be
hazardous to your health.

How would you add redeeming values here? Perhaps that should
be added to the TODO file in the root directory of the driver. That is
the usual policy that Greg wants.

Is it that you would like to insert a mutex here? The underlying code
(zcache_evict_eph_pageframe) ends up taking a spinlock.
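
If a mutex is what you are after, an untested sketch of what that
could look like, reusing the names from the patch above:

static DEFINE_MUTEX(zcache_shrink_mutex);

static long
zcache_shrink_scan(struct shrinker *shrink, struct shrink_control *sc)
{
        long freed = 0;

        /* serialise eviction threads without the racy in_progress flag */
        if (!mutex_trylock(&zcache_shrink_mutex))
                return -1;

        /* ... eviction loops as in the patch above ... */

        mutex_unlock(&zcache_shrink_mutex);
        return freed;
}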

> +	 */
>  
>  	/* don't allow more than one eviction thread at a time */
>  	if (in_progress)
> -		goto skip_evict;
> +		return -1;
>  
>  	in_progress = true;
>  
> +

Hm..
>  	/* we are going to ignore nr, and target a different value */
>  	zcache_last_active_file_pageframes =
>  		global_page_state(NR_LRU_BASE + LRU_ACTIVE_FILE);
> @@ -1083,11 +1092,13 @@ static int shrink_zcache_memory(struct shrinker *shrink,
>  		global_page_state(NR_LRU_BASE + LRU_INACTIVE_FILE);
>  	nr_evict = zcache_eph_pageframes - zcache_last_active_file_pageframes +
>  		zcache_last_inactive_file_pageframes;
> +

Hmmm?
>  	while (nr_evict-- > 0) {
>  		page = zcache_evict_eph_pageframe();
>  		if (page == NULL)
>  			break;
>  		zcache_free_page(page);
> +		freed++;
>  	}
>  
>  	zcache_last_active_anon_pageframes =
> @@ -1104,25 +1115,42 @@ static int shrink_zcache_memory(struct shrinker *shrink,
>  		unuse_ret = zcache_frontswap_unuse();
>  		if (unuse_ret == -ENOMEM)
>  			break;
> +		freed++;
>  	}
>  #endif
>  	in_progress = false;
> +	return freed;
> +}
> +
> +
> +/*
> + * XXX (dchinner): the shrinker updates global variables? You've got to be
> + * kidding me! And the object count can (apparently) go negative - that's
> + * *always* a bug so be bloody noisy about it.

The reason for this is that the ramster driver shares code with zcache.
The mechanism to find out whether a page has been freed or not is
currently via these counters - which are actually ssize_t, which is
a bit silly.

Perhaps you could expand the comment to suggest a better mechanism for
keeping that value non-global and atomic?
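
e.g. making the counter atomic is the easy half (illustrative only -
moving it into a per-pool structure rather than a file-scope global
would be the other half):

static atomic_long_t zcache_eph_pageframes = ATOMIC_LONG_INIT(0);

static long
zcache_shrink_count(struct shrinker *shrink, struct shrink_control *sc)
{
        long count = atomic_long_read(&zcache_eph_pageframes);

        /* a negative count is always an accounting bug */
        WARN_ON_ONCE(count < 0);
        return count < 0 ? 0 : count;
}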

> + */
> +static long
> +zcache_shrink_count(
> +	struct shrinker		*shrink,
> +	struct shrink_control	*sc)

Please, no style guide changes.

> +{
> +	long count;
>  
> -skip_evict:
> -	/* resample: has changed, but maybe not all the way yet */
>  	zcache_last_active_file_pageframes =
>  		global_page_state(NR_LRU_BASE + LRU_ACTIVE_FILE);
>  	zcache_last_inactive_file_pageframes =
>  		global_page_state(NR_LRU_BASE + LRU_INACTIVE_FILE);
> -	ret = zcache_eph_pageframes - zcache_last_active_file_pageframes +
> -		zcache_last_inactive_file_pageframes;
> -	if (ret < 0)
> -		ret = 0;
> -	return ret;
> +
> +	count = zcache_last_active_file_pageframes +
> +		zcache_last_inactive_file_pageframes +
> +		zcache_eph_pageframes;

Why not just use the 'free' value?
> +
> +	WARN_ON_ONCE(count < 0);
> +	return count < 0 ? 0 : count;
>  }
>  
>  static struct shrinker zcache_shrinker = {
> -	.shrink = shrink_zcache_memory,
> +	.count_objects = zcache_shrink_count,
> +	.scan_objects = zcache_shrink_scan,
>  	.seeks = DEFAULT_SEEKS,
>  };
>  
> diff --git a/drivers/staging/zcache/zcache-main.c b/drivers/staging/zcache/zcache-main.c
> index 52b43b7..d17ab5d 100644
> --- a/drivers/staging/zcache/zcache-main.c
> +++ b/drivers/staging/zcache/zcache-main.c

What Linux version is this based on? I am not seeing this in v3.10-rc4?

> @@ -536,10 +536,11 @@ static void zbud_evict_zbpg(struct zbud_page *zbpg)
>   * page in use by another cpu, but also to avoid potential deadlock due to
>   * lock inversion.
>   */
> -static void zbud_evict_pages(int nr)
> +static long zbud_evict_pages(int nr)
>  {
>  	struct zbud_page *zbpg;
>  	int i;
> +	long freed = 0;
>  
>  	/* first try freeing any pages on unused list */
>  retry_unused_list:
> @@ -554,6 +555,7 @@ retry_unused_list:
>  		spin_unlock_bh(&zbpg_unused_list_spinlock);
>  		zcache_free_page(zbpg);
>  		zcache_evicted_raw_pages++;
> +		freed++;
>  		if (--nr <= 0)
>  			goto out;
>  		goto retry_unused_list;
> @@ -578,6 +580,7 @@ retry_unbud_list_i:
>  			/* want budlists unlocked when doing zbpg eviction */
>  			zbud_evict_zbpg(zbpg);
>  			local_bh_enable();
> +			freed++;
>  			if (--nr <= 0)
>  				goto out;
>  			goto retry_unbud_list_i;
> @@ -602,13 +605,14 @@ retry_bud_list:
>  		/* want budlists unlocked when doing zbpg eviction */
>  		zbud_evict_zbpg(zbpg);
>  		local_bh_enable();
> +		freed++;
>  		if (--nr <= 0)
>  			goto out;
>  		goto retry_bud_list;
>  	}
>  	spin_unlock_bh(&zbud_budlists_spinlock);
>  out:
> -	return;
> +	return freed;
>  }
>  
>  static void __init zbud_init(void)
> @@ -1527,26 +1531,28 @@ static bool zcache_freeze;
>  /*
>   * zcache shrinker interface (only useful for ephemeral pages, so zbud only)
>   */
> -static int shrink_zcache_memory(struct shrinker *shrink,
> -				struct shrink_control *sc)
> +static long
> +zcache_shrink_scan(
> +	struct shrinker		*shrink,
> +	struct shrink_control	*sc)

Please, no style guide changes.
>  {
> -	int ret = -1;
> -	int nr = sc->nr_to_scan;
> -	gfp_t gfp_mask = sc->gfp_mask;
> +	if (!(sc->gfp_mask & __GFP_FS))
> +		return -1;
>  
> -	if (nr >= 0) {
> -		if (!(gfp_mask & __GFP_FS))
> -			/* does this case really need to be skipped? */
> -			goto out;
> -		zbud_evict_pages(nr);
> -	}
> -	ret = (int)atomic_read(&zcache_zbud_curr_raw_pages);
> -out:
> -	return ret;
> +	return zbud_evict_pages(sc->nr_to_scan);
> +}
> +
> +static long
> +zcache_shrink_count(
> +	struct shrinker		*shrink,
> +	struct shrink_control	*sc)
> +{
> +	return (long)atomic_read(&zcache_zbud_curr_raw_pages);
>  }
>  
>  static struct shrinker zcache_shrinker = {
> -	.shrink = shrink_zcache_memory,
> +	.count_objects = zcache_shrink_count,
> +	.scan_objects = zcache_shrink_scan,
>  	.seeks = DEFAULT_SEEKS,
>  };
>  
> -- 
> 1.7.10
> 

^ permalink raw reply	[flat|nested] 57+ messages in thread

end of thread, other threads:[~2013-06-07 13:37 UTC | newest]

Thread overview: 57+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2012-11-27 23:14 [RFC, PATCH 00/19] Numa aware LRU lists and shrinkers Dave Chinner
2012-11-27 23:14 ` [PATCH 01/19] dcache: convert dentry_stat.nr_unused to per-cpu counters Dave Chinner
2012-11-27 23:14 ` [PATCH 02/19] dentry: move to per-sb LRU locks Dave Chinner
2012-11-27 23:14 ` [PATCH 03/19] dcache: remove dentries from LRU before putting on dispose list Dave Chinner
2012-11-27 23:14 ` [PATCH 04/19] mm: new shrinker API Dave Chinner
2012-11-27 23:14 ` [PATCH 05/19] shrinker: convert superblock shrinkers to new API Dave Chinner
2012-12-20 11:06   ` Glauber Costa
2012-12-21  1:46     ` Dave Chinner
2012-12-21 10:17       ` Glauber Costa
2012-11-27 23:14 ` [PATCH 06/19] list: add a new LRU list type Dave Chinner
2012-11-28 16:10   ` Christoph Hellwig
2012-11-27 23:14 ` [PATCH 07/19] inode: convert inode lru list to generic lru list code Dave Chinner
2012-11-27 23:14 ` [PATCH 08/19] dcache: convert to use new lru list infrastructure Dave Chinner
2012-11-27 23:14 ` [PATCH 09/19] list_lru: per-node " Dave Chinner
2012-12-20 11:21   ` Glauber Costa
2012-12-21  1:54     ` Dave Chinner
2013-01-16 19:21   ` Glauber Costa
2013-01-16 22:55     ` Dave Chinner
2013-01-17  0:35       ` Glauber Costa
2013-01-17  4:22         ` Dave Chinner
2013-01-17 18:21           ` Glauber Costa
2013-01-18  0:10             ` Dave Chinner
2013-01-18  0:14               ` Glauber Costa
2013-01-18  8:11                 ` Dave Chinner
2013-01-18 19:10                   ` Glauber Costa
2013-01-19  0:10                     ` Dave Chinner
2013-01-19  0:13                       ` Glauber Costa
2013-01-18  0:51               ` Glauber Costa
2013-01-18  8:08                 ` Dave Chinner
2013-01-18 19:01                   ` Glauber Costa
2012-11-27 23:14 ` [PATCH 10/19] shrinker: add node awareness Dave Chinner
2012-11-27 23:14 ` [PATCH 11/19] fs: convert inode and dentry shrinking to be node aware Dave Chinner
2012-11-27 23:14 ` [PATCH 12/19] xfs: convert buftarg LRU to generic code Dave Chinner
2012-11-27 23:14 ` [PATCH 13/19] xfs: Node aware direct inode reclaim Dave Chinner
2012-11-27 23:14 ` [PATCH 14/19] xfs: use generic AG walk for background " Dave Chinner
2012-11-27 23:14 ` [PATCH 15/19] xfs: convert dquot cache lru to list_lru Dave Chinner
2012-11-28 16:17   ` Christoph Hellwig
2012-11-27 23:14 ` [PATCH 16/19] fs: convert fs shrinkers to new scan/count API Dave Chinner
2012-11-27 23:14 ` [PATCH 17/19] drivers: convert shrinkers to new count/scan API Dave Chinner
2012-11-28  1:13   ` Chris Wilson
2012-11-28  3:17     ` Dave Chinner
2012-11-28  8:21       ` Glauber Costa
2012-11-28 21:28         ` Dave Chinner
2012-11-29 10:29           ` Glauber Costa
2012-11-29 22:02             ` Dave Chinner
2013-06-07 13:37   ` Konrad Rzeszutek Wilk
2012-11-27 23:14 ` [PATCH 18/19] shrinker: convert remaining shrinkers to " Dave Chinner
2012-11-27 23:14 ` [PATCH 19/19] shrinker: Kill old ->shrink API Dave Chinner
2012-11-29 19:02 ` [RFC, PATCH 00/19] Numa aware LRU lists and shrinkers Andi Kleen
2012-11-29 22:09   ` Dave Chinner
2012-12-20 11:45 ` Glauber Costa
2012-12-21  2:50   ` Dave Chinner
2012-12-21 10:41     ` Glauber Costa
2013-01-21 16:08 ` Glauber Costa
2013-01-21 23:21   ` Dave Chinner
2013-01-23 14:36     ` Glauber Costa
2013-01-23 23:46       ` Dave Chinner

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).