linux-fsdevel.vger.kernel.org archive mirror
* [PATCH 0/5] Per-superblock shrinkers
@ 2010-05-14  7:24 Dave Chinner
  2010-05-14  7:24 ` [PATCH 1/5] inode: Make unused inode LRU per superblock Dave Chinner
                   ` (6 more replies)
  0 siblings, 7 replies; 26+ messages in thread
From: Dave Chinner @ 2010-05-14  7:24 UTC (permalink / raw)
  To: linux-kernel; +Cc: xfs, linux-fsdevel, linux-mm


This series reworks the filesystem shrinkers. The current
implementation has a number of issues:

	1. There is a dependency between dentry and inode cache
	   shrinking that is only implicitly defined by the order of
	   shrinker registration.
	2. The shrinkers need to walk the superblock list and pin
	   the superblock to avoid unmount races with the sb going
	   away.
	3. The dentry cache uses per-superblock LRUs and proportions
	   reclaim between all the superblocks, which means we are
	   doing breadth-based reclaim: we touch every superblock on
	   every shrinker call, and may reclaim only a single dentry
	   at a time from a given superblock.
	4. The inode cache has a global LRU, so it has different
	   reclaim patterns to the dentry cache, despite the fact
	   that the dentry cache is generally the only thing that
	   pins inodes in memory.
	5. Filesystems need to register their own shrinkers for
	   caches and can't co-ordinate them with the dentry and
	   inode cache shrinkers.

The series starts by converting the inode cache to per-superblock
LRUs and changes the shrinker to match the dentry cache (#4).

It then adds a context to the shrinker callouts by passing the
shrinker structure with the callout. With this, a shrinker structure
is added to the superblock structure and a per-superblock shrinker
is registered.  Both the inode and dentry caches are modified to be
shrunk via the superblock shrinker, and this directly encodes the
dcache/icache dependency inside the shrinker (#1).

This shrinker structure also avoids the need to pin the superblock
inside the shrinker because the shrinker is unregistered before the
superblock is freed (#2). Further, it pushes the proportioning of
reclaim between superblocks back up into the shrinker and batches
all the reclaim from a superblock into a tight call loop until the
shrink cycle for that superblock is complete. This effectively
converts reclaim to a depth-based reclaim mechanism which has a
smaller CPU cache footprint than the current mechanism (#3).
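
The shape this takes is roughly the following (a simplified sketch
only -- it omits the s_umount locking, the !__GFP_FS bail-out and the
proportioning that the real prune_super() in patches 3 and 4 has):

struct super_block {
	/* ... */
	struct shrinker	s_shrink;	/* per-sb shrinker handle */
};

static int prune_super(struct shrinker *shrink, int nr_to_scan,
		       gfp_t gfp_mask)
{
	struct super_block *sb;

	sb = container_of(shrink, struct super_block, s_shrink);

	if (nr_to_scan) {
		/* prune the dcache first, then the icache it pins */
		prune_dcache_sb(sb, nr_to_scan / 2);
		prune_icache_sb(sb, nr_to_scan - nr_to_scan / 2);
	}
	/* tell the shrinker how much is left to reclaim on this sb */
	return sb->s_nr_dentry_unused + sb->s_nr_inodes_unused;
}

The shrinker is registered in alloc_super() and unregistered in
destroy_super(), so it can never run against a superblock that has
already been freed.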

Then a pair of superblock operations that can be used to implement
filesystem-specific cache reclaim is added. This is split into two
operations so that we don't need to overload the number of objects to
scan to indicate that a count should be returned.

Finally, the XFS inode cache shrinker is converted to use these
superblock operations, removing the need for XFS to register its own
shrinker, keep a global list of XFS filesystems, and take locks to
access the per-filesystem caches. This fixes several new lockdep
warnings that the XFS shrinker introduced because of the different
contexts the shrinker is called from, and allows reclaim to be
correctly proportioned between the dentry, inode and XFS inode caches
on a filesystem (#5).

 arch/x86/kvm/mmu.c              |    2 +-
 drivers/gpu/drm/i915/i915_gem.c |    2 +-
 fs/dcache.c                     |  137 ++++++---------------------------------
 fs/fs-writeback.c               |    2 +-
 fs/gfs2/glock.c                 |    2 +-
 fs/gfs2/quota.c                 |    2 +-
 fs/gfs2/quota.h                 |    2 +-
 fs/inode.c                      |   64 ++++++-------------
 fs/mbcache.c                    |    5 +-
 fs/nfs/dir.c                    |    2 +-
 fs/nfs/internal.h               |    3 +-
 fs/quota/dquot.c                |    2 +-
 fs/super.c                      |   68 +++++++++++++++++++
 fs/ubifs/shrinker.c             |    2 +-
 fs/ubifs/ubifs.h                |    2 +-
 fs/xfs/linux-2.6/xfs_buf.c      |    5 +-
 fs/xfs/linux-2.6/xfs_super.c    |   23 +++++--
 fs/xfs/linux-2.6/xfs_sync.c     |  123 +++++++++-------------------------
 fs/xfs/linux-2.6/xfs_sync.h     |   16 +++--
 fs/xfs/quota/xfs_qm.c           |    7 ++-
 fs/xfs/quota/xfs_qm_syscalls.c  |    2 +-
 fs/xfs/xfs_mount.h              |    1 -
 include/linux/fs.h              |   22 ++++++
 include/linux/mm.h              |    2 +-
 include/linux/writeback.h       |    1 -
 mm/vmscan.c                     |    8 ++-
 26 files changed, 220 insertions(+), 287 deletions(-)


* [PATCH 1/5] inode: Make unused inode LRU per superblock
  2010-05-14  7:24 [PATCH 0/5] Per-superblock shrinkers Dave Chinner
@ 2010-05-14  7:24 ` Dave Chinner
  2010-05-14  7:24 ` [PATCH 2/5] mm: add context argument to shrinker callback Dave Chinner
                   ` (5 subsequent siblings)
  6 siblings, 0 replies; 26+ messages in thread
From: Dave Chinner @ 2010-05-14  7:24 UTC (permalink / raw)
  To: linux-kernel; +Cc: xfs, linux-fsdevel, linux-mm

From: Dave Chinner <dchinner@redhat.com>

The inode unused list is currently a global LRU. This does not match
the other global filesystem cache - the dentry cache - which uses
per-superblock LRU lists. Hence we have related filesystem object
types using different LRU reclamation schemes.

To enable a per-superblock filesystem cache shrinker, both of these
caches need to have per-sb unused object LRU lists. Hence this patch
converts the global inode LRU to per-sb LRUs.

The patch only does rudimentary per-sb proportioning in the shrinker
infrastructure, as this gets removed when the per-sb shrinker
callouts are introduced later on.

Signed-off-by: Dave Chinner <dchinner@redhat.com>
---
 fs/fs-writeback.c         |    2 +-
 fs/inode.c                |   89 ++++++++++++++++++++++++++++++++++++++++-----
 fs/super.c                |    1 +
 include/linux/fs.h        |    4 ++
 include/linux/writeback.h |    1 -
 5 files changed, 85 insertions(+), 12 deletions(-)

diff --git a/fs/fs-writeback.c b/fs/fs-writeback.c
index 4b37f7c..fd78854 100644
--- a/fs/fs-writeback.c
+++ b/fs/fs-writeback.c
@@ -547,7 +547,7 @@ select_queue:
 			/*
 			 * The inode is clean, unused
 			 */
-			list_move(&inode->i_list, &inode_unused);
+			list_move(&inode->i_list, &inode->i_sb->s_inode_lru);
 		}
 	}
 	inode_sync_complete(inode);
diff --git a/fs/inode.c b/fs/inode.c
index 407bf39..8b95b15 100644
--- a/fs/inode.c
+++ b/fs/inode.c
@@ -74,7 +74,6 @@ static unsigned int i_hash_shift __read_mostly;
  */
 
 LIST_HEAD(inode_in_use);
-LIST_HEAD(inode_unused);
 static struct hlist_head *inode_hashtable __read_mostly;
 
 /*
@@ -294,6 +293,7 @@ void __iget(struct inode *inode)
 	if (!(inode->i_state & (I_DIRTY|I_SYNC)))
 		list_move(&inode->i_list, &inode_in_use);
 	inodes_stat.nr_unused--;
+	inode->i_sb->s_nr_inodes_unused--;
 }
 
 /**
@@ -388,6 +388,7 @@ static int invalidate_list(struct list_head *head, struct list_head *dispose)
 		invalidate_inode_buffers(inode);
 		if (!atomic_read(&inode->i_count)) {
 			list_move(&inode->i_list, dispose);
+			inode->i_sb->s_nr_inodes_unused--;
 			WARN_ON(inode->i_state & I_NEW);
 			inode->i_state |= I_FREEING;
 			count++;
@@ -446,32 +447,31 @@ static int can_unuse(struct inode *inode)
  *
  * Any inodes which are pinned purely because of attached pagecache have their
  * pagecache removed.  We expect the final iput() on that inode to add it to
- * the front of the inode_unused list.  So look for it there and if the
+ * the front of the sb->s_inode_lru list.  So look for it there and if the
  * inode is still freeable, proceed.  The right inode is found 99.9% of the
  * time in testing on a 4-way.
  *
  * If the inode has metadata buffers attached to mapping->private_list then
  * try to remove them.
  */
-static void prune_icache(int nr_to_scan)
+static void shrink_icache_sb(struct super_block *sb, int *nr_to_scan)
 {
 	LIST_HEAD(freeable);
 	int nr_pruned = 0;
 	int nr_scanned;
 	unsigned long reap = 0;
 
-	down_read(&iprune_sem);
 	spin_lock(&inode_lock);
-	for (nr_scanned = 0; nr_scanned < nr_to_scan; nr_scanned++) {
+	for (nr_scanned = *nr_to_scan; nr_scanned >= 0; nr_scanned--) {
 		struct inode *inode;
 
-		if (list_empty(&inode_unused))
+		if (list_empty(&sb->s_inode_lru))
 			break;
 
-		inode = list_entry(inode_unused.prev, struct inode, i_list);
+		inode = list_entry(sb->s_inode_lru.prev, struct inode, i_list);
 
 		if (inode->i_state || atomic_read(&inode->i_count)) {
-			list_move(&inode->i_list, &inode_unused);
+			list_move(&inode->i_list, &sb->s_inode_lru);
 			continue;
 		}
 		if (inode_has_buffers(inode) || inode->i_data.nrpages) {
@@ -483,7 +483,7 @@ static void prune_icache(int nr_to_scan)
 			iput(inode);
 			spin_lock(&inode_lock);
 
-			if (inode != list_entry(inode_unused.next,
+			if (inode != list_entry(sb->s_inode_lru.next,
 						struct inode, i_list))
 				continue;	/* wrong inode or list_empty */
 			if (!can_unuse(inode))
@@ -495,13 +495,80 @@ static void prune_icache(int nr_to_scan)
 		nr_pruned++;
 	}
 	inodes_stat.nr_unused -= nr_pruned;
+	sb->s_nr_inodes_unused -= nr_pruned;
 	if (current_is_kswapd())
 		__count_vm_events(KSWAPD_INODESTEAL, reap);
 	else
 		__count_vm_events(PGINODESTEAL, reap);
 	spin_unlock(&inode_lock);
+	*nr_to_scan = nr_scanned;
 
 	dispose_list(&freeable);
+}
+
+static void prune_icache(int count)
+{
+	struct super_block *sb;
+	int w_count;
+	int unused = inodes_stat.nr_unused;
+	int prune_ratio;
+	int pruned;
+
+	if (unused == 0 || count == 0)
+		return;
+	down_read(&iprune_sem);
+restart:
+	if (count >= unused)
+		prune_ratio = 1;
+	else
+		prune_ratio = unused / count;
+	spin_lock(&sb_lock);
+	list_for_each_entry(sb, &super_blocks, s_list) {
+		if (sb->s_nr_inodes_unused == 0)
+			continue;
+		sb->s_count++;
+		/* Now, we reclaim unused dentrins with fairness.
+		 * We reclaim them same percentage from each superblock.
+		 * We calculate number of dentries to scan on this sb
+		 * as follows, but the implementation is arranged to avoid
+		 * overflows:
+		 * number of dentries to scan on this sb =
+		 * count * (number of dentries on this sb /
+		 * number of dentries in the machine)
+		 */
+		spin_unlock(&sb_lock);
+		if (prune_ratio != 1)
+			w_count = (sb->s_nr_inodes_unused / prune_ratio) + 1;
+		else
+			w_count = sb->s_nr_inodes_unused;
+		pruned = w_count;
+		/*
+		 * We need to be sure this filesystem isn't being unmounted,
+		 * otherwise we could race with generic_shutdown_super(), and
+		 * end up holding a reference to an inode while the filesystem
+		 * is unmounted.  So we try to get s_umount, and make sure
+		 * s_root isn't NULL.
+		 */
+		if (down_read_trylock(&sb->s_umount)) {
+			if ((sb->s_root != NULL) &&
+			    (!list_empty(&sb->s_dentry_lru))) {
+				shrink_icache_sb(sb, &w_count);
+				pruned -= w_count;
+			}
+			up_read(&sb->s_umount);
+		}
+		spin_lock(&sb_lock);
+		count -= pruned;
+		/*
+		 * restart only when sb is no longer on the list and
+		 * we have more work to do.
+		 */
+		if (__put_super_and_need_restart(sb) && count > 0) {
+			spin_unlock(&sb_lock);
+			goto restart;
+		}
+	}
+	spin_unlock(&sb_lock);
 	up_read(&iprune_sem);
 }
 
@@ -1242,8 +1309,9 @@ int generic_detach_inode(struct inode *inode)
 
 	if (!hlist_unhashed(&inode->i_hash)) {
 		if (!(inode->i_state & (I_DIRTY|I_SYNC)))
-			list_move(&inode->i_list, &inode_unused);
+			list_move(&inode->i_list, &sb->s_inode_lru);
 		inodes_stat.nr_unused++;
+		sb->s_nr_inodes_unused++;
 		if (sb->s_flags & MS_ACTIVE) {
 			spin_unlock(&inode_lock);
 			return 0;
@@ -1256,6 +1324,7 @@ int generic_detach_inode(struct inode *inode)
 		WARN_ON(inode->i_state & I_NEW);
 		inode->i_state &= ~I_WILL_FREE;
 		inodes_stat.nr_unused--;
+		sb->s_nr_inodes_unused--;
 		hlist_del_init(&inode->i_hash);
 	}
 	list_del_init(&inode->i_list);
diff --git a/fs/super.c b/fs/super.c
index 1527e6a..18655e6 100644
--- a/fs/super.c
+++ b/fs/super.c
@@ -68,6 +68,7 @@ static struct super_block *alloc_super(struct file_system_type *type)
 		INIT_HLIST_HEAD(&s->s_anon);
 		INIT_LIST_HEAD(&s->s_inodes);
 		INIT_LIST_HEAD(&s->s_dentry_lru);
+		INIT_LIST_HEAD(&s->s_inode_lru);
 		init_rwsem(&s->s_umount);
 		mutex_init(&s->s_lock);
 		lockdep_set_class(&s->s_umount, &type->s_umount_key);
diff --git a/include/linux/fs.h b/include/linux/fs.h
index 44f35ae..41132e3 100644
--- a/include/linux/fs.h
+++ b/include/linux/fs.h
@@ -1348,6 +1348,10 @@ struct super_block {
 	struct list_head	s_dentry_lru;	/* unused dentry lru */
 	int			s_nr_dentry_unused;	/* # of dentry on lru */
 
+	/* s_inode_lru and s_nr_inodes_unused are protected by inode_lock */
+	struct list_head	s_inode_lru;	/* unused inode lru */
+	int			s_nr_inodes_unused;	/* # of inodes on lru */
+
 	struct block_device	*s_bdev;
 	struct backing_dev_info *s_bdi;
 	struct mtd_info		*s_mtd;
diff --git a/include/linux/writeback.h b/include/linux/writeback.h
index 36520de..2636ade 100644
--- a/include/linux/writeback.h
+++ b/include/linux/writeback.h
@@ -11,7 +11,6 @@ struct backing_dev_info;
 
 extern spinlock_t inode_lock;
 extern struct list_head inode_in_use;
-extern struct list_head inode_unused;
 
 /*
  * fs/fs-writeback.c
-- 
1.5.6.5


* [PATCH 2/5] mm: add context argument to shrinker callback
  2010-05-14  7:24 [PATCH 0/5] Per-superblock shrinkers Dave Chinner
  2010-05-14  7:24 ` [PATCH 1/5] inode: Make unused inode LRU per superblock Dave Chinner
@ 2010-05-14  7:24 ` Dave Chinner
  2010-05-14  7:24 ` [PATCH 3/5] superblock: introduce per-sb cache shrinker infrastructure Dave Chinner
                   ` (4 subsequent siblings)
  6 siblings, 0 replies; 26+ messages in thread
From: Dave Chinner @ 2010-05-14  7:24 UTC (permalink / raw)
  To: linux-kernel; +Cc: xfs, linux-fsdevel, linux-mm

From: Dave Chinner <dchinner@redhat.com>

The current shrinker implementation requires the registered callback
to have global state to work from. This makes it difficult to shrink
caches that are not global (e.g. per-filesystem caches). Pass the shrinker
structure to the callback so that users can embed the shrinker structure
in the context the shrinker needs to operate on and get back to it in the
callback via container_of().
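
As an illustration of the intended usage (hypothetical names, not part
of this patch), a cache with no global state can now do:

struct my_cache {
	struct list_head	lru;
	int			nr_objects;
	struct shrinker		shrinker;	/* embedded in the context */
};

static int my_cache_shrink(struct shrinker *shrink, int nr_to_scan,
			   gfp_t gfp_mask)
{
	/* recover the per-instance context from the callback argument */
	struct my_cache *c = container_of(shrink, struct my_cache, shrinker);

	if (nr_to_scan)
		my_cache_reclaim(c, nr_to_scan);	/* hypothetical helper */
	return c->nr_objects;
}

register_shrinker(&c->shrinker) is then all that is needed per
instance; no global list is required to find the cache from inside
the callback.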

Signed-off-by: Dave Chinner <dchinner@redhat.com>
---
 arch/x86/kvm/mmu.c              |    2 +-
 drivers/gpu/drm/i915/i915_gem.c |    2 +-
 fs/dcache.c                     |    2 +-
 fs/gfs2/glock.c                 |    2 +-
 fs/gfs2/quota.c                 |    2 +-
 fs/gfs2/quota.h                 |    2 +-
 fs/inode.c                      |    2 +-
 fs/mbcache.c                    |    5 +++--
 fs/nfs/dir.c                    |    2 +-
 fs/nfs/internal.h               |    3 ++-
 fs/quota/dquot.c                |    2 +-
 fs/ubifs/shrinker.c             |    2 +-
 fs/ubifs/ubifs.h                |    2 +-
 fs/xfs/linux-2.6/xfs_buf.c      |    5 +++--
 fs/xfs/linux-2.6/xfs_sync.c     |    1 +
 fs/xfs/quota/xfs_qm.c           |    7 +++++--
 include/linux/mm.h              |    2 +-
 mm/vmscan.c                     |    8 +++++---
 18 files changed, 31 insertions(+), 22 deletions(-)

diff --git a/arch/x86/kvm/mmu.c b/arch/x86/kvm/mmu.c
index 19a8906..03e689d 100644
--- a/arch/x86/kvm/mmu.c
+++ b/arch/x86/kvm/mmu.c
@@ -2918,7 +2918,7 @@ static void kvm_mmu_remove_one_alloc_mmu_page(struct kvm *kvm)
 	kvm_mmu_zap_page(kvm, page);
 }
 
-static int mmu_shrink(int nr_to_scan, gfp_t gfp_mask)
+static int mmu_shrink(struct shrinker *shrink, int nr_to_scan, gfp_t gfp_mask)
 {
 	struct kvm *kvm;
 	struct kvm *kvm_freed = NULL;
diff --git a/drivers/gpu/drm/i915/i915_gem.c b/drivers/gpu/drm/i915/i915_gem.c
index ef3d91d..cc436ba 100644
--- a/drivers/gpu/drm/i915/i915_gem.c
+++ b/drivers/gpu/drm/i915/i915_gem.c
@@ -5185,7 +5185,7 @@ void i915_gem_release(struct drm_device * dev, struct drm_file *file_priv)
 }
 
 static int
-i915_gem_shrink(int nr_to_scan, gfp_t gfp_mask)
+i915_gem_shrink(struct shrinker *shrink, int nr_to_scan, gfp_t gfp_mask)
 {
 	drm_i915_private_t *dev_priv, *next_dev;
 	struct drm_i915_gem_object *obj_priv, *next_obj;
diff --git a/fs/dcache.c b/fs/dcache.c
index f1358e5..41c35c1 100644
--- a/fs/dcache.c
+++ b/fs/dcache.c
@@ -897,7 +897,7 @@ EXPORT_SYMBOL(shrink_dcache_parent);
  *
  * In this case we return -1 to tell the caller that we baled.
  */
-static int shrink_dcache_memory(int nr, gfp_t gfp_mask)
+static int shrink_dcache_memory(struct shrinker *shrink, int nr, gfp_t gfp_mask)
 {
 	if (nr) {
 		if (!(gfp_mask & __GFP_FS))
diff --git a/fs/gfs2/glock.c b/fs/gfs2/glock.c
index 454d4b4..e947661 100644
--- a/fs/gfs2/glock.c
+++ b/fs/gfs2/glock.c
@@ -1345,7 +1345,7 @@ void gfs2_glock_complete(struct gfs2_glock *gl, int ret)
 }
 
 
-static int gfs2_shrink_glock_memory(int nr, gfp_t gfp_mask)
+static int gfs2_shrink_glock_memory(struct shrinker *shrink, int nr, gfp_t gfp_mask)
 {
 	struct gfs2_glock *gl;
 	int may_demote;
diff --git a/fs/gfs2/quota.c b/fs/gfs2/quota.c
index 6dbcbad..3f2cf67 100644
--- a/fs/gfs2/quota.c
+++ b/fs/gfs2/quota.c
@@ -77,7 +77,7 @@ static LIST_HEAD(qd_lru_list);
 static atomic_t qd_lru_count = ATOMIC_INIT(0);
 static DEFINE_SPINLOCK(qd_lru_lock);
 
-int gfs2_shrink_qd_memory(int nr, gfp_t gfp_mask)
+int gfs2_shrink_qd_memory(struct shrinker *shrink, int nr, gfp_t gfp_mask)
 {
 	struct gfs2_quota_data *qd;
 	struct gfs2_sbd *sdp;
diff --git a/fs/gfs2/quota.h b/fs/gfs2/quota.h
index 195f60c..e7d236c 100644
--- a/fs/gfs2/quota.h
+++ b/fs/gfs2/quota.h
@@ -51,7 +51,7 @@ static inline int gfs2_quota_lock_check(struct gfs2_inode *ip)
 	return ret;
 }
 
-extern int gfs2_shrink_qd_memory(int nr, gfp_t gfp_mask);
+extern int gfs2_shrink_qd_memory(struct shrinker *shrink, int nr, gfp_t gfp_mask);
 extern const struct quotactl_ops gfs2_quotactl_ops;
 
 #endif /* __QUOTA_DOT_H__ */
diff --git a/fs/inode.c b/fs/inode.c
index 8b95b15..b292e41 100644
--- a/fs/inode.c
+++ b/fs/inode.c
@@ -581,7 +581,7 @@ restart:
  * This function is passed the number of inodes to scan, and it returns the
  * total number of remaining possibly-reclaimable inodes.
  */
-static int shrink_icache_memory(int nr, gfp_t gfp_mask)
+static int shrink_icache_memory(struct shrinker *shrink, int nr, gfp_t gfp_mask)
 {
 	if (nr) {
 		/*
diff --git a/fs/mbcache.c b/fs/mbcache.c
index ec88ff3..e28f21b 100644
--- a/fs/mbcache.c
+++ b/fs/mbcache.c
@@ -115,7 +115,7 @@ mb_cache_indexes(struct mb_cache *cache)
  * What the mbcache registers as to get shrunk dynamically.
  */
 
-static int mb_cache_shrink_fn(int nr_to_scan, gfp_t gfp_mask);
+static int mb_cache_shrink_fn(struct shrinker *shrink, int nr_to_scan, gfp_t gfp_mask);
 
 static struct shrinker mb_cache_shrinker = {
 	.shrink = mb_cache_shrink_fn,
@@ -191,13 +191,14 @@ forget:
  * This function is called by the kernel memory management when memory
  * gets low.
  *
+ * @shrink: (ignored)
  * @nr_to_scan: Number of objects to scan
  * @gfp_mask: (ignored)
  *
  * Returns the number of objects which are present in the cache.
  */
 static int
-mb_cache_shrink_fn(int nr_to_scan, gfp_t gfp_mask)
+mb_cache_shrink_fn(struct shrinker *shrink, int nr_to_scan, gfp_t gfp_mask)
 {
 	LIST_HEAD(free_list);
 	struct list_head *l, *ltmp;
diff --git a/fs/nfs/dir.c b/fs/nfs/dir.c
index a7bb5c6..6a0a6f7 100644
--- a/fs/nfs/dir.c
+++ b/fs/nfs/dir.c
@@ -1669,7 +1669,7 @@ static void nfs_access_free_entry(struct nfs_access_entry *entry)
 	smp_mb__after_atomic_dec();
 }
 
-int nfs_access_cache_shrinker(int nr_to_scan, gfp_t gfp_mask)
+int nfs_access_cache_shrinker(struct shrinker *shrink, int nr_to_scan, gfp_t gfp_mask)
 {
 	LIST_HEAD(head);
 	struct nfs_inode *nfsi;
diff --git a/fs/nfs/internal.h b/fs/nfs/internal.h
index 11f82f0..aa022eb 100644
--- a/fs/nfs/internal.h
+++ b/fs/nfs/internal.h
@@ -205,7 +205,8 @@ extern struct rpc_procinfo nfs4_procedures[];
 void nfs_close_context(struct nfs_open_context *ctx, int is_sync);
 
 /* dir.c */
-extern int nfs_access_cache_shrinker(int nr_to_scan, gfp_t gfp_mask);
+extern int nfs_access_cache_shrinker(struct shrinker *shrink,
+					int nr_to_scan, gfp_t gfp_mask);
 
 /* inode.c */
 extern struct workqueue_struct *nfsiod_workqueue;
diff --git a/fs/quota/dquot.c b/fs/quota/dquot.c
index 788b580..187e3f2 100644
--- a/fs/quota/dquot.c
+++ b/fs/quota/dquot.c
@@ -668,7 +668,7 @@ static void prune_dqcache(int count)
  * more memory
  */
 
-static int shrink_dqcache_memory(int nr, gfp_t gfp_mask)
+static int shrink_dqcache_memory(struct shrinker *shrink, int nr, gfp_t gfp_mask)
 {
 	if (nr) {
 		spin_lock(&dq_list_lock);
diff --git a/fs/ubifs/shrinker.c b/fs/ubifs/shrinker.c
index 02feb59..0b20111 100644
--- a/fs/ubifs/shrinker.c
+++ b/fs/ubifs/shrinker.c
@@ -277,7 +277,7 @@ static int kick_a_thread(void)
 	return 0;
 }
 
-int ubifs_shrinker(int nr, gfp_t gfp_mask)
+int ubifs_shrinker(struct shrinker *shrink, int nr, gfp_t gfp_mask)
 {
 	int freed, contention = 0;
 	long clean_zn_cnt = atomic_long_read(&ubifs_clean_zn_cnt);
diff --git a/fs/ubifs/ubifs.h b/fs/ubifs/ubifs.h
index bd2542d..5a92345 100644
--- a/fs/ubifs/ubifs.h
+++ b/fs/ubifs/ubifs.h
@@ -1575,7 +1575,7 @@ int ubifs_tnc_start_commit(struct ubifs_info *c, struct ubifs_zbranch *zroot);
 int ubifs_tnc_end_commit(struct ubifs_info *c);
 
 /* shrinker.c */
-int ubifs_shrinker(int nr_to_scan, gfp_t gfp_mask);
+int ubifs_shrinker(struct shrinker *shrink, int nr_to_scan, gfp_t gfp_mask);
 
 /* commit.c */
 int ubifs_bg_thread(void *info);
diff --git a/fs/xfs/linux-2.6/xfs_buf.c b/fs/xfs/linux-2.6/xfs_buf.c
index 44c2b0e..d2cfc54 100644
--- a/fs/xfs/linux-2.6/xfs_buf.c
+++ b/fs/xfs/linux-2.6/xfs_buf.c
@@ -44,7 +44,7 @@
 
 static kmem_zone_t *xfs_buf_zone;
 STATIC int xfsbufd(void *);
-STATIC int xfsbufd_wakeup(int, gfp_t);
+STATIC int xfsbufd_wakeup(struct shrinker *, int, gfp_t);
 STATIC void xfs_buf_delwri_queue(xfs_buf_t *, int);
 static struct shrinker xfs_buf_shake = {
 	.shrink = xfsbufd_wakeup,
@@ -339,7 +339,7 @@ _xfs_buf_lookup_pages(
 					__func__, gfp_mask);
 
 			XFS_STATS_INC(xb_page_retries);
-			xfsbufd_wakeup(0, gfp_mask);
+			xfsbufd_wakeup(NULL, 0, gfp_mask);
 			congestion_wait(BLK_RW_ASYNC, HZ/50);
 			goto retry;
 		}
@@ -1756,6 +1756,7 @@ xfs_buf_runall_queues(
 
 STATIC int
 xfsbufd_wakeup(
+	struct shrinker		*shrink,
 	int			priority,
 	gfp_t			mask)
 {
diff --git a/fs/xfs/linux-2.6/xfs_sync.c b/fs/xfs/linux-2.6/xfs_sync.c
index a427c63..17ec4a6 100644
--- a/fs/xfs/linux-2.6/xfs_sync.c
+++ b/fs/xfs/linux-2.6/xfs_sync.c
@@ -879,6 +879,7 @@ static struct rw_semaphore xfs_mount_list_lock;
 
 static int
 xfs_reclaim_inode_shrink(
+	struct shrinker	*shrink,
 	int		nr_to_scan,
 	gfp_t		gfp_mask)
 {
diff --git a/fs/xfs/quota/xfs_qm.c b/fs/xfs/quota/xfs_qm.c
index 417e61e..49c8d84 100644
--- a/fs/xfs/quota/xfs_qm.c
+++ b/fs/xfs/quota/xfs_qm.c
@@ -72,7 +72,7 @@ STATIC void	xfs_qm_freelist_destroy(xfs_frlist_t *);
 
 STATIC int	xfs_qm_init_quotainos(xfs_mount_t *);
 STATIC int	xfs_qm_init_quotainfo(xfs_mount_t *);
-STATIC int	xfs_qm_shake(int, gfp_t);
+STATIC int	xfs_qm_shake(struct shrinker *, int, gfp_t);
 
 static struct shrinker xfs_qm_shaker = {
 	.shrink = xfs_qm_shake,
@@ -2088,7 +2088,10 @@ xfs_qm_shake_freelist(
  */
 /* ARGSUSED */
 STATIC int
-xfs_qm_shake(int nr_to_scan, gfp_t gfp_mask)
+xfs_qm_shake(
+	struct shrinker	*shrink,
+	int		nr_to_scan,
+	gfp_t		gfp_mask)
 {
 	int	ndqused, nfree, n;
 
diff --git a/include/linux/mm.h b/include/linux/mm.h
index 462acaf..ff4c44e 100644
--- a/include/linux/mm.h
+++ b/include/linux/mm.h
@@ -995,7 +995,7 @@ static inline void sync_mm_rss(struct task_struct *task, struct mm_struct *mm)
  * querying the cache size, so a fastpath for that case is appropriate.
  */
 struct shrinker {
-	int (*shrink)(int nr_to_scan, gfp_t gfp_mask);
+	int (*shrink)(struct shrinker *, int nr_to_scan, gfp_t gfp_mask);
 	int seeks;	/* seeks to recreate an obj */
 
 	/* These are for internal use */
diff --git a/mm/vmscan.c b/mm/vmscan.c
index 3ff3311..9d56aaf 100644
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -215,8 +215,9 @@ unsigned long shrink_slab(unsigned long scanned, gfp_t gfp_mask,
 	list_for_each_entry(shrinker, &shrinker_list, list) {
 		unsigned long long delta;
 		unsigned long total_scan;
-		unsigned long max_pass = (*shrinker->shrink)(0, gfp_mask);
+		unsigned long max_pass;
 
+		max_pass = (*shrinker->shrink)(shrinker, 0, gfp_mask);
 		delta = (4 * scanned) / shrinker->seeks;
 		delta *= max_pass;
 		do_div(delta, lru_pages + 1);
@@ -244,8 +245,9 @@ unsigned long shrink_slab(unsigned long scanned, gfp_t gfp_mask,
 			int shrink_ret;
 			int nr_before;
 
-			nr_before = (*shrinker->shrink)(0, gfp_mask);
-			shrink_ret = (*shrinker->shrink)(this_scan, gfp_mask);
+			nr_before = (*shrinker->shrink)(shrinker, 0, gfp_mask);
+			shrink_ret = (*shrinker->shrink)(shrinker, this_scan,
+								gfp_mask);
 			if (shrink_ret == -1)
 				break;
 			if (shrink_ret < nr_before)
-- 
1.5.6.5


* [PATCH 3/5] superblock: introduce per-sb cache shrinker infrastructure
  2010-05-14  7:24 [PATCH 0/5] Per-superblock shrinkers Dave Chinner
  2010-05-14  7:24 ` [PATCH 1/5] inode: Make unused inode LRU per superblock Dave Chinner
  2010-05-14  7:24 ` [PATCH 2/5] mm: add context argument to shrinker callback Dave Chinner
@ 2010-05-14  7:24 ` Dave Chinner
  2010-05-14  7:24 ` [PATCH 4/5] superblock: add filesystem shrinker operations Dave Chinner
                   ` (3 subsequent siblings)
  6 siblings, 0 replies; 26+ messages in thread
From: Dave Chinner @ 2010-05-14  7:24 UTC (permalink / raw)
  To: linux-kernel; +Cc: xfs, linux-fsdevel, linux-mm


From: Dave Chinner <dchinner@redhat.com>

With context based shrinkers, we can implement a per-superblock
shrinker that shrinks the caches attached to the superblock. We
currently have global shrinkers for the inode and dentry caches that
split up into per-superblock operations via a coarse proportioning
method that does not batch very well.  The global shrinkers also
have a dependency - dentries pin inodes - so we have to be very
careful about how we register the global shrinkers so that the
implicit call order is always correct.

With a per-sb shrinker callout, we can encode this dependency
directly into the per-sb shrinker, hence avoiding the need for
strictly ordering shrinker registrations. We also have no need for
any proportioning code, as the shrinker subsystem already provides
this functionality across all shrinkers. Allowing the shrinker to
operate on a single superblock at a time means fewer superblock list
traversals, less locking, and better reclaim batching. This should
result in less CPU overhead for reclaim and potentially faster
reclaim of items from each filesystem.
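
As a concrete illustration (made-up numbers) of how the per-sb
callback below splits a scan between the two caches: a call with
nr_to_scan = 128 against a superblock holding 300 unused dentries and
100 unused inodes works out as

	total    = 300 + 100 + 1 = 401
	dentries = 128 * 300 / 401 = 95		(pruned first)
	inodes   = 128 - 95 = 33		(pruned afterwards)

so the dcache is always trimmed before the icache it pins, and the
whole scan is spent on a single superblock.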

Signed-off-by: Dave Chinner <dchinner@redhat.com>
---
 fs/dcache.c        |  137 ++++++++--------------------------------------------
 fs/inode.c         |  111 +++---------------------------------------
 fs/super.c         |   48 ++++++++++++++++++
 include/linux/fs.h |    7 +++
 4 files changed, 84 insertions(+), 219 deletions(-)

diff --git a/fs/dcache.c b/fs/dcache.c
index 41c35c1..2d619d3 100644
--- a/fs/dcache.c
+++ b/fs/dcache.c
@@ -456,21 +456,16 @@ static void prune_one_dentry(struct dentry * dentry)
  * which flags are set. This means we don't need to maintain multiple
  * similar copies of this loop.
  */
-static void __shrink_dcache_sb(struct super_block *sb, int *count, int flags)
+static void __shrink_dcache_sb(struct super_block *sb, int count, int flags)
 {
 	LIST_HEAD(referenced);
 	LIST_HEAD(tmp);
 	struct dentry *dentry;
-	int cnt = 0;
 
 	BUG_ON(!sb);
-	BUG_ON((flags & DCACHE_REFERENCED) && count == NULL);
+	BUG_ON((flags & DCACHE_REFERENCED) && count == -1);
 	spin_lock(&dcache_lock);
-	if (count != NULL)
-		/* called from prune_dcache() and shrink_dcache_parent() */
-		cnt = *count;
-restart:
-	if (count == NULL)
+	if (count == -1)
 		list_splice_init(&sb->s_dentry_lru, &tmp);
 	else {
 		while (!list_empty(&sb->s_dentry_lru)) {
@@ -492,13 +487,13 @@ restart:
 			} else {
 				list_move_tail(&dentry->d_lru, &tmp);
 				spin_unlock(&dentry->d_lock);
-				cnt--;
-				if (!cnt)
+				if (--count == 0)
 					break;
 			}
 			cond_resched_lock(&dcache_lock);
 		}
 	}
+prune_more:
 	while (!list_empty(&tmp)) {
 		dentry = list_entry(tmp.prev, struct dentry, d_lru);
 		dentry_lru_del_init(dentry);
@@ -516,91 +511,30 @@ restart:
 		/* dentry->d_lock was dropped in prune_one_dentry() */
 		cond_resched_lock(&dcache_lock);
 	}
-	if (count == NULL && !list_empty(&sb->s_dentry_lru))
-		goto restart;
-	if (count != NULL)
-		*count = cnt;
+	if (count == -1 && !list_empty(&sb->s_dentry_lru)) {
+		list_splice_init(&sb->s_dentry_lru, &tmp);
+		goto prune_more;
+	}
 	if (!list_empty(&referenced))
 		list_splice(&referenced, &sb->s_dentry_lru);
 	spin_unlock(&dcache_lock);
 }
 
 /**
- * prune_dcache - shrink the dcache
- * @count: number of entries to try to free
+ * prune_dcache_sb - shrink the dcache
+ * @nr_to_scan: number of entries to try to free
  *
- * Shrink the dcache. This is done when we need more memory, or simply when we
- * need to unmount something (at which point we need to unuse all dentries).
+ * Attempt to shrink the superblock dcache LRU by @nr_to_scan entries. This is
+ * done when we need more memory and is called from the superblock shrinker
+ * function.
  *
- * This function may fail to free any resources if all the dentries are in use.
+ * This function may fail to free any resources if all the dentries are in
+ * use.
  */
-static void prune_dcache(int count)
-{
-	struct super_block *sb;
-	int w_count;
-	int unused = dentry_stat.nr_unused;
-	int prune_ratio;
-	int pruned;
 
-	if (unused == 0 || count == 0)
-		return;
-	spin_lock(&dcache_lock);
-restart:
-	if (count >= unused)
-		prune_ratio = 1;
-	else
-		prune_ratio = unused / count;
-	spin_lock(&sb_lock);
-	list_for_each_entry(sb, &super_blocks, s_list) {
-		if (sb->s_nr_dentry_unused == 0)
-			continue;
-		sb->s_count++;
-		/* Now, we reclaim unused dentrins with fairness.
-		 * We reclaim them same percentage from each superblock.
-		 * We calculate number of dentries to scan on this sb
-		 * as follows, but the implementation is arranged to avoid
-		 * overflows:
-		 * number of dentries to scan on this sb =
-		 * count * (number of dentries on this sb /
-		 * number of dentries in the machine)
-		 */
-		spin_unlock(&sb_lock);
-		if (prune_ratio != 1)
-			w_count = (sb->s_nr_dentry_unused / prune_ratio) + 1;
-		else
-			w_count = sb->s_nr_dentry_unused;
-		pruned = w_count;
-		/*
-		 * We need to be sure this filesystem isn't being unmounted,
-		 * otherwise we could race with generic_shutdown_super(), and
-		 * end up holding a reference to an inode while the filesystem
-		 * is unmounted.  So we try to get s_umount, and make sure
-		 * s_root isn't NULL.
-		 */
-		if (down_read_trylock(&sb->s_umount)) {
-			if ((sb->s_root != NULL) &&
-			    (!list_empty(&sb->s_dentry_lru))) {
-				spin_unlock(&dcache_lock);
-				__shrink_dcache_sb(sb, &w_count,
-						DCACHE_REFERENCED);
-				pruned -= w_count;
-				spin_lock(&dcache_lock);
-			}
-			up_read(&sb->s_umount);
-		}
-		spin_lock(&sb_lock);
-		count -= pruned;
-		/*
-		 * restart only when sb is no longer on the list and
-		 * we have more work to do.
-		 */
-		if (__put_super_and_need_restart(sb) && count > 0) {
-			spin_unlock(&sb_lock);
-			goto restart;
-		}
-	}
-	spin_unlock(&sb_lock);
-	spin_unlock(&dcache_lock);
+void prune_dcache_sb(struct super_block *sb, int nr_to_scan)
+{
+	__shrink_dcache_sb(sb, nr_to_scan, DCACHE_REFERENCED);
 }
 
 /**
@@ -613,7 +547,7 @@ restart:
  */
 void shrink_dcache_sb(struct super_block * sb)
 {
-	__shrink_dcache_sb(sb, NULL, 0);
+	__shrink_dcache_sb(sb, -1, 0);
 }
 EXPORT_SYMBOL(shrink_dcache_sb);
 
@@ -881,37 +815,10 @@ void shrink_dcache_parent(struct dentry * parent)
 	int found;
 
 	while ((found = select_parent(parent)) != 0)
-		__shrink_dcache_sb(sb, &found, 0);
+		__shrink_dcache_sb(sb, found, 0);
 }
 EXPORT_SYMBOL(shrink_dcache_parent);
 
-/*
- * Scan `nr' dentries and return the number which remain.
- *
- * We need to avoid reentering the filesystem if the caller is performing a
- * GFP_NOFS allocation attempt.  One example deadlock is:
- *
- * ext2_new_block->getblk->GFP->shrink_dcache_memory->prune_dcache->
- * prune_one_dentry->dput->dentry_iput->iput->inode->i_sb->s_op->put_inode->
- * ext2_discard_prealloc->ext2_free_blocks->lock_super->DEADLOCK.
- *
- * In this case we return -1 to tell the caller that we baled.
- */
-static int shrink_dcache_memory(struct shrinker *shrink, int nr, gfp_t gfp_mask)
-{
-	if (nr) {
-		if (!(gfp_mask & __GFP_FS))
-			return -1;
-		prune_dcache(nr);
-	}
-	return (dentry_stat.nr_unused / 100) * sysctl_vfs_cache_pressure;
-}
-
-static struct shrinker dcache_shrinker = {
-	.shrink = shrink_dcache_memory,
-	.seeks = DEFAULT_SEEKS,
-};
-
 /**
  * d_alloc	-	allocate a dcache entry
  * @parent: parent of entry to allocate
@@ -2318,8 +2225,6 @@ static void __init dcache_init(void)
 	 */
 	dentry_cache = KMEM_CACHE(dentry,
 		SLAB_RECLAIM_ACCOUNT|SLAB_PANIC|SLAB_MEM_SPREAD);
-	
-	register_shrinker(&dcache_shrinker);
 
 	/* Hash may have been set up in dcache_init_early */
 	if (!hashdist)
diff --git a/fs/inode.c b/fs/inode.c
index b292e41..ab8ce3a 100644
--- a/fs/inode.c
+++ b/fs/inode.c
@@ -442,8 +442,10 @@ static int can_unuse(struct inode *inode)
 }
 
 /*
- * Scan `goal' inodes on the unused list for freeable ones. They are moved to
- * a temporary list and then are freed outside inode_lock by dispose_list().
+ * Walk the superblock inode LRU for freeable inodes and attempt to free them.
+ * This is called from the superblock shrinker function with a number of inodes
+ * to trim from the LRU. Inodes to be freed are moved to a temporary list and
+ * then are freed outside inode_lock by dispose_list().
  *
  * Any inodes which are pinned purely because of attached pagecache have their
  * pagecache removed.  We expect the final iput() on that inode to add it to
@@ -451,10 +453,10 @@ static int can_unuse(struct inode *inode)
  * inode is still freeable, proceed.  The right inode is found 99.9% of the
  * time in testing on a 4-way.
  *
- * If the inode has metadata buffers attached to mapping->private_list then
- * try to remove them.
+ * If the inode has metadata buffers attached to mapping->private_list then try
+ * to remove them.
  */
-static void shrink_icache_sb(struct super_block *sb, int *nr_to_scan)
+void prune_icache_sb(struct super_block *sb, int nr_to_scan)
 {
 	LIST_HEAD(freeable);
 	int nr_pruned = 0;
@@ -462,7 +464,7 @@ static void shrink_icache_sb(struct super_block *sb, int *nr_to_scan)
 	unsigned long reap = 0;
 
 	spin_lock(&inode_lock);
-	for (nr_scanned = *nr_to_scan; nr_scanned >= 0; nr_scanned--) {
+	for (nr_scanned = nr_to_scan; nr_scanned >= 0; nr_scanned--) {
 		struct inode *inode;
 
 		if (list_empty(&sb->s_inode_lru))
@@ -501,106 +503,10 @@ static void shrink_icache_sb(struct super_block *sb, int *nr_to_scan)
 	else
 		__count_vm_events(PGINODESTEAL, reap);
 	spin_unlock(&inode_lock);
-	*nr_to_scan = nr_scanned;
 
 	dispose_list(&freeable);
 }
 
-static void prune_icache(int count)
-{
-	struct super_block *sb;
-	int w_count;
-	int unused = inodes_stat.nr_unused;
-	int prune_ratio;
-	int pruned;
-
-	if (unused == 0 || count == 0)
-		return;
-	down_read(&iprune_sem);
-restart:
-	if (count >= unused)
-		prune_ratio = 1;
-	else
-		prune_ratio = unused / count;
-	spin_lock(&sb_lock);
-	list_for_each_entry(sb, &super_blocks, s_list) {
-		if (sb->s_nr_inodes_unused == 0)
-			continue;
-		sb->s_count++;
-		/* Now, we reclaim unused dentrins with fairness.
-		 * We reclaim them same percentage from each superblock.
-		 * We calculate number of dentries to scan on this sb
-		 * as follows, but the implementation is arranged to avoid
-		 * overflows:
-		 * number of dentries to scan on this sb =
-		 * count * (number of dentries on this sb /
-		 * number of dentries in the machine)
-		 */
-		spin_unlock(&sb_lock);
-		if (prune_ratio != 1)
-			w_count = (sb->s_nr_inodes_unused / prune_ratio) + 1;
-		else
-			w_count = sb->s_nr_inodes_unused;
-		pruned = w_count;
-		/*
-		 * We need to be sure this filesystem isn't being unmounted,
-		 * otherwise we could race with generic_shutdown_super(), and
-		 * end up holding a reference to an inode while the filesystem
-		 * is unmounted.  So we try to get s_umount, and make sure
-		 * s_root isn't NULL.
-		 */
-		if (down_read_trylock(&sb->s_umount)) {
-			if ((sb->s_root != NULL) &&
-			    (!list_empty(&sb->s_dentry_lru))) {
-				shrink_icache_sb(sb, &w_count);
-				pruned -= w_count;
-			}
-			up_read(&sb->s_umount);
-		}
-		spin_lock(&sb_lock);
-		count -= pruned;
-		/*
-		 * restart only when sb is no longer on the list and
-		 * we have more work to do.
-		 */
-		if (__put_super_and_need_restart(sb) && count > 0) {
-			spin_unlock(&sb_lock);
-			goto restart;
-		}
-	}
-	spin_unlock(&sb_lock);
-	up_read(&iprune_sem);
-}
-
-/*
- * shrink_icache_memory() will attempt to reclaim some unused inodes.  Here,
- * "unused" means that no dentries are referring to the inodes: the files are
- * not open and the dcache references to those inodes have already been
- * reclaimed.
- *
- * This function is passed the number of inodes to scan, and it returns the
- * total number of remaining possibly-reclaimable inodes.
- */
-static int shrink_icache_memory(struct shrinker *shrink, int nr, gfp_t gfp_mask)
-{
-	if (nr) {
-		/*
-		 * Nasty deadlock avoidance.  We may hold various FS locks,
-		 * and we don't want to recurse into the FS that called us
-		 * in clear_inode() and friends..
-		 */
-		if (!(gfp_mask & __GFP_FS))
-			return -1;
-		prune_icache(nr);
-	}
-	return (inodes_stat.nr_unused / 100) * sysctl_vfs_cache_pressure;
-}
-
-static struct shrinker icache_shrinker = {
-	.shrink = shrink_icache_memory,
-	.seeks = DEFAULT_SEEKS,
-};
-
 static void __wait_on_freeing_inode(struct inode *inode);
 /*
  * Called with the inode lock held.
@@ -1640,7 +1546,6 @@ void __init inode_init(void)
 					 (SLAB_RECLAIM_ACCOUNT|SLAB_PANIC|
 					 SLAB_MEM_SPREAD),
 					 init_once);
-	register_shrinker(&icache_shrinker);
 
 	/* Hash may have been set up in inode_init_early */
 	if (!hashdist)
diff --git a/fs/super.c b/fs/super.c
index 18655e6..339b590 100644
--- a/fs/super.c
+++ b/fs/super.c
@@ -45,6 +45,50 @@
 LIST_HEAD(super_blocks);
 DEFINE_SPINLOCK(sb_lock);
 
+static int prune_super(struct shrinker *shrink, int nr_to_scan, gfp_t gfp_mask)
+{
+	struct super_block *sb;
+	int count;
+
+	sb = container_of(shrink, struct super_block, s_shrink);
+
+	/*
+	 * Deadlock avoidance.  We may hold various FS locks, and we don't want
+	 * to recurse into the FS that called us in clear_inode() and friends..
+	 */
+	if (!(gfp_mask & __GFP_FS))
+		return -1;
+
+	/*
+	 * if we can't get the umount lock, then there's no point having the
+	 * shrinker try again because the sb is being torn down.
+	 */
+	if (!down_read_trylock(&sb->s_umount))
+		return -1;
+
+	if (!sb->s_root) {
+		up_read(&sb->s_umount);
+		return -1;
+	}
+
+	if (nr_to_scan) {
+		/* proportion the scan between the two caches */
+		int total;
+
+		total = sb->s_nr_dentry_unused + sb->s_nr_inodes_unused + 1;
+		count = (nr_to_scan * sb->s_nr_dentry_unused) / total;
+
+		/* prune dcache first as icache is pinned by it */
+		prune_dcache_sb(sb, count);
+		prune_icache_sb(sb, nr_to_scan - count);
+	}
+
+	count = ((sb->s_nr_dentry_unused + sb->s_nr_inodes_unused) / 100)
+						* sysctl_vfs_cache_pressure;
+	up_read(&sb->s_umount);
+	return count;
+}
+
 /**
  *	alloc_super	-	create new superblock
  *	@type:	filesystem type superblock should belong to
@@ -106,6 +150,9 @@ static struct super_block *alloc_super(struct file_system_type *type)
 		s->s_qcop = sb_quotactl_ops;
 		s->s_op = &default_op;
 		s->s_time_gran = 1000000000;
+		s->s_shrink.shrink = prune_super;
+		s->s_shrink.seeks = DEFAULT_SEEKS;
+		register_shrinker(&s->s_shrink);
 	}
 out:
 	return s;
@@ -119,6 +166,7 @@ out:
  */
 static inline void destroy_super(struct super_block *s)
 {
+	unregister_shrinker(&s->s_shrink);
 	security_sb_free(s);
 	kfree(s->s_subtype);
 	kfree(s->s_options);
diff --git a/include/linux/fs.h b/include/linux/fs.h
index 41132e3..6ba3739 100644
--- a/include/linux/fs.h
+++ b/include/linux/fs.h
@@ -382,6 +382,7 @@ struct inodes_stat_t {
 #include <linux/capability.h>
 #include <linux/semaphore.h>
 #include <linux/fiemap.h>
+#include <linux/mm.h>
 
 #include <asm/atomic.h>
 #include <asm/byteorder.h>
@@ -1387,8 +1388,14 @@ struct super_block {
 	 * generic_show_options()
 	 */
 	char *s_options;
+
+	struct shrinker s_shrink;	/* per-sb shrinker handle */
 };
 
+/* superblock cache pruning functions */
+void prune_icache_sb(struct super_block *sb, int nr_to_scan);
+void prune_dcache_sb(struct super_block *sb, int nr_to_scan);
+
 extern struct timespec current_fs_time(struct super_block *sb);
 
 /*
-- 
1.5.6.5


* [PATCH 4/5] superblock: add filesystem shrinker operations
  2010-05-14  7:24 [PATCH 0/5] Per-superblock shrinkers Dave Chinner
                   ` (2 preceding siblings ...)
  2010-05-14  7:24 ` [PATCH 3/5] superblock: introduce per-sb cache shrinker infrastructure Dave Chinner
@ 2010-05-14  7:24 ` Dave Chinner
  2010-05-14  7:24 ` [PATCH 5/5] xfs: make use of new shrinker callout Dave Chinner
                   ` (2 subsequent siblings)
  6 siblings, 0 replies; 26+ messages in thread
From: Dave Chinner @ 2010-05-14  7:24 UTC (permalink / raw)
  To: linux-kernel; +Cc: xfs, linux-fsdevel, linux-mm


From: Dave Chinner <dchinner@redhat.com>

Now we have a per-superblock shrinker implementation, we can add a
filesystem specific callout to it to allow filesystem internal
caches to be shrunk by the superblock shrinker.

Rather than perpetuate the multipurpose shrinker callback API (i.e.
nr_to_scan == 0 meaning "tell me how many freeable objects are in the
cache"), two operations will be added. The first returns the number of
objects that are freeable; the second is the actual shrinker call.
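
A filesystem that opts in would wire these up roughly as follows (a
minimal sketch with hypothetical myfs_* names; the XFS conversion in
the next patch is the real example):

static int myfs_nr_cached_objects(struct super_block *sb)
{
	/* report how many fs-private objects are currently freeable */
	return myfs_count_reclaimable(sb);	/* hypothetical helper */
}

static int myfs_free_cached_objects(struct super_block *sb, int nr_to_scan)
{
	/* try to free nr_to_scan objects, return how many were attempted */
	return myfs_reclaim(sb, nr_to_scan);	/* hypothetical helper */
}

static const struct super_operations myfs_super_operations = {
	/* ... existing operations ... */
	.nr_cached_objects	= myfs_nr_cached_objects,
	.free_cached_objects	= myfs_free_cached_objects,
};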

Signed-off-by: Dave Chinner <dchinner@redhat.com>
---
 fs/super.c         |   43 +++++++++++++++++++++++++++++++------------
 include/linux/fs.h |   11 +++++++++++
 2 files changed, 42 insertions(+), 12 deletions(-)

diff --git a/fs/super.c b/fs/super.c
index 339b590..e98292e 100644
--- a/fs/super.c
+++ b/fs/super.c
@@ -48,7 +48,8 @@ DEFINE_SPINLOCK(sb_lock);
 static int prune_super(struct shrinker *shrink, int nr_to_scan, gfp_t gfp_mask)
 {
 	struct super_block *sb;
-	int count;
+	int	fs_objects = 0;
+	int	total_objects;
 
 	sb = container_of(shrink, struct super_block, s_shrink);
 
@@ -71,22 +72,40 @@ static int prune_super(struct shrinker *shrink, int nr_to_scan, gfp_t gfp_mask)
 		return -1;
 	}
 
-	if (nr_to_scan) {
-		/* proportion the scan between the two caches */
-		int total;
-
-		total = sb->s_nr_dentry_unused + sb->s_nr_inodes_unused + 1;
-		count = (nr_to_scan * sb->s_nr_dentry_unused) / total;
+	if (sb->s_op && sb->s_op->nr_cached_objects)
+		fs_objects = sb->s_op->nr_cached_objects(sb);
 
-		/* prune dcache first as icache is pinned by it */
-		prune_dcache_sb(sb, count);
-		prune_icache_sb(sb, nr_to_scan - count);
+	total_objects = sb->s_nr_dentry_unused +
+			sb->s_nr_inodes_unused + fs_objects + 1;
+	if (nr_to_scan) {
+		int	dentries;
+		int	inodes;
+
+		/* proportion the scan between the caches */
+		dentries = (nr_to_scan * sb->s_nr_dentry_unused) /
+							total_objects;
+		inodes = (nr_to_scan * sb->s_nr_inodes_unused) /
+							total_objects;
+		if (fs_objects)
+			fs_objects = (nr_to_scan * fs_objects) /
+							total_objects;
+		/*
+		 * prune the dcache first as the icache is pinned by it, then
+		 * prune the icache, followed by the filesystem specific caches
+		 */
+		prune_dcache_sb(sb, dentries);
+		prune_icache_sb(sb, inodes);
+		if (sb->s_op && sb->s_op->free_cached_objects) {
+			sb->s_op->free_cached_objects(sb, fs_objects);
+			fs_objects = sb->s_op->nr_cached_objects(sb);
+		}
+		total_objects = sb->s_nr_dentry_unused +
+				sb->s_nr_inodes_unused + fs_objects;
 	}
 
-	count = ((sb->s_nr_dentry_unused + sb->s_nr_inodes_unused) / 100)
-						* sysctl_vfs_cache_pressure;
+	total_objects = (total_objects / 100) * sysctl_vfs_cache_pressure;
 	up_read(&sb->s_umount);
-	return count;
+	return total_objects;
 }
 
 /**
diff --git a/include/linux/fs.h b/include/linux/fs.h
index 6ba3739..ef2e9e2 100644
--- a/include/linux/fs.h
+++ b/include/linux/fs.h
@@ -1591,6 +1591,17 @@ struct super_operations {
 	ssize_t (*quota_write)(struct super_block *, int, const char *, size_t, loff_t);
 #endif
 	int (*bdev_try_to_free_page)(struct super_block*, struct page*, gfp_t);
+
+	/*
+	 * memory shrinker operations.
+	 * ->nr_cached_objects() should return the number of freeable cached
+	 * objects the filesystem holds.
+	 * ->free_cached_objects() should attempt to free the number of cached
+	 * objects indicated. It should return how many objects it attempted to
+	 * free.
+	 */
+	int (*nr_cached_objects)(struct super_block *);
+	int (*free_cached_objects)(struct super_block *, int);
 };
 
 /*
-- 
1.5.6.5


* [PATCH 5/5] xfs: make use of new shrinker callout
  2010-05-14  7:24 [PATCH 0/5] Per-superblock shrinkers Dave Chinner
                   ` (3 preceding siblings ...)
  2010-05-14  7:24 ` [PATCH 4/5] superblock: add filesystem shrinker operations Dave Chinner
@ 2010-05-14  7:24 ` Dave Chinner
  2010-05-14 17:46 ` Defrag in shrinkers (was Re: [PATCH 0/5] Per-superblock shrinkers) Christoph Lameter
  2010-05-15  1:30 ` [PATCH 0/5] Per-superblock shrinkers Al Viro
  6 siblings, 0 replies; 26+ messages in thread
From: Dave Chinner @ 2010-05-14  7:24 UTC (permalink / raw)
  To: linux-kernel; +Cc: xfs, linux-fsdevel, linux-mm

From: Dave Chinner <dchinner@redhat.com>

Convert the inode reclaim shrinker to use the new per-sb shrinker
operations.  This fixes a bunch of lockdep warnings about the
xfs_mount_list_lock being taken in different reclaim contexts by
removing it, and allows the reclaim to be proportioned across
filesystems with no extra code.

Signed-off-by: Dave Chinner <dchinner@redhat.com>
---
 fs/xfs/linux-2.6/xfs_super.c   |   23 ++++++--
 fs/xfs/linux-2.6/xfs_sync.c    |  124 +++++++++++-----------------------------
 fs/xfs/linux-2.6/xfs_sync.h    |   16 +++--
 fs/xfs/quota/xfs_qm_syscalls.c |    2 +-
 fs/xfs/xfs_mount.h             |    1 -
 5 files changed, 61 insertions(+), 105 deletions(-)

diff --git a/fs/xfs/linux-2.6/xfs_super.c b/fs/xfs/linux-2.6/xfs_super.c
index 29f1edc..2bd512a 100644
--- a/fs/xfs/linux-2.6/xfs_super.c
+++ b/fs/xfs/linux-2.6/xfs_super.c
@@ -1209,7 +1209,6 @@ xfs_fs_put_super(
 
 	xfs_unmountfs(mp);
 	xfs_freesb(mp);
-	xfs_inode_shrinker_unregister(mp);
 	xfs_icsb_destroy_counters(mp);
 	xfs_close_devices(mp);
 	xfs_dmops_put(mp);
@@ -1623,8 +1622,6 @@ xfs_fs_fill_super(
 	if (error)
 		goto fail_vnrele;
 
-	xfs_inode_shrinker_register(mp);
-
 	kfree(mtpt);
 	return 0;
 
@@ -1678,6 +1675,22 @@ xfs_fs_get_sb(
 			   mnt);
 }
 
+static int
+xfs_fs_nr_cached_objects(
+	struct super_block	*sb)
+{
+	return xfs_reclaim_inodes_count(XFS_M(sb));
+}
+
+static int
+xfs_fs_free_cached_objects(
+	struct super_block	*sb,
+	int			nr_to_scan)
+{
+	xfs_reclaim_inodes_nr(XFS_M(sb), 0, nr_to_scan);
+	return 0;
+}
+
 static const struct super_operations xfs_super_operations = {
 	.alloc_inode		= xfs_fs_alloc_inode,
 	.destroy_inode		= xfs_fs_destroy_inode,
@@ -1691,6 +1704,8 @@ static const struct super_operations xfs_super_operations = {
 	.statfs			= xfs_fs_statfs,
 	.remount_fs		= xfs_fs_remount,
 	.show_options		= xfs_fs_show_options,
+	.nr_cached_objects	= xfs_fs_nr_cached_objects,
+	.free_cached_objects	= xfs_fs_free_cached_objects,
 };
 
 static struct file_system_type xfs_fs_type = {
@@ -1870,7 +1885,6 @@ init_xfs_fs(void)
 		goto out_cleanup_procfs;
 
 	vfs_initquota();
-	xfs_inode_shrinker_init();
 
 	error = register_filesystem(&xfs_fs_type);
 	if (error)
@@ -1898,7 +1912,6 @@ exit_xfs_fs(void)
 {
 	vfs_exitquota();
 	unregister_filesystem(&xfs_fs_type);
-	xfs_inode_shrinker_destroy();
 	xfs_sysctl_unregister();
 	xfs_cleanup_procfs();
 	xfs_buf_terminate();
diff --git a/fs/xfs/linux-2.6/xfs_sync.c b/fs/xfs/linux-2.6/xfs_sync.c
index 17ec4a6..4f26673 100644
--- a/fs/xfs/linux-2.6/xfs_sync.c
+++ b/fs/xfs/linux-2.6/xfs_sync.c
@@ -137,7 +137,7 @@ restart:
 
 	} while ((*nr_to_scan)--);
 
-	if (skipped) {
+	if (skipped && *nr_to_scan > 0) {
 		delay(1);
 		goto restart;
 	}
@@ -152,14 +152,14 @@ xfs_inode_ag_iterator(
 	int			flags,
 	int			tag,
 	int			exclusive,
-	int			*nr_to_scan)
+	int			nr_to_scan)
 {
 	int			error = 0;
 	int			last_error = 0;
 	xfs_agnumber_t		ag;
-	int			nr;
 
-	nr = nr_to_scan ? *nr_to_scan : INT_MAX;
+	if (nr_to_scan <= 0)
+		nr_to_scan = INT_MAX;
 	for (ag = 0; ag < mp->m_sb.sb_agcount; ag++) {
 		struct xfs_perag	*pag;
 
@@ -169,18 +169,16 @@ xfs_inode_ag_iterator(
 			continue;
 		}
 		error = xfs_inode_ag_walk(mp, pag, execute, flags, tag,
-						exclusive, &nr);
+						exclusive, &nr_to_scan);
 		xfs_perag_put(pag);
 		if (error) {
 			last_error = error;
 			if (error == EFSCORRUPTED)
 				break;
 		}
-		if (nr <= 0)
+		if (nr_to_scan <= 0)
 			break;
 	}
-	if (nr_to_scan)
-		*nr_to_scan = nr;
 	return XFS_ERROR(last_error);
 }
 
@@ -299,7 +297,7 @@ xfs_sync_data(
 	ASSERT((flags & ~(SYNC_TRYLOCK|SYNC_WAIT)) == 0);
 
 	error = xfs_inode_ag_iterator(mp, xfs_sync_inode_data, flags,
-				      XFS_ICI_NO_TAG, 0, NULL);
+				      XFS_ICI_NO_TAG, 0, 0);
 	if (error)
 		return XFS_ERROR(error);
 
@@ -318,7 +316,7 @@ xfs_sync_attr(
 	ASSERT((flags & ~SYNC_WAIT) == 0);
 
 	return xfs_inode_ag_iterator(mp, xfs_sync_inode_attr, flags,
-				     XFS_ICI_NO_TAG, 0, NULL);
+				     XFS_ICI_NO_TAG, 0, 0);
 }
 
 STATIC int
@@ -858,100 +856,44 @@ reclaim:
 
 }
 
+/*
+ * Scan a certain number of inodes for reclaim. nr_to_scan <= 0 means reclaim
+ * every inode that has the reclaim tag set.
+ */
 int
-xfs_reclaim_inodes(
+xfs_reclaim_inodes_nr(
 	xfs_mount_t	*mp,
-	int		mode)
+	int		mode,
+	int		nr_to_scan)
 {
 	return xfs_inode_ag_iterator(mp, xfs_reclaim_inode, mode,
-					XFS_ICI_RECLAIM_TAG, 1, NULL);
+					XFS_ICI_RECLAIM_TAG, 1, nr_to_scan);
 }
 
 /*
- * Shrinker infrastructure.
+ * Return the number of reclaimable inodes in the filesystem for
+ * the shrinker to determine how much to reclaim.
  *
- * This is all far more complex than it needs to be. It adds a global list of
- * mounts because the shrinkers can only call a global context. We need to make
- * the shrinkers pass a context to avoid the need for global state.
+ * Because the inode cache may not have any reclaimable inodes in it, but will
+ * be populated as part of the higher level cleaning, we need to count all
+ * those inodes as reclaimable here as well.
  */
-static LIST_HEAD(xfs_mount_list);
-static struct rw_semaphore xfs_mount_list_lock;
-
-static int
-xfs_reclaim_inode_shrink(
-	struct shrinker	*shrink,
-	int		nr_to_scan,
-	gfp_t		gfp_mask)
+int
+xfs_reclaim_inodes_count(
+	xfs_mount_t	*mp)
 {
-	struct xfs_mount *mp;
-	struct xfs_perag *pag;
-	xfs_agnumber_t	ag;
-	int		reclaimable = 0;
-
-	if (nr_to_scan) {
-		if (!(gfp_mask & __GFP_FS))
-			return -1;
-
-		down_read(&xfs_mount_list_lock);
-		list_for_each_entry(mp, &xfs_mount_list, m_mplist) {
-			xfs_inode_ag_iterator(mp, xfs_reclaim_inode, 0,
-					XFS_ICI_RECLAIM_TAG, 1, &nr_to_scan);
-			if (nr_to_scan <= 0)
-				break;
-		}
-		up_read(&xfs_mount_list_lock);
-	}
-
-	down_read(&xfs_mount_list_lock);
-	list_for_each_entry(mp, &xfs_mount_list, m_mplist) {
-		for (ag = 0; ag < mp->m_sb.sb_agcount; ag++) {
+	xfs_agnumber_t		ag;
+	int			reclaimable = 0;
 
-			pag = xfs_perag_get(mp, ag);
-			if (!pag->pag_ici_init) {
-				xfs_perag_put(pag);
-				continue;
-			}
-			reclaimable += pag->pag_ici_reclaimable;
+	for (ag = 0; ag < mp->m_sb.sb_agcount; ag++) {
+		struct xfs_perag *pag = xfs_perag_get(mp, ag);
+		if (!pag->pag_ici_init) {
 			xfs_perag_put(pag);
+			continue;
 		}
+		reclaimable += pag->pag_ici_reclaimable;
+		xfs_perag_put(pag);
 	}
-	up_read(&xfs_mount_list_lock);
-	return reclaimable;
-}
-
-static struct shrinker xfs_inode_shrinker = {
-	.shrink = xfs_reclaim_inode_shrink,
-	.seeks = DEFAULT_SEEKS,
-};
-
-void __init
-xfs_inode_shrinker_init(void)
-{
-	init_rwsem(&xfs_mount_list_lock);
-	register_shrinker(&xfs_inode_shrinker);
-}
-
-void
-xfs_inode_shrinker_destroy(void)
-{
-	ASSERT(list_empty(&xfs_mount_list));
-	unregister_shrinker(&xfs_inode_shrinker);
-}
-
-void
-xfs_inode_shrinker_register(
-	struct xfs_mount	*mp)
-{
-	down_write(&xfs_mount_list_lock);
-	list_add_tail(&mp->m_mplist, &xfs_mount_list);
-	up_write(&xfs_mount_list_lock);
+	return reclaimable + mp->m_super->s_nr_inodes_unused;
 }
 
-void
-xfs_inode_shrinker_unregister(
-	struct xfs_mount	*mp)
-{
-	down_write(&xfs_mount_list_lock);
-	list_del(&mp->m_mplist);
-	up_write(&xfs_mount_list_lock);
-}
diff --git a/fs/xfs/linux-2.6/xfs_sync.h b/fs/xfs/linux-2.6/xfs_sync.h
index cdcbaac..c55f645 100644
--- a/fs/xfs/linux-2.6/xfs_sync.h
+++ b/fs/xfs/linux-2.6/xfs_sync.h
@@ -43,7 +43,14 @@ void xfs_quiesce_attr(struct xfs_mount *mp);
 
 void xfs_flush_inodes(struct xfs_inode *ip);
 
-int xfs_reclaim_inodes(struct xfs_mount *mp, int mode);
+int xfs_reclaim_inodes_count(struct xfs_mount *mp);
+int xfs_reclaim_inodes_nr(struct xfs_mount *mp, int mode, int nr_to_scan);
+
+static inline int
+xfs_reclaim_inodes(struct xfs_mount *mp, int mode)
+{
+	return xfs_reclaim_inodes_nr(mp, mode, 0);
+}
 
 void xfs_inode_set_reclaim_tag(struct xfs_inode *ip);
 void __xfs_inode_set_reclaim_tag(struct xfs_perag *pag, struct xfs_inode *ip);
@@ -53,11 +60,6 @@ void __xfs_inode_clear_reclaim_tag(struct xfs_mount *mp, struct xfs_perag *pag,
 int xfs_sync_inode_valid(struct xfs_inode *ip, struct xfs_perag *pag);
 int xfs_inode_ag_iterator(struct xfs_mount *mp,
 	int (*execute)(struct xfs_inode *ip, struct xfs_perag *pag, int flags),
-	int flags, int tag, int write_lock, int *nr_to_scan);
-
-void xfs_inode_shrinker_init(void);
-void xfs_inode_shrinker_destroy(void);
-void xfs_inode_shrinker_register(struct xfs_mount *mp);
-void xfs_inode_shrinker_unregister(struct xfs_mount *mp);
+	int flags, int tag, int write_lock, int nr_to_scan);
 
 #endif
diff --git a/fs/xfs/quota/xfs_qm_syscalls.c b/fs/xfs/quota/xfs_qm_syscalls.c
index 50bee07..94c0cac 100644
--- a/fs/xfs/quota/xfs_qm_syscalls.c
+++ b/fs/xfs/quota/xfs_qm_syscalls.c
@@ -892,7 +892,7 @@ xfs_qm_dqrele_all_inodes(
 {
 	ASSERT(mp->m_quotainfo);
 	xfs_inode_ag_iterator(mp, xfs_dqrele_inode, flags,
-				XFS_ICI_NO_TAG, 0, NULL);
+				XFS_ICI_NO_TAG, 0, 0);
 }
 
 /*------------------------------------------------------------------------*/
diff --git a/fs/xfs/xfs_mount.h b/fs/xfs/xfs_mount.h
index 9ff48a1..4fa0bc7 100644
--- a/fs/xfs/xfs_mount.h
+++ b/fs/xfs/xfs_mount.h
@@ -259,7 +259,6 @@ typedef struct xfs_mount {
 	wait_queue_head_t	m_wait_single_sync_task;
 	__int64_t		m_update_flags;	/* sb flags we need to update
 						   on the next remount,rw */
-	struct list_head	m_mplist;	/* inode shrinker mount list */
 } xfs_mount_t;
 
 /*
-- 
1.5.6.5


* Defrag in shrinkers (was Re: [PATCH 0/5] Per-superblock shrinkers)
  2010-05-14  7:24 [PATCH 0/5] Per-superblock shrinkers Dave Chinner
                   ` (4 preceding siblings ...)
  2010-05-14  7:24 ` [PATCH 5/5] xfs: make use of new shrinker callout Dave Chinner
@ 2010-05-14 17:46 ` Christoph Lameter
  2010-05-14 20:36   ` Defrag in shrinkers Andi Kleen
  2010-05-15  1:15   ` Defrag in shrinkers (was Re: [PATCH 0/5] Per-superblock shrinkers) Dave Chinner
  2010-05-15  1:30 ` [PATCH 0/5] Per-superblock shrinkers Al Viro
  6 siblings, 2 replies; 26+ messages in thread
From: Christoph Lameter @ 2010-05-14 17:46 UTC (permalink / raw)
  To: Dave Chinner
  Cc: linux-kernel, xfs, linux-fsdevel, linux-mm, Pekka Enberg, npiggin

Would it also be possible to add some defragmentation logic when you
revise the shrinkers? Here is a prototype patch that would allow you to
determine the other objects sitting in the same page as a given object.

With that I hope that you have enough information to determine if it's
worth evicting the other objects as well to reclaim the slab page.


From: Christoph Lameter <cl@linux-foundation.org>
Subject: Slab allocators: Introduce function to determine other objects in the same slab page

kmem_cache_objects() can be used to determine other objects sharing the same
slab. With such knowledge a slab user can intentionally free all slab objects
in a slab to allow the freeing of the slab as a whole. This is particularly
important for the dentry and inode cache handling since they reclaim objects
in LRU fashion. With this function they can see if the object is sitting in
a sparsely populated slab page and if so decide to reclaim the other objects
in the slab page. In many situations we can otherwise get high memory use
since only a very small portion of the available object slots are in use
(this can occur after a file scan or when the computational load on a server
changes).

kmem_cache_objects() returns the number of objects currently in use. A parameter
allows the retrieval of the maximum number of objects that would fit into this
slab page.

The user must then use these numbers to determine if an effort should be made
to free the remaining objects. The allocated objects are returned in an array
of pointers.

Objects can only stay allocated if the user has some way of locking out
kmem_cache_free() operations on the slab. Otherwise the operations on the
returned object pointers cause race conditions.
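
For illustration, a cache owner could use it roughly like this (the
dentry-specific details, the list size and the occupancy threshold are
made up for the example, and the caller is assumed to hold a lock that
keeps kmem_cache_free() away from these objects):

	/*
	 * Illustrative only: is it worth evicting the other objects in
	 * the slab page containing @dentry in order to free the page?
	 */
	static bool slab_page_worth_evicting(struct kmem_cache *cache,
					     struct dentry *dentry)
	{
		const void *objects[64];	/* arbitrary list size */
		int capacity, inuse;

		inuse = kmem_cache_objects(cache, dentry, objects,
					   ARRAY_SIZE(objects), &capacity);
		if (inuse < 0)
			return false;	/* -E2BIG or -EBUSY: leave it alone */

		/* sparsely populated page: freeing the rest frees the page */
		return inuse * 4 <= capacity;
	}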

Signed-off-by: Christoph Lameter <cl@linux-foundation.org>

---
 include/linux/slab.h |   18 ++++++++++++++++++
 mm/slab.c            |   23 +++++++++++++++++++++++
 mm/slob.c            |    6 ++++++
 mm/slub.c            |   42 ++++++++++++++++++++++++++++++++++++++++++
 4 files changed, 89 insertions(+)

Index: linux-2.6/include/linux/slab.h
===================================================================
--- linux-2.6.orig/include/linux/slab.h	2010-05-14 12:24:44.000000000 -0500
+++ linux-2.6/include/linux/slab.h	2010-05-14 12:37:36.000000000 -0500
@@ -110,6 +110,24 @@ int kern_ptr_validate(const void *ptr, u
 int kmem_ptr_validate(struct kmem_cache *cachep, const void *ptr);

 /*
+ * Determine objects in the same slab page as a given object.
+ *
+ * The return value is the number of objects currently allocated in the slab
+ * page or a negative error value. The maximum number of objects that this
+ * slab page can hold is returned in *capacity.
+ *
+ * Warning: The objects returned can be freed at any time and therefore the
+ * pointers can become invalid unless other measures are taken to prevent
+ * objects from being freed while looping through the list.
+ *
+ * Return codes:
+ *	-E2BIG	More objects than fit into the provided list.
+ *	-EBUSY	Objects in the slab are in allocation queues.
+ */
+int kmem_cache_objects(struct kmem_cache *slab, const void *x,
+		 const void **list, int max, int *capacity);
+
+/*
  * Please use this macro to create slab caches. Simply specify the
  * name of the structure and maybe some flags that are listed above.
  *
Index: linux-2.6/mm/slub.c
===================================================================
--- linux-2.6.orig/mm/slub.c	2010-05-14 12:37:27.000000000 -0500
+++ linux-2.6/mm/slub.c	2010-05-14 12:40:10.000000000 -0500
@@ -2868,6 +2868,48 @@ void kfree(const void *x)
 }
 EXPORT_SYMBOL(kfree);

+static void get_object(struct kmem_cache *s, void *object, void *private)
+{
+	const void ***list = private;
+
+	*(*list)++ = object;
+}
+
+int kmem_cache_objects(struct kmem_cache *s, const void *x,
+		const void **list, int list_size, int *capacity)
+{
+	int r;
+	struct page *page;
+	unsigned long *map;
+
+	page = virt_to_head_page(x);
+	BUG_ON(!PageSlab(page));
+	BUG_ON(page->slab != s);
+	*capacity = page->objects;
+
+	map = kmalloc(BITS_TO_LONGS(page->objects) * sizeof(*map), GFP_KERNEL);
+
+	slab_lock(page);
+	r = page->inuse;
+
+	if (page->inuse > list_size) {
+		r = -E2BIG;
+		goto abort;
+	}
+
+	if (PageSlubFrozen(page)) {
+		r = -EBUSY;
+		goto abort;
+	}
+
+	traverse_objects(s, page, get_object, &list, map);
+
+abort:
+	slab_unlock(page);
+	kfree(map);
+	return r;
+}
+
 /*
  * kmem_cache_shrink removes empty slabs from the partial lists and sorts
  * the remaining slabs by the number of items in use. The slabs with the
Index: linux-2.6/mm/slab.c
===================================================================
--- linux-2.6.orig/mm/slab.c	2010-05-14 12:24:44.000000000 -0500
+++ linux-2.6/mm/slab.c	2010-05-14 12:37:36.000000000 -0500
@@ -3617,6 +3617,29 @@ out:
 	return 0;
 }

+int kmem_cache_objects(struct kmem_cache *cachep, const void *objp,
+		const void **list, int list_size, int *capacity)
+{
+	struct slab *slabp = virt_to_slab(objp);
+	void *p;
+	int i;
+
+	BUG_ON(cachep != virt_to_cache(objp));
+
+	*capacity = cachep->num;
+	if (slabp->inuse > list_size)
+		return -E2BIG;
+
+	for (i = 0, p = slabp->s_mem; i < cachep->num;
+				 i++, p += cachep->buffer_size) {
+
+		if (slab_bufctl(slabp)[i] == BUFCTL_ACTIVE)
+			*list++ = p;
+
+	}
+	return slabp->inuse;
+}
+
 #ifdef CONFIG_NUMA
 void *kmem_cache_alloc_node(struct kmem_cache *cachep, gfp_t flags, int nodeid)
 {
Index: linux-2.6/mm/slob.c
===================================================================
--- linux-2.6.orig/mm/slob.c	2010-05-14 12:24:44.000000000 -0500
+++ linux-2.6/mm/slob.c	2010-05-14 12:37:36.000000000 -0500
@@ -658,6 +658,12 @@ void kmem_cache_free(struct kmem_cache *
 }
 EXPORT_SYMBOL(kmem_cache_free);

+int kmem_cache_objects(struct kmem_cache *c, const void *b, const void **list,
+						int list_size, int *capacity)
+{
+	return -EBUSY;
+}
+
 unsigned int kmem_cache_size(struct kmem_cache *c)
 {
 	return c->size;


^ permalink raw reply	[flat|nested] 26+ messages in thread

* Re: Defrag in shrinkers
  2010-05-14 17:46 ` Defrag in shrinkers (was Re: [PATCH 0/5] Per-superblock shrinkers) Christoph Lameter
@ 2010-05-14 20:36   ` Andi Kleen
  2010-05-15 17:08     ` Ed Tomlinson
  2010-05-15  1:15   ` Defrag in shrinkers (was Re: [PATCH 0/5] Per-superblock shrinkers) Dave Chinner
  1 sibling, 1 reply; 26+ messages in thread
From: Andi Kleen @ 2010-05-14 20:36 UTC (permalink / raw)
  To: Christoph Lameter
  Cc: Dave Chinner, linux-kernel, xfs, linux-fsdevel, linux-mm,
	Pekka Enberg, npiggin

Christoph Lameter <cl@linux.com> writes:

> Would it also be possible to add some defragmentation logic when you
> revise the shrinkers? Here is a prototype patch that would allow you to
> determine the other objects sitting in the same page as a given object.
>
> With that I hope that you have enough information to determine if its
> worth to evict the other objects as well to reclaim the slab page.

I like the idea, it would be useful for the hwpoison code too,
when it tries to clean a page.

-Andi

-- 
ak@linux.intel.com -- Speaking for myself only.


^ permalink raw reply	[flat|nested] 26+ messages in thread

* Re: Defrag in shrinkers (was Re: [PATCH 0/5] Per-superblock shrinkers)
  2010-05-14 17:46 ` Defrag in shrinkers (was Re: [PATCH 0/5] Per-superblock shrinkers) Christoph Lameter
  2010-05-14 20:36   ` Defrag in shrinkers Andi Kleen
@ 2010-05-15  1:15   ` Dave Chinner
  1 sibling, 0 replies; 26+ messages in thread
From: Dave Chinner @ 2010-05-15  1:15 UTC (permalink / raw)
  To: Christoph Lameter
  Cc: linux-kernel, xfs, linux-fsdevel, linux-mm, Pekka Enberg, npiggin

On Fri, May 14, 2010 at 12:46:52PM -0500, Christoph Lameter wrote:
> Would it also be possible to add some defragmentation logic when you
> revise the shrinkers? Here is a prototype patch that would allow you to
> determine the other objects sitting in the same page as a given object.
> 
> With that I hope that you have enough information to determine if its
> worth to evict the other objects as well to reclaim the slab page.

I'll have a think about how this might fit in - the real problem is
when the list returns objects that belong to a different superblock.
We can only safely check whether the object belongs to the current
superblock - to check if it belongs to a different sb we have a lot of
locks and reference counting to juggle. That would require
re-introducing all the muck (and then some) that this patchset
removes from the shrinkers.

Perhaps just freeing the objects that belong to the current sb would
be sufficient to realise significant improvements (will be fine for
systems that only have one active or dominant filesystem), but I
think some experimentation would be needed.

That brings us to test cases - we need a good one. I think we
need to re-evaluate where we stand with regard to slab fragmentation
(which probably hasn't changed much), and we need to be able to
quantify the amount of improvement the increase in complexity will
provide.  I don't have anything close to hand to generate such
fragmentation, so it might take a little time to write a test that
does the IO patterns I know will generate problems...

Cheers,

Dave.
-- 
Dave Chinner
david@fromorbit.com


^ permalink raw reply	[flat|nested] 26+ messages in thread

* Re: [PATCH 0/5] Per-superblock shrinkers
  2010-05-14  7:24 [PATCH 0/5] Per-superblock shrinkers Dave Chinner
                   ` (5 preceding siblings ...)
  2010-05-14 17:46 ` Defrag in shrinkers (was Re: [PATCH 0/5] Per-superblock shrinkers) Christoph Lameter
@ 2010-05-15  1:30 ` Al Viro
  2010-05-17  0:19   ` Dave Chinner
  6 siblings, 1 reply; 26+ messages in thread
From: Al Viro @ 2010-05-15  1:30 UTC (permalink / raw)
  To: Dave Chinner; +Cc: linux-kernel, xfs, linux-fsdevel, linux-mm

On Fri, May 14, 2010 at 05:24:18PM +1000, Dave Chinner wrote:
> 
> This series reworks the filesystem shrinkers. We currently have a
> set of issues with the current filesystem shrinkers:
> 
> 	1. There is an dependency between dentry and inode cache
> 	   shrinking that is only implicitly defined by the order of
> 	   shrinker registration.
> 	2. The shrinkers need to walk the superblock list and pin
> 	   the superblock to avoid unmount races with the sb going
> 	   away.
> 	3. The dentry cache uses per-superblock LRUs and proportions
> 	   reclaim between all the superblocks which means we are
> 	   doing breadth based reclaim. This means we touch every
> 	   superblock for every shrinker call, and may only reclaim
> 	   a single dentry at a time from a given superblock.
> 	4. The inode cache has a global LRU, so it has different
> 	   reclaim patterns to the dentry cache, despite the fact
> 	   that the dentry cache is generally the only thing that
> 	   pins inodes in memory.
> 	5. Filesystems need to register their own shrinkers for
> 	   caches and can't co-ordinate them with the dentry and
> 	   inode cache shrinkers.

NAK in that form; sb refcounting and iterators had been reworked for .34,
so at least it needs rediff on top of that.  What's more, it's very
obviously broken wrt locking - you are unregistering a shrinker
from __put_super().  I.e. grab rwsem exclusively under a spinlock.

Essentially, you've turned dropping a _passive_ reference to superblock
(currently an operation safe in any context) into an operation allowed
only when no fs or vm locks are held by caller.  Not going to work...


^ permalink raw reply	[flat|nested] 26+ messages in thread

* Re: Defrag in shrinkers
  2010-05-14 20:36   ` Defrag in shrinkers Andi Kleen
@ 2010-05-15 17:08     ` Ed Tomlinson
  2010-05-17  0:24       ` Dave Chinner
  0 siblings, 1 reply; 26+ messages in thread
From: Ed Tomlinson @ 2010-05-15 17:08 UTC (permalink / raw)
  To: Andi Kleen
  Cc: Christoph Lameter, Dave Chinner, linux-kernel, xfs, linux-fsdevel,
	linux-mm, Pekka Enberg, npiggin

On Friday 14 May 2010 16:36:03 Andi Kleen wrote:
> Christoph Lameter <cl@linux.com> writes:
> 
> > Would it also be possible to add some defragmentation logic when you
> > revise the shrinkers? Here is a prototype patch that would allow you to
> > determine the other objects sitting in the same page as a given object.
> >
> > With that I hope that you have enough information to determine if its
> > worth to evict the other objects as well to reclaim the slab page.
> 
> I like the idea, it would be useful for the hwpoison code too,
> when it tries to clean a page.

If this is done generally we probably want to retune the 'pressure' put on the slab.  The
whole reason for the callbacks was to keep the 'pressure' on the slab proportional to the
memory pressure (scan rate).  

Ed Tomlinson

^ permalink raw reply	[flat|nested] 26+ messages in thread

* Re: [PATCH 0/5] Per-superblock shrinkers
  2010-05-15  1:30 ` [PATCH 0/5] Per-superblock shrinkers Al Viro
@ 2010-05-17  0:19   ` Dave Chinner
  0 siblings, 0 replies; 26+ messages in thread
From: Dave Chinner @ 2010-05-17  0:19 UTC (permalink / raw)
  To: Al Viro; +Cc: linux-kernel, xfs, linux-fsdevel, linux-mm

On Sat, May 15, 2010 at 02:30:05AM +0100, Al Viro wrote:
> On Fri, May 14, 2010 at 05:24:18PM +1000, Dave Chinner wrote:
> > 
> > This series reworks the filesystem shrinkers. We currently have a
> > set of issues with the current filesystem shrinkers:
> > 
> > 	1. There is an dependency between dentry and inode cache
> > 	   shrinking that is only implicitly defined by the order of
> > 	   shrinker registration.
> > 	2. The shrinkers need to walk the superblock list and pin
> > 	   the superblock to avoid unmount races with the sb going
> > 	   away.
> > 	3. The dentry cache uses per-superblock LRUs and proportions
> > 	   reclaim between all the superblocks which means we are
> > 	   doing breadth based reclaim. This means we touch every
> > 	   superblock for every shrinker call, and may only reclaim
> > 	   a single dentry at a time from a given superblock.
> > 	4. The inode cache has a global LRU, so it has different
> > 	   reclaim patterns to the dentry cache, despite the fact
> > 	   that the dentry cache is generally the only thing that
> > 	   pins inodes in memory.
> > 	5. Filesystems need to register their own shrinkers for
> > 	   caches and can't co-ordinate them with the dentry and
> > 	   inode cache shrinkers.
> 
> NAK in that form; sb refcounting and iterators had been reworked for .34,
> so at least it needs rediff on top of that.

The tree I based this on was 2.6.34-rc7 - is there new code in a
-next branch somewhere?

> What's more, it's very
> obviously broken wrt locking - you are unregistering a shrinker
> from __put_super().  I.e. grab rwsem exclusively under a spinlock.
> Essentially, you've turned dropping a _passive_ reference to superblock
> (currently an operation safe in any context) into an operation allowed
> only when no fs or vm locks are held by caller.  Not going to work...

Yeah, I picked that up after I posted it. My bad - I'll look into how
I can rework that for the next iteration.

Cheers,

Dave.
-- 
Dave Chinner
david@fromorbit.com


^ permalink raw reply	[flat|nested] 26+ messages in thread

* Re: Defrag in shrinkers
  2010-05-15 17:08     ` Ed Tomlinson
@ 2010-05-17  0:24       ` Dave Chinner
  0 siblings, 0 replies; 26+ messages in thread
From: Dave Chinner @ 2010-05-17  0:24 UTC (permalink / raw)
  To: Ed Tomlinson
  Cc: Andi Kleen, Christoph Lameter, linux-kernel, xfs, linux-fsdevel,
	linux-mm, Pekka Enberg, npiggin

On Sat, May 15, 2010 at 01:08:17PM -0400, Ed Tomlinson wrote:
> On Friday 14 May 2010 16:36:03 Andi Kleen wrote:
> > Christoph Lameter <cl@linux.com> writes:
> > 
> > > Would it also be possible to add some defragmentation logic when you
> > > revise the shrinkers? Here is a prototype patch that would allow you to
> > > determine the other objects sitting in the same page as a given object.
> > >
> > > With that I hope that you have enough information to determine if its
> > > worth to evict the other objects as well to reclaim the slab page.
> > 
> > I like the idea, it would be useful for the hwpoison code too,
> > when it tries to clean a page.
> 
> If this is done generally we probably want to retune the 'pressure' put on the slab.  The
> whole reason for the callbacks was to keep the 'pressure on the slab proportional to the
> memory pressure (scan rate).  

I don't see that defrag based reclaim changes the concept of
pressure at all. As long as reclaim follows the nr_to_scan
guideline, then it doesn't matter if we do reclaim from the LRU or
reclaim from a list provided by the slab cache....

FWIW, one thing that would be necessary, I think, is to avoid defrag
until a certain level of fragmentation has occurred - we should do
LRU-based reclaim as much as possible, and only trigger defrag-style
reclaim once we hit a trigger (e.g. once the slab is 25% partial
pages).
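
As a rough sketch of what that sort of gate might look like (the names
and the 25% ratio are purely illustrative - nothing like this exists,
and it assumes we can get total vs. partial page counts for the cache):

	/* illustrative only: gate defrag-style reclaim on fragmentation */
	static bool should_defrag_cache(unsigned long total_pages,
					unsigned long partial_pages)
	{
		if (!total_pages)
			return false;

		/* stick to LRU reclaim until 25% of the pages are partial */
		return partial_pages * 4 >= total_pages;
	}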

Cheers,

Dave.
-- 
Dave Chinner
david@fromorbit.com


^ permalink raw reply	[flat|nested] 26+ messages in thread

* [PATCH 3/5] superblock: introduce per-sb cache shrinker infrastructure
  2010-05-25  8:53 [PATCH 0/5] Per superblock shrinkers V2 Dave Chinner
@ 2010-05-25  8:53 ` Dave Chinner
  2010-05-26 16:41   ` Nick Piggin
                     ` (2 more replies)
  0 siblings, 3 replies; 26+ messages in thread
From: Dave Chinner @ 2010-05-25  8:53 UTC (permalink / raw)
  To: linux-kernel; +Cc: linux-fsdevel, linux-mm, xfs

From: Dave Chinner <dchinner@redhat.com>

With context based shrinkers, we can implement a per-superblock
shrinker that shrinks the caches attached to the superblock. We
currently have global shrinkers for the inode and dentry caches that
split up into per-superblock operations via a coarse proportioning
method that does not batch very well.  The global shrinkers also
have a dependency - dentries pin inodes - so we have to be very
careful about how we register the global shrinkers so that the
implicit call order is always correct.

With a per-sb shrinker callout, we can encode this dependency
directly into the per-sb shrinker, hence avoiding the need for
strictly ordering shrinker registrations. We also have no need for
any proportioning code because the shrinker subsystem already provides
this functionality across all shrinkers. Allowing the shrinker to
operate on a single superblock at a time means that we do fewer
superblock list traversals and less locking, and reclaim should batch more
effectively. This should result in less CPU overhead for reclaim and
potentially faster reclaim of items from each filesystem.

Signed-off-by: Dave Chinner <dchinner@redhat.com>
---
 fs/dcache.c        |  133 ++++++++--------------------------------------------
 fs/inode.c         |  109 +++---------------------------------------
 fs/super.c         |   53 +++++++++++++++++++++
 include/linux/fs.h |    7 +++
 4 files changed, 88 insertions(+), 214 deletions(-)

diff --git a/fs/dcache.c b/fs/dcache.c
index dba6b6d..d7bd781 100644
--- a/fs/dcache.c
+++ b/fs/dcache.c
@@ -456,21 +456,16 @@ static void prune_one_dentry(struct dentry * dentry)
  * which flags are set. This means we don't need to maintain multiple
  * similar copies of this loop.
  */
-static void __shrink_dcache_sb(struct super_block *sb, int *count, int flags)
+static void __shrink_dcache_sb(struct super_block *sb, int count, int flags)
 {
 	LIST_HEAD(referenced);
 	LIST_HEAD(tmp);
 	struct dentry *dentry;
-	int cnt = 0;
 
 	BUG_ON(!sb);
-	BUG_ON((flags & DCACHE_REFERENCED) && count == NULL);
+	BUG_ON((flags & DCACHE_REFERENCED) && count == -1);
 	spin_lock(&dcache_lock);
-	if (count != NULL)
-		/* called from prune_dcache() and shrink_dcache_parent() */
-		cnt = *count;
-restart:
-	if (count == NULL)
+	if (count == -1)
 		list_splice_init(&sb->s_dentry_lru, &tmp);
 	else {
 		while (!list_empty(&sb->s_dentry_lru)) {
@@ -492,13 +487,13 @@ restart:
 			} else {
 				list_move_tail(&dentry->d_lru, &tmp);
 				spin_unlock(&dentry->d_lock);
-				cnt--;
-				if (!cnt)
+				if (--count == 0)
 					break;
 			}
 			cond_resched_lock(&dcache_lock);
 		}
 	}
+prune_more:
 	while (!list_empty(&tmp)) {
 		dentry = list_entry(tmp.prev, struct dentry, d_lru);
 		dentry_lru_del_init(dentry);
@@ -516,88 +511,29 @@ restart:
 		/* dentry->d_lock was dropped in prune_one_dentry() */
 		cond_resched_lock(&dcache_lock);
 	}
-	if (count == NULL && !list_empty(&sb->s_dentry_lru))
-		goto restart;
-	if (count != NULL)
-		*count = cnt;
+	if (count == -1 && !list_empty(&sb->s_dentry_lru)) {
+		list_splice_init(&sb->s_dentry_lru, &tmp);
+		goto prune_more;
+	}
 	if (!list_empty(&referenced))
 		list_splice(&referenced, &sb->s_dentry_lru);
 	spin_unlock(&dcache_lock);
 }
 
 /**
- * prune_dcache - shrink the dcache
- * @count: number of entries to try to free
+ * prune_dcache_sb - shrink the dcache
+ * @nr_to_scan: number of entries to try to free
  *
- * Shrink the dcache. This is done when we need more memory, or simply when we
- * need to unmount something (at which point we need to unuse all dentries).
+ * Attempt to shrink the superblock dcache LRU by @nr_to_scan entries. This is
+ * done when we need more memory and is called from the superblock shrinker
+ * function.
  *
- * This function may fail to free any resources if all the dentries are in use.
+ * This function may fail to free any resources if all the dentries are in
+ * use.
  */
-static void prune_dcache(int count)
+void prune_dcache_sb(struct super_block *sb, int nr_to_scan)
 {
-	struct super_block *sb, *n;
-	int w_count;
-	int unused = dentry_stat.nr_unused;
-	int prune_ratio;
-	int pruned;
-
-	if (unused == 0 || count == 0)
-		return;
-	spin_lock(&dcache_lock);
-	if (count >= unused)
-		prune_ratio = 1;
-	else
-		prune_ratio = unused / count;
-	spin_lock(&sb_lock);
-	list_for_each_entry_safe(sb, n, &super_blocks, s_list) {
-		if (list_empty(&sb->s_instances))
-			continue;
-		if (sb->s_nr_dentry_unused == 0)
-			continue;
-		sb->s_count++;
-		/* Now, we reclaim unused dentrins with fairness.
-		 * We reclaim them same percentage from each superblock.
-		 * We calculate number of dentries to scan on this sb
-		 * as follows, but the implementation is arranged to avoid
-		 * overflows:
-		 * number of dentries to scan on this sb =
-		 * count * (number of dentries on this sb /
-		 * number of dentries in the machine)
-		 */
-		spin_unlock(&sb_lock);
-		if (prune_ratio != 1)
-			w_count = (sb->s_nr_dentry_unused / prune_ratio) + 1;
-		else
-			w_count = sb->s_nr_dentry_unused;
-		pruned = w_count;
-		/*
-		 * We need to be sure this filesystem isn't being unmounted,
-		 * otherwise we could race with generic_shutdown_super(), and
-		 * end up holding a reference to an inode while the filesystem
-		 * is unmounted.  So we try to get s_umount, and make sure
-		 * s_root isn't NULL.
-		 */
-		if (down_read_trylock(&sb->s_umount)) {
-			if ((sb->s_root != NULL) &&
-			    (!list_empty(&sb->s_dentry_lru))) {
-				spin_unlock(&dcache_lock);
-				__shrink_dcache_sb(sb, &w_count,
-						DCACHE_REFERENCED);
-				pruned -= w_count;
-				spin_lock(&dcache_lock);
-			}
-			up_read(&sb->s_umount);
-		}
-		spin_lock(&sb_lock);
-		count -= pruned;
-		__put_super(sb);
-		/* more work left to do? */
-		if (count <= 0)
-			break;
-	}
-	spin_unlock(&sb_lock);
-	spin_unlock(&dcache_lock);
+	__shrink_dcache_sb(sb, nr_to_scan, DCACHE_REFERENCED);
 }
 
 /**
@@ -610,7 +546,7 @@ static void prune_dcache(int count)
  */
 void shrink_dcache_sb(struct super_block * sb)
 {
-	__shrink_dcache_sb(sb, NULL, 0);
+	__shrink_dcache_sb(sb, -1, 0);
 }
 EXPORT_SYMBOL(shrink_dcache_sb);
 
@@ -878,37 +814,10 @@ void shrink_dcache_parent(struct dentry * parent)
 	int found;
 
 	while ((found = select_parent(parent)) != 0)
-		__shrink_dcache_sb(sb, &found, 0);
+		__shrink_dcache_sb(sb, found, 0);
 }
 EXPORT_SYMBOL(shrink_dcache_parent);
 
-/*
- * Scan `nr' dentries and return the number which remain.
- *
- * We need to avoid reentering the filesystem if the caller is performing a
- * GFP_NOFS allocation attempt.  One example deadlock is:
- *
- * ext2_new_block->getblk->GFP->shrink_dcache_memory->prune_dcache->
- * prune_one_dentry->dput->dentry_iput->iput->inode->i_sb->s_op->put_inode->
- * ext2_discard_prealloc->ext2_free_blocks->lock_super->DEADLOCK.
- *
- * In this case we return -1 to tell the caller that we baled.
- */
-static int shrink_dcache_memory(struct shrinker *shrink, int nr, gfp_t gfp_mask)
-{
-	if (nr) {
-		if (!(gfp_mask & __GFP_FS))
-			return -1;
-		prune_dcache(nr);
-	}
-	return (dentry_stat.nr_unused / 100) * sysctl_vfs_cache_pressure;
-}
-
-static struct shrinker dcache_shrinker = {
-	.shrink = shrink_dcache_memory,
-	.seeks = DEFAULT_SEEKS,
-};
-
 /**
  * d_alloc	-	allocate a dcache entry
  * @parent: parent of entry to allocate
@@ -2316,8 +2225,6 @@ static void __init dcache_init(void)
 	 */
 	dentry_cache = KMEM_CACHE(dentry,
 		SLAB_RECLAIM_ACCOUNT|SLAB_PANIC|SLAB_MEM_SPREAD);
-	
-	register_shrinker(&dcache_shrinker);
 
 	/* Hash may have been set up in dcache_init_early */
 	if (!hashdist)
diff --git a/fs/inode.c b/fs/inode.c
index 1e44ec5..5fb4a39 100644
--- a/fs/inode.c
+++ b/fs/inode.c
@@ -25,7 +25,6 @@
 #include <linux/mount.h>
 #include <linux/async.h>
 #include <linux/posix_acl.h>
-#include "internal.h"
 
 /*
  * This is needed for the following functions:
@@ -441,8 +440,10 @@ static int can_unuse(struct inode *inode)
 }
 
 /*
- * Scan `goal' inodes on the unused list for freeable ones. They are moved to
- * a temporary list and then are freed outside inode_lock by dispose_list().
+ * Walk the superblock inode LRU for freeable inodes and attempt to free them.
+ * This is called from the superblock shrinker function with a number of inodes
+ * to trim from the LRU. Inodes to be freed are moved to a temporary list and
+ * then are freed outside inode_lock by dispose_list().
  *
  * Any inodes which are pinned purely because of attached pagecache have their
  * pagecache removed.  We expect the final iput() on that inode to add it to
@@ -450,10 +451,10 @@ static int can_unuse(struct inode *inode)
  * inode is still freeable, proceed.  The right inode is found 99.9% of the
  * time in testing on a 4-way.
  *
- * If the inode has metadata buffers attached to mapping->private_list then
- * try to remove them.
+ * If the inode has metadata buffers attached to mapping->private_list then try
+ * to remove them.
  */
-static void shrink_icache_sb(struct super_block *sb, int *nr_to_scan)
+void prune_icache_sb(struct super_block *sb, int nr_to_scan)
 {
 	LIST_HEAD(freeable);
 	int nr_pruned = 0;
@@ -461,7 +462,7 @@ static void shrink_icache_sb(struct super_block *sb, int *nr_to_scan)
 	unsigned long reap = 0;
 
 	spin_lock(&inode_lock);
-	for (nr_scanned = *nr_to_scan; nr_scanned >= 0; nr_scanned--) {
+	for (nr_scanned = nr_to_scan; nr_scanned >= 0; nr_scanned--) {
 		struct inode *inode;
 
 		if (list_empty(&sb->s_inode_lru))
@@ -500,103 +501,10 @@ static void shrink_icache_sb(struct super_block *sb, int *nr_to_scan)
 	else
 		__count_vm_events(PGINODESTEAL, reap);
 	spin_unlock(&inode_lock);
-	*nr_to_scan = nr_scanned;
 
 	dispose_list(&freeable);
 }
 
-static void prune_icache(int count)
-{
-	struct super_block *sb, *n;
-	int w_count;
-	int unused = inodes_stat.nr_unused;
-	int prune_ratio;
-	int pruned;
-
-	if (unused == 0 || count == 0)
-		return;
-	down_read(&iprune_sem);
-	if (count >= unused)
-		prune_ratio = 1;
-	else
-		prune_ratio = unused / count;
-	spin_lock(&sb_lock);
-	list_for_each_entry_safe(sb, n, &super_blocks, s_list) {
-		if (list_empty(&sb->s_instances))
-			continue;
-		if (sb->s_nr_inodes_unused == 0)
-			continue;
-		sb->s_count++;
-		/* Now, we reclaim unused dentrins with fairness.
-		 * We reclaim them same percentage from each superblock.
-		 * We calculate number of dentries to scan on this sb
-		 * as follows, but the implementation is arranged to avoid
-		 * overflows:
-		 * number of dentries to scan on this sb =
-		 * count * (number of dentries on this sb /
-		 * number of dentries in the machine)
-		 */
-		spin_unlock(&sb_lock);
-		if (prune_ratio != 1)
-			w_count = (sb->s_nr_inodes_unused / prune_ratio) + 1;
-		else
-			w_count = sb->s_nr_inodes_unused;
-		pruned = w_count;
-		/*
-		 * We need to be sure this filesystem isn't being unmounted,
-		 * otherwise we could race with generic_shutdown_super(), and
-		 * end up holding a reference to an inode while the filesystem
-		 * is unmounted.  So we try to get s_umount, and make sure
-		 * s_root isn't NULL.
-		 */
-		if (down_read_trylock(&sb->s_umount)) {
-			if ((sb->s_root != NULL) &&
-			    (!list_empty(&sb->s_inode_lru))) {
-				shrink_icache_sb(sb, &w_count);
-				pruned -= w_count;
-			}
-			up_read(&sb->s_umount);
-		}
-		spin_lock(&sb_lock);
-		count -= pruned;
-		__put_super(sb);
-		/* more work left to do? */
-		if (count <= 0)
-			break;
-	}
-	spin_unlock(&sb_lock);
-	up_read(&iprune_sem);
-}
-
-/*
- * shrink_icache_memory() will attempt to reclaim some unused inodes.  Here,
- * "unused" means that no dentries are referring to the inodes: the files are
- * not open and the dcache references to those inodes have already been
- * reclaimed.
- *
- * This function is passed the number of inodes to scan, and it returns the
- * total number of remaining possibly-reclaimable inodes.
- */
-static int shrink_icache_memory(struct shrinker *shrink, int nr, gfp_t gfp_mask)
-{
-	if (nr) {
-		/*
-		 * Nasty deadlock avoidance.  We may hold various FS locks,
-		 * and we don't want to recurse into the FS that called us
-		 * in clear_inode() and friends..
-		 */
-		if (!(gfp_mask & __GFP_FS))
-			return -1;
-		prune_icache(nr);
-	}
-	return (inodes_stat.nr_unused / 100) * sysctl_vfs_cache_pressure;
-}
-
-static struct shrinker icache_shrinker = {
-	.shrink = shrink_icache_memory,
-	.seeks = DEFAULT_SEEKS,
-};
-
 static void __wait_on_freeing_inode(struct inode *inode);
 /*
  * Called with the inode lock held.
@@ -1634,7 +1542,6 @@ void __init inode_init(void)
 					 (SLAB_RECLAIM_ACCOUNT|SLAB_PANIC|
 					 SLAB_MEM_SPREAD),
 					 init_once);
-	register_shrinker(&icache_shrinker);
 
 	/* Hash may have been set up in inode_init_early */
 	if (!hashdist)
diff --git a/fs/super.c b/fs/super.c
index c554c53..07e22e3 100644
--- a/fs/super.c
+++ b/fs/super.c
@@ -37,6 +37,50 @@
 LIST_HEAD(super_blocks);
 DEFINE_SPINLOCK(sb_lock);
 
+static int prune_super(struct shrinker *shrink, int nr_to_scan, gfp_t gfp_mask)
+{
+	struct super_block *sb;
+	int count;
+
+	sb = container_of(shrink, struct super_block, s_shrink);
+
+	/*
+	 * Deadlock avoidance.  We may hold various FS locks, and we don't want
+	 * to recurse into the FS that called us in clear_inode() and friends..
+	 */
+	if (!(gfp_mask & __GFP_FS))
+		return -1;
+
+	/*
+	 * if we can't get the umount lock, then there's no point having the
+	 * shrinker try again because the sb is being torn down.
+	 */
+	if (!down_read_trylock(&sb->s_umount))
+		return -1;
+
+	if (!sb->s_root) {
+		up_read(&sb->s_umount);
+		return -1;
+	}
+
+	if (nr_to_scan) {
+		/* proportion the scan between the two caches */
+		int total;
+
+		total = sb->s_nr_dentry_unused + sb->s_nr_inodes_unused + 1;
+		count = (nr_to_scan * sb->s_nr_dentry_unused) / total;
+
+		/* prune dcache first as icache is pinned by it */
+		prune_dcache_sb(sb, count);
+		prune_icache_sb(sb, nr_to_scan - count);
+	}
+
+	count = ((sb->s_nr_dentry_unused + sb->s_nr_inodes_unused) / 100)
+						* sysctl_vfs_cache_pressure;
+	up_read(&sb->s_umount);
+	return count;
+}
+
 /**
  *	alloc_super	-	create new superblock
  *	@type:	filesystem type superblock should belong to
@@ -99,6 +143,13 @@ static struct super_block *alloc_super(struct file_system_type *type)
 		s->s_qcop = sb_quotactl_ops;
 		s->s_op = &default_op;
 		s->s_time_gran = 1000000000;
+
+		/*
+		 * The shrinker is set up here but not registered until after
+		 * the superblock has been filled out successfully.
+		 */
+		s->s_shrink.shrink = prune_super;
+		s->s_shrink.seeks = DEFAULT_SEEKS;
 	}
 out:
 	return s;
@@ -162,6 +213,7 @@ void deactivate_locked_super(struct super_block *s)
 	struct file_system_type *fs = s->s_type;
 	if (atomic_dec_and_test(&s->s_active)) {
 		vfs_dq_off(s, 0);
+		unregister_shrinker(&s->s_shrink);
 		fs->kill_sb(s);
 		put_filesystem(fs);
 		put_super(s);
@@ -335,6 +387,7 @@ retry:
 	list_add_tail(&s->s_list, &super_blocks);
 	list_add(&s->s_instances, &type->fs_supers);
 	spin_unlock(&sb_lock);
+	register_shrinker(&s->s_shrink);
 	get_filesystem(type);
 	return s;
 }
diff --git a/include/linux/fs.h b/include/linux/fs.h
index 7b90c43..5bff2dc 100644
--- a/include/linux/fs.h
+++ b/include/linux/fs.h
@@ -382,6 +382,7 @@ struct inodes_stat_t {
 #include <linux/capability.h>
 #include <linux/semaphore.h>
 #include <linux/fiemap.h>
+#include <linux/mm.h>
 
 #include <asm/atomic.h>
 #include <asm/byteorder.h>
@@ -1385,8 +1386,14 @@ struct super_block {
 	 * generic_show_options()
 	 */
 	char *s_options;
+
+	struct shrinker s_shrink;	/* per-sb shrinker handle */
 };
 
+/* superblock cache pruning functions */
+void prune_icache_sb(struct super_block *sb, int nr_to_scan);
+void prune_dcache_sb(struct super_block *sb, int nr_to_scan);
+
 extern struct timespec current_fs_time(struct super_block *sb);
 
 /*
-- 
1.5.6.5


^ permalink raw reply related	[flat|nested] 26+ messages in thread

* Re: [PATCH 3/5] superblock: introduce per-sb cache shrinker infrastructure
  2010-05-25  8:53 ` [PATCH 3/5] superblock: introduce per-sb cache shrinker infrastructure Dave Chinner
@ 2010-05-26 16:41   ` Nick Piggin
  2010-05-26 23:12     ` Dave Chinner
  2010-05-27  6:35   ` Nick Piggin
  2010-05-27 20:32   ` Andrew Morton
  2 siblings, 1 reply; 26+ messages in thread
From: Nick Piggin @ 2010-05-26 16:41 UTC (permalink / raw)
  To: Dave Chinner; +Cc: linux-kernel, linux-fsdevel, linux-mm, xfs

On Tue, May 25, 2010 at 06:53:06PM +1000, Dave Chinner wrote:
> @@ -456,21 +456,16 @@ static void prune_one_dentry(struct dentry * dentry)
>   * which flags are set. This means we don't need to maintain multiple
>   * similar copies of this loop.
>   */
> -static void __shrink_dcache_sb(struct super_block *sb, int *count, int flags)
> +static void __shrink_dcache_sb(struct super_block *sb, int count, int flags)
>  {
>  	LIST_HEAD(referenced);
>  	LIST_HEAD(tmp);
>  	struct dentry *dentry;
> -	int cnt = 0;
>  
>  	BUG_ON(!sb);
> -	BUG_ON((flags & DCACHE_REFERENCED) && count == NULL);
> +	BUG_ON((flags & DCACHE_REFERENCED) && count == -1);
>  	spin_lock(&dcache_lock);
> -	if (count != NULL)
> -		/* called from prune_dcache() and shrink_dcache_parent() */
> -		cnt = *count;
> -restart:
> -	if (count == NULL)
> +	if (count == -1)
>  		list_splice_init(&sb->s_dentry_lru, &tmp);
>  	else {
>  		while (!list_empty(&sb->s_dentry_lru)) {
> @@ -492,13 +487,13 @@ restart:
>  			} else {
>  				list_move_tail(&dentry->d_lru, &tmp);
>  				spin_unlock(&dentry->d_lock);
> -				cnt--;
> -				if (!cnt)
> +				if (--count == 0)
>  					break;
>  			}
>  			cond_resched_lock(&dcache_lock);
>  		}
>  	}
> +prune_more:
>  	while (!list_empty(&tmp)) {
>  		dentry = list_entry(tmp.prev, struct dentry, d_lru);
>  		dentry_lru_del_init(dentry);
> @@ -516,88 +511,29 @@ restart:
>  		/* dentry->d_lock was dropped in prune_one_dentry() */
>  		cond_resched_lock(&dcache_lock);
>  	}
> -	if (count == NULL && !list_empty(&sb->s_dentry_lru))
> -		goto restart;
> -	if (count != NULL)
> -		*count = cnt;
> +	if (count == -1 && !list_empty(&sb->s_dentry_lru)) {
> +		list_splice_init(&sb->s_dentry_lru, &tmp);
> +		goto prune_more;
> +	}

Nitpick, but I prefer just the restart label where it was previously. This
is moving setup for the next iteration into the "error" case.


> +static int prune_super(struct shrinker *shrink, int nr_to_scan, gfp_t gfp_mask)
> +{
> +	struct super_block *sb;
> +	int count;
> +
> +	sb = container_of(shrink, struct super_block, s_shrink);
> +
> +	/*
> +	 * Deadlock avoidance.  We may hold various FS locks, and we don't want
> +	 * to recurse into the FS that called us in clear_inode() and friends..
> +	 */
> +	if (!(gfp_mask & __GFP_FS))
> +		return -1;
> +
> +	/*
> +	 * if we can't get the umount lock, then there's no point having the
> +	 * shrinker try again because the sb is being torn down.
> +	 */
> +	if (!down_read_trylock(&sb->s_umount))
> +		return -1;

Would you just elaborate on the lock order problem somewhere? (the
comment makes it look like we *could* take the mutex if we wanted
to).


> +
> +	if (!sb->s_root) {
> +		up_read(&sb->s_umount);
> +		return -1;
> +	}
> +
> +	if (nr_to_scan) {
> +		/* proportion the scan between the two cacheѕ */
> +		int total;
> +
> +		total = sb->s_nr_dentry_unused + sb->s_nr_inodes_unused + 1;
> +		count = (nr_to_scan * sb->s_nr_dentry_unused) / total;
> +
> +		/* prune dcache first as icache is pinned by it */
> +		prune_dcache_sb(sb, count);
> +		prune_icache_sb(sb, nr_to_scan - count);
> +	}
> +
> +	count = ((sb->s_nr_dentry_unused + sb->s_nr_inodes_unused) / 100)
> +						* sysctl_vfs_cache_pressure;

Do you think truncating in the divisions is at all a problem? It
probably doesn't matter much I suppose.
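
(To put a number on it: with the default vfs_cache_pressure of 100, a
sb with, say, 60 unused dentries and 39 unused inodes computes
(60 + 39) / 100 = 0, so the shrinker reports nothing reclaimable for
that sb until the combined count reaches 100.)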

> @@ -162,6 +213,7 @@ void deactivate_locked_super(struct super_block *s)
>  	struct file_system_type *fs = s->s_type;
>  	if (atomic_dec_and_test(&s->s_active)) {
>  		vfs_dq_off(s, 0);
> +		unregister_shrinker(&s->s_shrink);
>  		fs->kill_sb(s);
>  		put_filesystem(fs);
>  		put_super(s);
> @@ -335,6 +387,7 @@ retry:
>  	list_add_tail(&s->s_list, &super_blocks);
>  	list_add(&s->s_instances, &type->fs_supers);
>  	spin_unlock(&sb_lock);
> +	register_shrinker(&s->s_shrink);
>  	get_filesystem(type);
>  	return s;
>  }
> diff --git a/include/linux/fs.h b/include/linux/fs.h
> index 7b90c43..5bff2dc 100644
> --- a/include/linux/fs.h
> +++ b/include/linux/fs.h
> @@ -382,6 +382,7 @@ struct inodes_stat_t {
>  #include <linux/capability.h>
>  #include <linux/semaphore.h>
>  #include <linux/fiemap.h>
> +#include <linux/mm.h>
>  
>  #include <asm/atomic.h>
>  #include <asm/byteorder.h>
> @@ -1385,8 +1386,14 @@ struct super_block {
>  	 * generic_show_options()
>  	 */
>  	char *s_options;
> +
> +	struct shrinker s_shrink;	/* per-sb shrinker handle */
>  };


^ permalink raw reply	[flat|nested] 26+ messages in thread

* Re: [PATCH 3/5] superblock: introduce per-sb cache shrinker infrastructure
  2010-05-26 16:41   ` Nick Piggin
@ 2010-05-26 23:12     ` Dave Chinner
  2010-05-27  2:19       ` Nick Piggin
  0 siblings, 1 reply; 26+ messages in thread
From: Dave Chinner @ 2010-05-26 23:12 UTC (permalink / raw)
  To: Nick Piggin; +Cc: linux-kernel, linux-fsdevel, linux-mm, xfs

On Thu, May 27, 2010 at 02:41:16AM +1000, Nick Piggin wrote:
> On Tue, May 25, 2010 at 06:53:06PM +1000, Dave Chinner wrote:
> > @@ -456,21 +456,16 @@ static void prune_one_dentry(struct dentry * dentry)
> >   * which flags are set. This means we don't need to maintain multiple
> >   * similar copies of this loop.
> >   */
> > -static void __shrink_dcache_sb(struct super_block *sb, int *count, int flags)
> > +static void __shrink_dcache_sb(struct super_block *sb, int count, int flags)
> >  {
> >  	LIST_HEAD(referenced);
> >  	LIST_HEAD(tmp);
> >  	struct dentry *dentry;
> > -	int cnt = 0;
> >  
> >  	BUG_ON(!sb);
> > -	BUG_ON((flags & DCACHE_REFERENCED) && count == NULL);
> > +	BUG_ON((flags & DCACHE_REFERENCED) && count == -1);
> >  	spin_lock(&dcache_lock);
> > -	if (count != NULL)
> > -		/* called from prune_dcache() and shrink_dcache_parent() */
> > -		cnt = *count;
> > -restart:
> > -	if (count == NULL)
> > +	if (count == -1)
> >  		list_splice_init(&sb->s_dentry_lru, &tmp);
> >  	else {
> >  		while (!list_empty(&sb->s_dentry_lru)) {
> > @@ -492,13 +487,13 @@ restart:
> >  			} else {
> >  				list_move_tail(&dentry->d_lru, &tmp);
> >  				spin_unlock(&dentry->d_lock);
> > -				cnt--;
> > -				if (!cnt)
> > +				if (--count == 0)
> >  					break;
> >  			}
> >  			cond_resched_lock(&dcache_lock);
> >  		}
> >  	}
> > +prune_more:
> >  	while (!list_empty(&tmp)) {
> >  		dentry = list_entry(tmp.prev, struct dentry, d_lru);
> >  		dentry_lru_del_init(dentry);
> > @@ -516,88 +511,29 @@ restart:
> >  		/* dentry->d_lock was dropped in prune_one_dentry() */
> >  		cond_resched_lock(&dcache_lock);
> >  	}
> > -	if (count == NULL && !list_empty(&sb->s_dentry_lru))
> > -		goto restart;
> > -	if (count != NULL)
> > -		*count = cnt;
> > +	if (count == -1 && !list_empty(&sb->s_dentry_lru)) {
> > +		list_splice_init(&sb->s_dentry_lru, &tmp);
> > +		goto prune_more;
> > +	}
> 
> Nitpick but I prefer just the restart label wher it is previously. This
> is moving setup for the next iteration into the "error" case.

Ok, will fix.

> > +static int prune_super(struct shrinker *shrink, int nr_to_scan, gfp_t gfp_mask)
> > +{
> > +	struct super_block *sb;
> > +	int count;
> > +
> > +	sb = container_of(shrink, struct super_block, s_shrink);
> > +
> > +	/*
> > +	 * Deadlock avoidance.  We may hold various FS locks, and we don't want
> > +	 * to recurse into the FS that called us in clear_inode() and friends..
> > +	 */
> > +	if (!(gfp_mask & __GFP_FS))
> > +		return -1;
> > +
> > +	/*
> > +	 * if we can't get the umount lock, then there's no point having the
> > +	 * shrinker try again because the sb is being torn down.
> > +	 */
> > +	if (!down_read_trylock(&sb->s_umount))
> > +		return -1;
> 
> Would you just elaborate on the lock order problem somewhere? (the
> comment makes it look like we *could* take the mutex if we wanted
> to).

The shrinker is unregistered in deactivate_locked_super() which is
just before ->kill_sb is called. The sb->s_umount lock is held at
this point. Hence, if the shrinker is operating, we will deadlock if
we try to lock it like this:

	unmount:			shrinker:
					down_read(&shrinker_lock);
	down_write(&sb->s_umount)
	unregister_shrinker()
	down_write(&shrinker_lock)
					prune_super()
					  down_read(&sb->s_umount);
					  (deadlock)

Hence, if we can't get the sb->s_umount lock in prune_super(), then
the superblock must be being unmounted and the shrinker should abort
as the ->kill_sb method will clean up everything after the shrinker
is unregistered. Hence the down_read_trylock().


> > +	if (!sb->s_root) {
> > +		up_read(&sb->s_umount);
> > +		return -1;
> > +	}
> > +
> > +	if (nr_to_scan) {
> > +		/* proportion the scan between the two cacheѕ */
> > +		int total;
> > +
> > +		total = sb->s_nr_dentry_unused + sb->s_nr_inodes_unused + 1;
> > +		count = (nr_to_scan * sb->s_nr_dentry_unused) / total;
> > +
> > +		/* prune dcache first as icache is pinned by it */
> > +		prune_dcache_sb(sb, count);
> > +		prune_icache_sb(sb, nr_to_scan - count);
> > +	}
> > +
> > +	count = ((sb->s_nr_dentry_unused + sb->s_nr_inodes_unused) / 100)
> > +						* sysctl_vfs_cache_pressure;
> 
> Do you think truncating in the divisions is at all a problem? It
> probably doesn't matter much I suppose.

Same code as currently exists. IIRC, the reasoning is that if we've
got less than 100 objects to reclaim, then we're unlikely to be able
to free up any memory from the caches, anyway.

Cheers,

Dave.
-- 
Dave Chinner
david@fromorbit.com


^ permalink raw reply	[flat|nested] 26+ messages in thread

* Re: [PATCH 3/5] superblock: introduce per-sb cache shrinker infrastructure
  2010-05-26 23:12     ` Dave Chinner
@ 2010-05-27  2:19       ` Nick Piggin
  2010-05-27  4:07         ` Dave Chinner
  0 siblings, 1 reply; 26+ messages in thread
From: Nick Piggin @ 2010-05-27  2:19 UTC (permalink / raw)
  To: Dave Chinner; +Cc: linux-kernel, linux-fsdevel, linux-mm, xfs

On Thu, May 27, 2010 at 09:12:14AM +1000, Dave Chinner wrote:
> On Thu, May 27, 2010 at 02:41:16AM +1000, Nick Piggin wrote:
> > On Tue, May 25, 2010 at 06:53:06PM +1000, Dave Chinner wrote:
> > > @@ -456,21 +456,16 @@ static void prune_one_dentry(struct dentry * dentry)
> > > +	/*
> > > +	 * if we can't get the umount lock, then there's no point having the
> > > +	 * shrinker try again because the sb is being torn down.
> > > +	 */
> > > +	if (!down_read_trylock(&sb->s_umount))
> > > +		return -1;
> > 
> > Would you just elaborate on the lock order problem somewhere? (the
> > comment makes it look like we *could* take the mutex if we wanted
> > to).
> 
> The shrinker is unregistered in deactivate_locked_super() which is
> just before ->kill_sb is called. The sb->s_umount lock is held at
> this point. hence is the shrinker is operating, we will deadlock if
> we try to lock it like this:
> 
> 	unmount:			shrinker:
> 					down_read(&shrinker_lock);
> 	down_write(&sb->s_umount)
> 	unregister_shrinker()
> 	down_write(&shrinker_lock)
> 					prune_super()
> 					  down_read(&sb->s_umount);
> 					  (deadlock)
> 
> hence if we can't get the sb->s_umount lock in prune_super(), then
> the superblock must be being unmounted and the shrinker should abort
> as the ->kill_sb method will clean up everything after the shrinker
> is unregistered. Hence the down_read_trylock().

You added it to the comment in your updated patch; that was the main
thing I wanted. Thanks.


> > > +	if (!sb->s_root) {
> > > +		up_read(&sb->s_umount);
> > > +		return -1;
> > > +	}
> > > +
> > > +	if (nr_to_scan) {
> > > +		/* proportion the scan between the two cacheѕ */
> > > +		int total;
> > > +
> > > +		total = sb->s_nr_dentry_unused + sb->s_nr_inodes_unused + 1;
> > > +		count = (nr_to_scan * sb->s_nr_dentry_unused) / total;
> > > +
> > > +		/* prune dcache first as icache is pinned by it */
> > > +		prune_dcache_sb(sb, count);
> > > +		prune_icache_sb(sb, nr_to_scan - count);
> > > +	}
> > > +
> > > +	count = ((sb->s_nr_dentry_unused + sb->s_nr_inodes_unused) / 100)
> > > +						* sysctl_vfs_cache_pressure;
> > 
> > Do you think truncating in the divisions is at all a problem? It
> > probably doesn't matter much I suppose.
> 
> Same code as currently exists. IIRC, the reasoning is that if we've
> got less that 100 objects to reclaim, then we're unlikely to be able
> to free up any memory from the caches, anyway.

Yeah, which is why I stop short of saying you should change it in
this patch.

But I think we should ensure things can get reclaimed eventually.
100 objects could be 100 slabs, which could be anything from
half a meg to half a dozen megs. Multiplied by each of the caches.
Could be significant in small systems.


^ permalink raw reply	[flat|nested] 26+ messages in thread

* Re: [PATCH 3/5] superblock: introduce per-sb cache shrinker infrastructure
  2010-05-27  2:19       ` Nick Piggin
@ 2010-05-27  4:07         ` Dave Chinner
  2010-05-27  4:24           ` Nick Piggin
  0 siblings, 1 reply; 26+ messages in thread
From: Dave Chinner @ 2010-05-27  4:07 UTC (permalink / raw)
  To: Nick Piggin; +Cc: linux-kernel, linux-fsdevel, linux-mm, xfs

On Thu, May 27, 2010 at 12:19:05PM +1000, Nick Piggin wrote:
> On Thu, May 27, 2010 at 09:12:14AM +1000, Dave Chinner wrote:
> > On Thu, May 27, 2010 at 02:41:16AM +1000, Nick Piggin wrote:
> > > > +	count = ((sb->s_nr_dentry_unused + sb->s_nr_inodes_unused) / 100)
> > > > +						* sysctl_vfs_cache_pressure;
> > > 
> > > Do you think truncating in the divisions is at all a problem? It
> > > probably doesn't matter much I suppose.
> > 
> > Same code as currently exists. IIRC, the reasoning is that if we've
> > got less that 100 objects to reclaim, then we're unlikely to be able
> > to free up any memory from the caches, anyway.
> 
> Yeah, which is why I stop short of saying you should change it in
> this patch.
> 
> But I think we should ensure things can get reclaimed eventually.
> 100 objects could be 100 slabs, which could be anything from
> half a meg to half a dozen. Multiplied by each of the caches.
> Could be significant in small systems.

True, but usually there are busy objects in the dentry and inode
slabs, so it shouldn't be a significant issue. We can probably
address such problems if they can be demonstrated to be an issue in
a separate patch set....

Cheers,

Dave.
-- 
Dave Chinner
david@fromorbit.com


^ permalink raw reply	[flat|nested] 26+ messages in thread

* Re: [PATCH 3/5] superblock: introduce per-sb cache shrinker infrastructure
  2010-05-27  4:07         ` Dave Chinner
@ 2010-05-27  4:24           ` Nick Piggin
  0 siblings, 0 replies; 26+ messages in thread
From: Nick Piggin @ 2010-05-27  4:24 UTC (permalink / raw)
  To: Dave Chinner; +Cc: linux-kernel, linux-fsdevel, linux-mm, xfs

On Thu, May 27, 2010 at 02:07:04PM +1000, Dave Chinner wrote:
> On Thu, May 27, 2010 at 12:19:05PM +1000, Nick Piggin wrote:
> > On Thu, May 27, 2010 at 09:12:14AM +1000, Dave Chinner wrote:
> > > On Thu, May 27, 2010 at 02:41:16AM +1000, Nick Piggin wrote:
> > > > > +	count = ((sb->s_nr_dentry_unused + sb->s_nr_inodes_unused) / 100)
> > > > > +						* sysctl_vfs_cache_pressure;
> > > > 
> > > > Do you think truncating in the divisions is at all a problem? It
> > > > probably doesn't matter much I suppose.
> > > 
> > > Same code as currently exists. IIRC, the reasoning is that if we've
> > > got less that 100 objects to reclaim, then we're unlikely to be able
> > > to free up any memory from the caches, anyway.
> > 
> > Yeah, which is why I stop short of saying you should change it in
> > this patch.
> > 
> > But I think we should ensure things can get reclaimed eventually.
> > 100 objects could be 100 slabs, which could be anything from
> > half a meg to half a dozen. Multiplied by each of the caches.
> > Could be significant in small systems.
> 
> True, but usually there are busy objects in the dentry and inode
> slabs, so it shouldn't be a significant issue. We can probably
> address such problems if they can be demonstrated to be an issue in
> a separate patch set....

I didn't want to say it is a problem with your patchset, I just
thought of it when reviewing.


^ permalink raw reply	[flat|nested] 26+ messages in thread

* Re: [PATCH 3/5] superblock: introduce per-sb cache shrinker infrastructure
  2010-05-25  8:53 ` [PATCH 3/5] superblock: introduce per-sb cache shrinker infrastructure Dave Chinner
  2010-05-26 16:41   ` Nick Piggin
@ 2010-05-27  6:35   ` Nick Piggin
  2010-05-27 22:40     ` Dave Chinner
  2010-05-27 20:32   ` Andrew Morton
  2 siblings, 1 reply; 26+ messages in thread
From: Nick Piggin @ 2010-05-27  6:35 UTC (permalink / raw)
  To: Dave Chinner; +Cc: linux-kernel, linux-fsdevel, linux-mm, xfs

On Tue, May 25, 2010 at 06:53:06PM +1000, Dave Chinner wrote:
> --- a/fs/super.c
> +++ b/fs/super.c
> @@ -37,6 +37,50 @@
>  LIST_HEAD(super_blocks);
>  DEFINE_SPINLOCK(sb_lock);
>  
> +static int prune_super(struct shrinker *shrink, int nr_to_scan, gfp_t gfp_mask)
> +{
> +	struct super_block *sb;
> +	int count;
> +
> +	sb = container_of(shrink, struct super_block, s_shrink);
> +
> +	/*
> +	 * Deadlock avoidance.  We may hold various FS locks, and we don't want
> +	 * to recurse into the FS that called us in clear_inode() and friends..
> +	 */
> +	if (!(gfp_mask & __GFP_FS))
> +		return -1;
> +
> +	/*
> +	 * if we can't get the umount lock, then there's no point having the
> +	 * shrinker try again because the sb is being torn down.
> +	 */
> +	if (!down_read_trylock(&sb->s_umount))
> +		return -1;
> +
> +	if (!sb->s_root) {
> +		up_read(&sb->s_umount);
> +		return -1;
> +	}
> +
> +	if (nr_to_scan) {
> +		/* proportion the scan between the two cacheѕ */
> +		int total;
> +
> +		total = sb->s_nr_dentry_unused + sb->s_nr_inodes_unused + 1;
> +		count = (nr_to_scan * sb->s_nr_dentry_unused) / total;
> +
> +		/* prune dcache first as icache is pinned by it */
> +		prune_dcache_sb(sb, count);
> +		prune_icache_sb(sb, nr_to_scan - count);

Hmm, an interesting dynamic that you've changed is that previously
we'd scan dcache LRU proportionately to pagecache, and then scan
inode LRU in proportion to the current number of unused inodes.

But we can think of inodes that are only in use by unused (and aged)
dentries as effectively unused themselves. So this sequence
underestimates how many inodes to scan. This could bias pressure against
dcache I'd think, especially considering inodes are far larger than
dentries. Maybe require 2 passes to get the inodes unused in the
first pass.

Part of the problem is the funny shrinker API.

The right way to do it is to change the shrinker API so that it passes
down the lru_pages and scanned into the callback. From there, the
shrinkers can calculate the appropriate ratio of objects to scan.
No need for 2-call scheme, no need for shrinker->seeks, and the
ability to calculate an appropriate ratio first for dcache, and *then*
for icache.

A helper of course can do the calculation (considering that every
driver and their dog will do the wrong thing if we let them :)).

unsigned long shrinker_scan(unsigned long lru_pages,
			unsigned long lru_scanned,
			unsigned long nr_objects,
			unsigned long scan_ratio)
{
	unsigned long long tmp = nr_objects;

	tmp *= lru_scanned * 100;
	do_div(tmp, (lru_pages * scan_ratio) + 1);

	return (unsigned long)tmp;
}

Then the shrinker callback will go:
	sb->s_nr_dentry_scan += shrinker_scan(lru_pages, lru_scanned,
				sb->s_nr_dentry_unused,
				vfs_cache_pressure * SEEKS_PER_DENTRY);
	if (sb->s_nr_dentry_scan > SHRINK_BATCH)
		prune_dcache()

	sb->s_nr_inode_scan += shrinker_scan(lru_pages, lru_scanned,
				sb->s_nr_inodes_unused,
				vfs_cache_pressure * SEEKS_PER_INODE);
	...

What do you think of that? Seeing as we're changing the shrinker API
anyway, I'd think it is high time to do something like this.
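
For a rough feel for the numbers, here is a stand-alone user-space
version of that helper with made-up inputs (do_div() swapped for plain
64-bit division; the figures are illustrative, not measured):

#include <stdio.h>

/* user-space copy of the proposed shrinker_scan() helper */
static unsigned long shrinker_scan(unsigned long lru_pages,
			unsigned long lru_scanned,
			unsigned long nr_objects,
			unsigned long scan_ratio)
{
	unsigned long long tmp = nr_objects;

	tmp *= lru_scanned * 100;
	tmp /= (unsigned long long)lru_pages * scan_ratio + 1;

	return (unsigned long)tmp;
}

int main(void)
{
	/*
	 * Hypothetical: 1M pagecache pages on the LRUs, 50k of them
	 * scanned this pass, 200k unused dentries, and scan_ratio =
	 * vfs_cache_pressure (100) * 2 seeks = 200.
	 */
	printf("%lu\n", shrinker_scan(1000000, 50000, 200000, 200));
	/* prints 4999, i.e. scan ~2.5% of the dentries */
	return 0;
}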



^ permalink raw reply	[flat|nested] 26+ messages in thread

* Re: [PATCH 3/5] superblock: introduce per-sb cache shrinker infrastructure
  2010-05-25  8:53 ` [PATCH 3/5] superblock: introduce per-sb cache shrinker infrastructure Dave Chinner
  2010-05-26 16:41   ` Nick Piggin
  2010-05-27  6:35   ` Nick Piggin
@ 2010-05-27 20:32   ` Andrew Morton
  2010-05-27 23:01     ` Dave Chinner
  2 siblings, 1 reply; 26+ messages in thread
From: Andrew Morton @ 2010-05-27 20:32 UTC (permalink / raw)
  To: Dave Chinner; +Cc: linux-kernel, linux-fsdevel, linux-mm, xfs

On Tue, 25 May 2010 18:53:06 +1000
Dave Chinner <david@fromorbit.com> wrote:

> From: Dave Chinner <dchinner@redhat.com>
> 
> With context based shrinkers, we can implement a per-superblock
> shrinker that shrinks the caches attached to the superblock. We
> currently have global shrinkers for the inode and dentry caches that
> split up into per-superblock operations via a coarse proportioning
> method that does not batch very well.  The global shrinkers also
> have a dependency - dentries pin inodes - so we have to be very
> careful about how we register the global shrinkers so that the
> implicit call order is always correct.
> 
> With a per-sb shrinker callout, we can encode this dependency
> directly into the per-sb shrinker, hence avoiding the need for
> strictly ordering shrinker registrations. We also have no need for
> any proportioning code for the shrinker subsystem already provides
> this functionality across all shrinkers. Allowing the shrinker to
> operate on a single superblock at a time means that we do less
> superblock list traversals and locking and reclaim should batch more
> effectively. This should result in less CPU overhead for reclaim and
> potentially faster reclaim of items from each filesystem.
> 

I go all tingly when a changelog contains the word "should".

OK, it _should_ do X.  But _does_ it actually do X?

>  fs/super.c         |   53 +++++++++++++++++++++
>  include/linux/fs.h |    7 +++
>  4 files changed, 88 insertions(+), 214 deletions(-)
> 
> diff --git a/fs/dcache.c b/fs/dcache.c
> index dba6b6d..d7bd781 100644
> --- a/fs/dcache.c
> +++ b/fs/dcache.c
> @@ -456,21 +456,16 @@ static void prune_one_dentry(struct dentry * dentry)
>   * which flags are set. This means we don't need to maintain multiple
>   * similar copies of this loop.
>   */
> -static void __shrink_dcache_sb(struct super_block *sb, int *count, int flags)
> +static void __shrink_dcache_sb(struct super_block *sb, int count, int flags)

Forgot to update the kerneldoc description of `count'.



^ permalink raw reply	[flat|nested] 26+ messages in thread

* Re: [PATCH 3/5] superblock: introduce per-sb cache shrinker infrastructure
  2010-05-27  6:35   ` Nick Piggin
@ 2010-05-27 22:40     ` Dave Chinner
  2010-05-28  5:19       ` Nick Piggin
  0 siblings, 1 reply; 26+ messages in thread
From: Dave Chinner @ 2010-05-27 22:40 UTC (permalink / raw)
  To: Nick Piggin; +Cc: linux-kernel, linux-fsdevel, linux-mm, xfs

On Thu, May 27, 2010 at 04:35:23PM +1000, Nick Piggin wrote:
> On Tue, May 25, 2010 at 06:53:06PM +1000, Dave Chinner wrote:
> > --- a/fs/super.c
> > +++ b/fs/super.c
> > @@ -37,6 +37,50 @@
> >  LIST_HEAD(super_blocks);
> >  DEFINE_SPINLOCK(sb_lock);
> >  
> > +static int prune_super(struct shrinker *shrink, int nr_to_scan, gfp_t gfp_mask)
> > +{
> > +	struct super_block *sb;
> > +	int count;
> > +
> > +	sb = container_of(shrink, struct super_block, s_shrink);
> > +
> > +	/*
> > +	 * Deadlock avoidance.  We may hold various FS locks, and we don't want
> > +	 * to recurse into the FS that called us in clear_inode() and friends..
> > +	 */
> > +	if (!(gfp_mask & __GFP_FS))
> > +		return -1;
> > +
> > +	/*
> > +	 * if we can't get the umount lock, then there's no point having the
> > +	 * shrinker try again because the sb is being torn down.
> > +	 */
> > +	if (!down_read_trylock(&sb->s_umount))
> > +		return -1;
> > +
> > +	if (!sb->s_root) {
> > +		up_read(&sb->s_umount);
> > +		return -1;
> > +	}
> > +
> > +	if (nr_to_scan) {
> > +		/* proportion the scan between the two caches */
> > +		int total;
> > +
> > +		total = sb->s_nr_dentry_unused + sb->s_nr_inodes_unused + 1;
> > +		count = (nr_to_scan * sb->s_nr_dentry_unused) / total;
> > +
> > +		/* prune dcache first as icache is pinned by it */
> > +		prune_dcache_sb(sb, count);
> > +		prune_icache_sb(sb, nr_to_scan - count);
> 
> Hmm, an interesting dynamic that you've changed is that previously
> we'd scan dcache LRU proportionately to pagecache, and then scan
> inode LRU in proportion to the current number of unused inodes.
> 
> But we can think of inodes that are only in use by unused (and aged)
> dentries as effectively unused themselves. So this sequence under
> estimates how many inodes to scan. This could bias pressure against
> dcache I'd think, especially considering inodes are far larger than
> dentries. Maybe require 2 passes to get the inodes unused in the
> first pass.

It's self-balancing - it trends towards an equal number of unused
dentries and inodes in the caches. Yes, it will tear down more
dentries at first, but we need to do that to be able to reclaim
inodes. As reclaim progresses the proportion of inodes increases, so
the number of inodes reclaimed increases.

Basically this is a recognition that the important cache for
avoiding IO is the inode cache, not the dentry cache. Once the inode
cache is freed we need to do IO to repopulate it, but rebuilding
dentries from the inode cache only costs CPU time. Hence under light
reclaim, inodes are mostly left in cache but we free up memory that
only costs CPU to rebuild. Under heavy, sustained reclaim, we trend
towards freeing equal numbers of objects from both caches.
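
To put rough numbers on that split (purely illustrative): the dcache
share of each batch is nr_to_scan * s_nr_dentry_unused / total, so it
starts at almost the whole batch and falls as unused inodes
accumulate:

	/* same split as prune_super() above, hypothetical numbers */
	count = 128 * 100000 / (100000 +     0 + 1);	/* 127 - nearly all dcache */
	count = 128 * 100000 / (100000 + 50000 + 1);	/*  85 - roughly 2/3 dcache */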

This is pretty much what the current code attempts to do - free a
lot of dentries, then free a smaller amount of the inodes that were
used by the freed dentries. Once again it is a direct encoding of
what is currently an implicit design feature - it makes it *obvious*
how we are trying to balance the caches.

Another reason for this is that the calculation changes again to
allow filesystem caches to modify this proportioning in the next
patch....

FWIW, this also makes workloads that generate hundreds of thousands
of never-to-be-used again negative dentries free dcache memory really
quickly on memory pressure...

> Part of the problem is the funny shrinker API.
> 
> The right way to do it is to change the shrinker API so that it passes
> down the lru_pages and scanned into the callback. From there, the
> shrinkers can calculate the appropriate ratio of objects to scan.
> No need for 2-call scheme, no need for shrinker->seeks, and the
> ability to calculate an appropriate ratio first for dcache, and *then*
> for icache.

My only concern about this is that it exposes the inner workings of the
shrinker and mm subsystem to code that simply doesn't need to know
about it.


> A helper of course can do the calculation (considering that every
> driver and their dog will do the wrong thing if we let them :)).
> 
> unsigned long shrinker_scan(unsigned long lru_pages,
> 			unsigned long lru_scanned,
> 			unsigned long nr_objects,
> 			unsigned long scan_ratio)
> {
> 	unsigned long long tmp = nr_objects;
> 
> 	tmp *= lru_scanned * 100;
> 	do_div(tmp, (lru_pages * scan_ratio) + 1);
> 
> 	return (unsigned long)tmp;
> }
> 
> Then the shrinker callback will go:
> 	sb->s_nr_dentry_scan += shrinker_scan(lru_pages, lru_scanned,
> 				sb->s_nr_dentry_unused,
> 				vfs_cache_pressure * SEEKS_PER_DENTRY);
> 	if (sb->s_nr_dentry_scan > SHRINK_BATCH)
> 		prune_dcache()
> 
> 	sb->s_nr_inode_scan += shrinker_scan(lru_pages, lru_scanned,
> 				sb->s_nr_inodes_unused,
> 				vfs_cache_pressure * SEEKS_PER_INODE);
> 	...
> 
> What do you think of that? Seeing as we're changing the shrinker API
> anyway, I'd think it is high time to do something like this.

Ignoring the dcache/icache reclaim ratio issues, I'd prefer a
two-call API that matches the current behaviour, leaving the
calculation of how much to reclaim in shrink_slab(). Encoding it this
way makes it more difficult to change the high level behaviour, e.g.
if we want to modify the amount of slab reclaim based on reclaim
priority, we'd have to change every shrinker instead of just
shrink_slab().

Cheers,

Dave.
-- 
Dave Chinner
david@fromorbit.com


^ permalink raw reply	[flat|nested] 26+ messages in thread

* Re: [PATCH 3/5] superblock: introduce per-sb cache shrinker infrastructure
  2010-05-27 20:32   ` Andrew Morton
@ 2010-05-27 23:01     ` Dave Chinner
  0 siblings, 0 replies; 26+ messages in thread
From: Dave Chinner @ 2010-05-27 23:01 UTC (permalink / raw)
  To: Andrew Morton; +Cc: linux-kernel, linux-fsdevel, linux-mm, xfs

On Thu, May 27, 2010 at 01:32:34PM -0700, Andrew Morton wrote:
> On Tue, 25 May 2010 18:53:06 +1000
> Dave Chinner <david@fromorbit.com> wrote:
> 
> > From: Dave Chinner <dchinner@redhat.com>
> > 
> > With context based shrinkers, we can implement a per-superblock
> > shrinker that shrinks the caches attached to the superblock. We
> > currently have global shrinkers for the inode and dentry caches that
> > split up into per-superblock operations via a coarse proportioning
> > method that does not batch very well.  The global shrinkers also
> > have a dependency - dentries pin inodes - so we have to be very
> > careful about how we register the global shrinkers so that the
> > implicit call order is always correct.
> > 
> > With a per-sb shrinker callout, we can encode this dependency
> > directly into the per-sb shrinker, hence avoiding the need for
> > strictly ordering shrinker registrations. We also have no need for
> > any proportioning code for the shrinker subsystem already provides
> > this functionality across all shrinkers. Allowing the shrinker to
> > operate on a single superblock at a time means that we do less
> > superblock list traversals and locking and reclaim should batch more
> > effectively. This should result in less CPU overhead for reclaim and
> > potentially faster reclaim of items from each filesystem.
> > 
> 
> I go all tingly when a changelog contains the word "should".
> 
> OK, it _should_ do X.  But _does_ it actually do X?

As I said to Nick - the tests I ran showed an average improvement of
5% but the accuracy of the benchmark was +/-10%. Hence it's hard to
draw any firm conclusions from that. It appears to be slightly
faster on an otherwise idle system, but...

As it is, the XFS shrinker that gets integrated into this structure
in a later patch peaks at a higher rate - 150k inodes/s vs 90k
inodes/s with the current shrinker - but it's still hard to
quantify. I'm going to run more benchmarks to try to get better
numbers.

> >  fs/super.c         |   53 +++++++++++++++++++++
> >  include/linux/fs.h |    7 +++
> >  4 files changed, 88 insertions(+), 214 deletions(-)
> > 
> > diff --git a/fs/dcache.c b/fs/dcache.c
> > index dba6b6d..d7bd781 100644
> > --- a/fs/dcache.c
> > +++ b/fs/dcache.c
> > @@ -456,21 +456,16 @@ static void prune_one_dentry(struct dentry * dentry)
> >   * which flags are set. This means we don't need to maintain multiple
> >   * similar copies of this loop.
> >   */
> > -static void __shrink_dcache_sb(struct super_block *sb, int *count, int flags)
> > +static void __shrink_dcache_sb(struct super_block *sb, int count, int flags)
> 
> Forgot to update the kerneldoc description of `count'.

Will fix.

Cheers,

Dave.
-- 
Dave Chinner
david@fromorbit.com


^ permalink raw reply	[flat|nested] 26+ messages in thread

* Re: [PATCH 3/5] superblock: introduce per-sb cache shrinker infrastructure
  2010-05-27 22:40     ` Dave Chinner
@ 2010-05-28  5:19       ` Nick Piggin
  2010-05-31  6:39         ` Dave Chinner
  0 siblings, 1 reply; 26+ messages in thread
From: Nick Piggin @ 2010-05-28  5:19 UTC (permalink / raw)
  To: Dave Chinner; +Cc: linux-kernel, linux-fsdevel, linux-mm, xfs

On Fri, May 28, 2010 at 08:40:34AM +1000, Dave Chinner wrote:
> On Thu, May 27, 2010 at 04:35:23PM +1000, Nick Piggin wrote:
> > But we can think of inodes that are only in use by unused (and aged)
> > dentries as effectively unused themselves. So this sequence under
> > estimates how many inodes to scan. This could bias pressure against
> > dcache I'd think, especially considering inodes are far larger than
> > dentries. Maybe require 2 passes to get the inodes unused in the
> > first pass.
> 
> It's self-balancing - it trends towards an equal number of unused
> dentries and inodes in the caches. Yes, it will tear down more
> dentries at first, but we need to do that to be able to reclaim
> inodes.

But then it doesn't scan enough inodes on the inode pass.


> As reclaim progresses the proportion of inodes increases, so
> the number of inodes reclaimed increases.
> 
> Basically this is a recognition that the important cache for
> avoiding IO is the inode cache, not the dentry cache. Once the inode

You can bias against the dcache using multipliers.


> cache is freed we need to do IO to repopulate it, but
> rebuilding dentries from the inode cache only costs CPU time. Hence
> under light reclaim, inodes are mostly left in cache but we free up
> memory that only costs CPU to rebuild. Under heavy, sustained
> reclaim, we trend towards freeing equal amounts of objects from both
> caches.

I don't know if you've got numbers or patterns to justify that.
My point is that things should stay as close to the old code as
possible without good reason.

 
> This is pretty much what the current code attempts to do - free a
> lot of dentries, then free a smaller amount of the inodes that were
> used by the freed dentries. Once again it is a direct encoding of
> what is currently an implicit design feature - it makes it *obvious*
> how we are trying to balance the caches.

With your patches, if there are no unused inodes you would need to take
2 passes at freeing the dentry cache. My suggestion is closer to the
current code.
 

> Another reason for this is that the calculation changes again to
> allow filesystem caches to modify this proportioning in the next
> patch....
> 
> FWIW, this also makes workloads that generate hundreds of thousands
> of never-to-be-used again negative dentries free dcache memory really
> quickly on memory pressure...

That would still be the case because used inodes aren't getting their
dentries freed, so little inode scanning will occur.

> 
> > Part of the problem is the funny shrinker API.
> > 
> > The right way to do it is to change the shrinker API so that it passes
> > down the lru_pages and scanned into the callback. From there, the
> > shrinkers can calculate the appropriate ratio of objects to scan.
> > No need for 2-call scheme, no need for shrinker->seeks, and the
> > ability to calculate an appropriate ratio first for dcache, and *then*
> > for icache.
> 
> My only concern about this is that it exposes the inner workings of the
> shrinker and mm subsystem to code that simply doesn't need to know
> about it.

It's just providing a ratio. The shrinkers already know they are
scanning based on a ratio of pagecache scanned.


> > A helper of course can do the calculation (considering that every
> > driver and their dog will do the wrong thing if we let them :)).
> > 
> > unsigned long shrinker_scan(unsigned long lru_pages,
> > 			unsigned long lru_scanned,
> > 			unsigned long nr_objects,
> > 			unsigned long scan_ratio)
> > {
> > 	unsigned long long tmp = nr_objects;
> > 
> > 	tmp *= lru_scanned * 100;
> > 	do_div(tmp, (lru_pages * scan_ratio) + 1);
> > 
> > 	return (unsigned long)tmp;
> > }
> > 
> > Then the shrinker callback will go:
> > 	sb->s_nr_dentry_scan += shrinker_scan(lru_pages, lru_scanned,
> > 				sb->s_nr_dentry_unused,
> > 				vfs_cache_pressure * SEEKS_PER_DENTRY);
> > 	if (sb->s_nr_dentry_scan > SHRINK_BATCH)
> > 		prune_dcache()
> > 
> > 	sb->s_nr_inode_scan += shrinker_scan(lru_pages, lru_scanned,
> > 				sb->s_nr_inodes_unused,
> > 				vfs_cache_pressure * SEEKS_PER_INODE);
> > 	...
> > 
> > What do you think of that? Seeing as we're changing the shrinker API
> > anyway, I'd think it is high time to do something like this.
> 
> Ignoring the dcache/icache reclaim ratio issues, I'd prefer a two

Well if it is an issue, it should be changed in a different patch
I think (with numbers).


> call API that matches the current behaviour, leaving the calculation
> of how much to reclaim in shrink_slab(). Encoding it this way makes
> it more difficult to change the high level behaviour e.g. if we want
> to modify the amount of slab reclaim based on reclaim priority, we'd
> have to change every shrinker instead of just shrink_slab().

We can modify the ratios before calling if needed, or have a default
ratio define to multiply with as well.

But shrinkers are very subsystem specific.


^ permalink raw reply	[flat|nested] 26+ messages in thread

* Re: [PATCH 3/5] superblock: introduce per-sb cache shrinker infrastructure
  2010-05-28  5:19       ` Nick Piggin
@ 2010-05-31  6:39         ` Dave Chinner
  2010-05-31  7:28           ` Nick Piggin
  0 siblings, 1 reply; 26+ messages in thread
From: Dave Chinner @ 2010-05-31  6:39 UTC (permalink / raw)
  To: Nick Piggin; +Cc: linux-kernel, linux-fsdevel, linux-mm, xfs

On Fri, May 28, 2010 at 03:19:24PM +1000, Nick Piggin wrote:
> On Fri, May 28, 2010 at 08:40:34AM +1000, Dave Chinner wrote:
> > On Thu, May 27, 2010 at 04:35:23PM +1000, Nick Piggin wrote:
> > > But we can think of inodes that are only in use by unused (and aged)
> > > dentries as effectively unused themselves. So this sequence under
> > > estimates how many inodes to scan. This could bias pressure against
> > > dcache I'd think, especially considering inodes are far larger than
> > > dentries. Maybe require 2 passes to get the inodes unused in the
> > > first pass.
> > 
> > It's self-balancing - it trends towards an equal number of unused
> > dentries and inodes in the caches. Yes, it will tear down more
> > dentries at first, but we need to do that to be able to reclaim
> > inodes.
> 
> But then it doesn't scan enough inodes on the inode pass.

We don't get a single shrinker call - we get batches of them for each
shrinker.  The shrinker determines how many objects to scan in the
cache, and that only changes from one shrinker call to the next based
on the shrinker->seeks and the number of objects in the cache.  Once
the number is decided, the shrinker gets batches of reclaim to
operate on.
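
To spell the batching out, shrink_slab() does roughly this for each
registered shrinker (a heavily simplified paraphrase, not the exact
mainline code):

	/* 1st call: ask the shrinker how many objects it currently has */
	max_pass = (*shrinker->shrink)(0, gfp_mask);

	/* scale the scan target by pagecache pressure and shrinker->seeks */
	delta = (4 * scanned / shrinker->seeks) * max_pass;
	do_div(delta, lru_pages + 1);
	shrinker->nr += delta;

	/* then hand the work back to the shrinker in SHRINK_BATCH chunks */
	while (shrinker->nr >= SHRINK_BATCH) {
		(*shrinker->shrink)(SHRINK_BATCH, gfp_mask);
		shrinker->nr -= SHRINK_BATCH;
	}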

Fundamentally, the current shrinker will ask for the same percentage
of each cache to be scanned - if it decides to scan 20% of the
dentry cache, it will also decide to scan 20% of the inode cache.
Hence what the inode shrinker is doing is scanning 20% of the inodes
left unused by the dcache shrinker.

In rough numbers, say we have 100k dentries, and the shrinker
calculates it needs to scan 20% of the caches to reclaim them, the
current code will end up with:

		unused dentries		unused inodes
before		 100k			    0
after dentry	  80k			  20k
after inode	  80k			  16k

So we get 20k dentries freed and 4k inodes freed on that shrink_slab
pass.

To contrast this against the code I proposed, I'll make a couple of
simplifications to avoid hurting my brain. That is, I'll assume
SHRINK_BATCH=100 (rather than 128) and forget about rounding
errors. With this, the algorithm I encoded gives roughly the
following for a 20% object reclaim:

number of batches = 20k / 100 = 200

				Unused
		dentries+inodes		dentries	inodes
before		  100k			100k		    0
batch 1		  100k			99900		  100
batch 2		  100k			99800		  200
....
batch 10	  100k			99000		 1000
batch 20	  99990			98010		 1990
batch 30	  99980			97030		 2950
batch 50	  99910			95100		 4810
batch 60	  99860			94150		 5710
.....
batch 200	  98100			81900		16200

And so (roughly) we see that the number of inodes being reclaimed per
set of 10 batches roughly equals (batch number - 10). Hence over
200 batches, we can expect to see roughly 190 + 180 + ... + 10
inodes reclaimed. That is 1900 inodes.  Similarly for dentries, we
get roughly 1000 + 990 + 980 + ... + 810 dentries reclaimed - 18,100
in total.

In other words, we have roughly 18k dentries and 1.9k inodes
reclaimed with the new algorithm. That does mean it
initially attempts to reclaim dentries faster than the current code, but
as the number of unused inodes increases, this comes back to parity
with the current code and we end up with a 1:1 reclaim ratio.

This is good behaviour - dentries are cheap to reconstruct from the
inode cache, and we should hold onto the inode cache as much as
possible, i.e. we should reclaim inodes more aggressively only if
there is sustained pressure on the superblock, and that is what the
above algorithm does.
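
If anyone wants to play with the numbers, a few lines of throwaway
user-space C reproduce the trend (a sketch only: it keeps the rounding
the table above ignores, and it assumes every pruned dentry leaves
exactly one inode on the unused list):

#include <stdio.h>

int main(void)
{
	long dentries = 100000, inodes = 0;	/* unused objects */
	long batch;

	for (batch = 1; batch <= 200; batch++) {
		long total = dentries + inodes + 1;
		long count = 100 * dentries / total;	/* dcache share */

		dentries -= count;
		inodes += count;	/* each pruned dentry exposes an inode */
		inodes -= 100 - count;	/* icache share of this batch */

		if (batch % 50 == 0)
			printf("batch %3ld: unused dentries %6ld, unused inodes %6ld\n",
			       batch, dentries, inodes);
	}
	return 0;
}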

> > As reclaim progresses the proportion of inodes increases, so
> > the number of inodes reclaimed increases.
> > 
> > Basically this is a recognition that the important cache for
> > avoiding IO is the inode cache, not the dentry cache. Once the inode
> 
> You can bias against the dcache using multipliers.

Multipliers are not self-balancing, and generally just amplify any
imbalance an algorithm tends towards. The vfs_cache_pressure
multiplier is a shining example of this kind of utterly useless
knob...

> > > Part of the problem is the funny shrinker API.
> > > 
> > > The right way to do it is to change the shrinker API so that it passes
> > > down the lru_pages and scanned into the callback. From there, the
> > > shrinkers can calculate the appropriate ratio of objects to scan.
> > > No need for 2-call scheme, no need for shrinker->seeks, and the
> > > ability to calculate an appropriate ratio first for dcache, and *then*
> > > for icache.
> > 
> > My only concern about this is that it exposes the inner workings of the
> > shrinker and mm subsystem to code that simply doesn't need to know
> > about it.
> 
> It's just providing a ratio. The shrinkers already know they are
> scanning based on a ratio of pagecache scanned.

Sure, but the shrinkers are just a simple mechanism for implementing
VM policy decisions. IMO reclaim policy decisions should not be
pushed down and replicated in every one of these reclaim mechanisms.

> But shrinkers are very subsystem specific.

And as such should concentrate on getting their subsystem reclaim
correct, not have to worry about implementing VM policy
calculations...

Cheers,

Dave.
-- 
Dave Chinner
david@fromorbit.com


^ permalink raw reply	[flat|nested] 26+ messages in thread

* Re: [PATCH 3/5] superblock: introduce per-sb cache shrinker infrastructure
  2010-05-31  6:39         ` Dave Chinner
@ 2010-05-31  7:28           ` Nick Piggin
  0 siblings, 0 replies; 26+ messages in thread
From: Nick Piggin @ 2010-05-31  7:28 UTC (permalink / raw)
  To: Dave Chinner; +Cc: linux-kernel, linux-fsdevel, linux-mm, xfs

On Mon, May 31, 2010 at 04:39:38PM +1000, Dave Chinner wrote:
> On Fri, May 28, 2010 at 03:19:24PM +1000, Nick Piggin wrote:
> > On Fri, May 28, 2010 at 08:40:34AM +1000, Dave Chinner wrote:
> > > On Thu, May 27, 2010 at 04:35:23PM +1000, Nick Piggin wrote:
> > > > But we can think of inodes that are only in use by unused (and aged)
> > > > dentries as effectively unused themselves. So this sequence under
> > > > estimates how many inodes to scan. This could bias pressure against
> > > > dcache I'd think, especially considering inodes are far larger than
> > > > dentries. Maybe require 2 passes to get the inodes unused in the
> > > > first pass.
> > > 
> > > It's self-balancing - it trends towards an equal number of unused
> > > dentries and inodes in the caches. Yes, it will tear down more
> > > dentries at first, but we need to do that to be able to reclaim
> > > inodes.
> > 
> > But then it doesn't scan enough inodes on the inode pass.
> 
> We don't get a single shrinker call - we get batches of them for each
> shrinker.

OK fair point. However

[...]

> In other words, we have roughly 18k dentries and 1.9k inodes
> reclaimed with the new algorithm. That does mean it
> initially attempts to reclaim dentries faster than the current code, but
> as the number of unused inodes increases, this comes back to parity
> with the current code and we end up with a 1:1 reclaim ratio.
> 
> This is good behaviour - dentries are cheap to reconstruct from the
> inode cache, and we should hold onto the inode cache as much as
> possible. i.e. we should reclaim them more aggressively only if
> there is sustained pressure on the superblock and that is what the
> above algorithm does.

I prefer just to keep changes to a minimum and split into separate
patches (each with at least a basic test or two showing no regression).

As-is you're already changing global inode/dentry passes into per-sb
inode and dentry passes. I think it can only be a good thing
for that changeset if other changes are minimised.

Then if it is so obviously good behaviour to reduce dcache pressure,
it should be easy to justify that too.

 
> > > As reclaim progresses the proportion of inodes increases, so
> > > the number of inodes reclaimed increases.
> > > 
> > > Basically this is a recognition that the important cache for
> > > avoiding IO is the inode cache, not the dentry cache. Once the inode
> > 
> > You can bias against the dcache using multipliers.
> 
> Multipliers are not self-balancing, and generally just amplify any
> imbalance an algorithm tends towards. The vfs_cache_pressure
> multiplier is a shining example of this kind of utterly useless
> knob...

Well you can also bias against the dcache with any other means,
including the change you've made here. My main point I guess is
that it should not be rolled into this patchset as-is (or should at
least be an individual patch).

 
> > > > Part of the problem is the funny shrinker API.
> > > > 
> > > > The right way to do it is to change the shrinker API so that it passes
> > > > down the lru_pages and scanned into the callback. From there, the
> > > > shrinkers can calculate the appropriate ratio of objects to scan.
> > > > No need for 2-call scheme, no need for shrinker->seeks, and the
> > > > ability to calculate an appropriate ratio first for dcache, and *then*
> > > > for icache.
> > > 
> > > My only concern about this is that it exposes the inner workings of the
> > > shrinker and mm subsystem to code that simply doesn't need to know
> > > about it.
> > 
> > It's just providing a ratio. The shrinkers already know they are
> > scanning based on a ratio of pagecache scanned.
> 
> Sure, but the shrinkers are just a simple mechanism for implementing
> VM policy decisions. IMO reclaim policy decisions should not be
> pushed down and replicated in every one of these reclaim mechanisms.

Not really. The VM doesn't know about any of those. They are just
asked to provide a ratio and do some scanning based on some abstract cost.

The VM doesn't know anything about usage patterns, in-use vs unused
objects, exactly how their LRU algorithms are supposed to work, etc.

There is very little policy decision by the VM in the shrinkers.

 
> > But shrinkers are very subsystem specific.
> 
> And as such should concentrate on getting their subsystem reclaim
> correct, not have to worry about implementing VM policy
> calculations...

Clearly they wouldn't with what I was proposing. And the result would
be much more flexible and would also give the shrinkers more information.


^ permalink raw reply	[flat|nested] 26+ messages in thread

end of thread, other threads:[~2010-05-31  7:28 UTC | newest]

Thread overview: 26+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2010-05-14  7:24 [PATCH 0/5] Per-superblock shrinkers Dave Chinner
2010-05-14  7:24 ` [PATCH 1/5] inode: Make unused inode LRU per superblock Dave Chinner
2010-05-14  7:24 ` [PATCH 2/5] mm: add context argument to shrinker callback Dave Chinner
2010-05-14  7:24 ` [PATCH 3/5] superblock: introduce per-sb cache shrinker infrastructure Dave Chinner
2010-05-14  7:24 ` [PATCH 4/5] superblock: add filesystem shrinker operations Dave Chinner
2010-05-14  7:24 ` [PATCH 5/5] xfs: make use of new shrinker callout Dave Chinner
2010-05-14 17:46 ` Defrag in shrinkers (was Re: [PATCH 0/5] Per-superblock shrinkers) Christoph Lameter
2010-05-14 20:36   ` Defrag in shrinkers Andi Kleen
2010-05-15 17:08     ` Ed Tomlinson
2010-05-17  0:24       ` Dave Chinner
2010-05-15  1:15   ` Defrag in shrinkers (was Re: [PATCH 0/5] Per-superblock shrinkers) Dave Chinner
2010-05-15  1:30 ` [PATCH 0/5] Per-superblock shrinkers Al Viro
2010-05-17  0:19   ` Dave Chinner
  -- strict thread matches above, loose matches on Subject: below --
2010-05-25  8:53 [PATCH 0/5] Per superblock shrinkers V2 Dave Chinner
2010-05-25  8:53 ` [PATCH 3/5] superblock: introduce per-sb cache shrinker infrastructure Dave Chinner
2010-05-26 16:41   ` Nick Piggin
2010-05-26 23:12     ` Dave Chinner
2010-05-27  2:19       ` Nick Piggin
2010-05-27  4:07         ` Dave Chinner
2010-05-27  4:24           ` Nick Piggin
2010-05-27  6:35   ` Nick Piggin
2010-05-27 22:40     ` Dave Chinner
2010-05-28  5:19       ` Nick Piggin
2010-05-31  6:39         ` Dave Chinner
2010-05-31  7:28           ` Nick Piggin
2010-05-27 20:32   ` Andrew Morton
2010-05-27 23:01     ` Dave Chinner

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).