From: Dave Chinner <david@fromorbit.com>
To: npiggin@suse.de
Cc: linux-fsdevel@vger.kernel.org, linux-kernel@vger.kernel.org,
	linux-mm@kvack.org, John Stultz <johnstul@us.ibm.com>,
	Frank Mayhar <fmayhar@google.com>
Subject: Re: [patch 50/52] mm: implement per-zone shrinker
Date: Wed, 30 Jun 2010 16:28:58 +1000
Message-ID: <20100630062858.GE24712@dastard>
In-Reply-To: <20100624030733.676440935@suse.de>

On Thu, Jun 24, 2010 at 01:03:02PM +1000, npiggin@suse.de wrote:
> Allow the shrinker to do per-zone shrinking. This means it is called for
> each zone scanned. The shrinker is now completely responsible for calculating
> and batching (given helpers), which provides better flexibility.
> 
> Finding the ratio of objects to scan requires scaling the ratio of pagecache
> objects scanned. By passing down both the per-zone and the global reclaimable
> pages, per-zone caches and global caches can be calculated correctly.
> 
> Finally, add some fixed-point scaling to the ratio, which helps calculations.
> 
> Signed-off-by: Nick Piggin <npiggin@suse.de>
> ---
>  fs/dcache.c        |    2 
>  fs/drop_caches.c   |    2 
>  fs/inode.c         |    2 
>  fs/mbcache.c       |    4 -
>  fs/nfs/dir.c       |    2 
>  fs/nfs/internal.h  |    2 
>  fs/quota/dquot.c   |    2 
>  include/linux/mm.h |    6 +-
>  mm/vmscan.c        |  131 ++++++++++++++---------------------------------------
>  9 files changed, 47 insertions(+), 106 deletions(-)

The diffstat doesn't match the patch ;)


> Index: linux-2.6/include/linux/mm.h
> ===================================================================
> --- linux-2.6.orig/include/linux/mm.h
> +++ linux-2.6/include/linux/mm.h
> @@ -999,16 +999,19 @@ static inline void sync_mm_rss(struct ta
>   * querying the cache size, so a fastpath for that case is appropriate.
>   */
>  struct shrinker {
> -	int (*shrink)(int nr_to_scan, gfp_t gfp_mask);
> -	int seeks;	/* seeks to recreate an obj */
> -
> +	int (*shrink)(struct zone *zone, unsigned long scanned, unsigned long total,
> +					unsigned long global, gfp_t gfp_mask);

Can we add the shrinker structure to that callback, too, so that we
can get away from needing global context for the shrinker?
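
For illustration, a minimal userspace sketch of that idea (not kernel
code): the callback is handed its own shrinker and recovers private
state from the structure that embeds it. The simplified struct shrinker,
struct my_cache and my_cache_shrink() are all made-up names, and the
callback signature is an assumption rather than the one in the patch.

```c
#include <stddef.h>

/* Hypothetical, simplified stand-in for the kernel's struct shrinker. */
struct shrinker {
	/* The callback receives its own shrinker, so it needs no globals. */
	unsigned long (*shrink)(struct shrinker *s, unsigned long nr_to_scan);
};

/* Hypothetical cache that embeds its shrinker next to its private state. */
struct my_cache {
	unsigned long nr_objects;	/* objects currently cached */
	struct shrinker shrinker;
};

/* Userspace equivalent of the kernel's container_of(). */
#define container_of(ptr, type, member) \
	((type *)((char *)(ptr) - offsetof(type, member)))

/* Free up to nr_to_scan objects from the cache that owns this shrinker. */
static unsigned long my_cache_shrink(struct shrinker *s,
				     unsigned long nr_to_scan)
{
	struct my_cache *cache = container_of(s, struct my_cache, shrinker);
	unsigned long freed = nr_to_scan < cache->nr_objects ?
				nr_to_scan : cache->nr_objects;

	cache->nr_objects -= freed;
	return freed;
}
```

With this shape, two instances of the same cache type can register the
same callback and still keep separate counts, which a function-local
static cannot give you.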


> +unsigned long shrinker_do_scan(unsigned long *dst, unsigned long batch)
> +{
> +	unsigned long nr = ACCESS_ONCE(*dst);

What's the point of ACCESS_ONCE() here?

/me gets most of the way into the patch

Oh, it's because you are using static variables for nr_to_scan and
hence when concurrent shrinkers are running they are all
incrementing and decrementing the same variable. That doesn't sound
like a good idea to me - concurrent shrinkers are much more likely
with per-zone shrinker callouts. It seems to me that a reclaim
thread could be kept in a shrinker long after it has run its
scan count if new shrinker calls from a different reclaim context
occur before the first has finished....

As a further question - why do some shrinkers get converted to a
single global nr_to_scan, and others get converted to a private
nr_to_scan? Shouldn't they all use the same method? The static
variable method looks to me to be full of races - concurrent callers
to shrinker_add_scan() do not look at all thread safe to me.

> +	if (nr < batch)
> +		return 0;

Why wouldn't we return nr here to drain the remaining objects?
Doesn't this mean we can't shrink caches that have a scan count of
less than SHRINK_BATCH?
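
A minimal sketch of the draining variant being asked about, assuming a
plain (non-atomic) counter. do_scan_drain() is a hypothetical name, not
a helper from the patch, and whether draining is actually wanted is the
open question here: the old shrink_slab() loop only ran full batches and
left the rest in shrinker->nr.

```c
/*
 * Hypothetical variant of shrinker_do_scan(): hand out full batches while
 * they last, then hand out the sub-batch remainder instead of returning 0.
 * Not thread safe; it only illustrates the drain behaviour.
 */
static unsigned long do_scan_drain(unsigned long *dst, unsigned long batch)
{
	unsigned long nr = *dst;
	unsigned long take = nr < batch ? nr : batch;

	*dst = nr - take;
	return take;
}
```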

> +	*dst = nr - batch;

Similarly, that is not a threadsafe update.
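
One way to make the read-check-subtract sequence safe without a lock is
a compare-and-swap loop. The following is a userspace sketch using C11
atomics, not kernel code; do_scan_atomic() is a made-up name. An
in-kernel version would presumably use atomic_long_t and a cmpxchg
loop instead.

```c
#include <stdatomic.h>

/*
 * Atomically take one batch from *dst. Returns batch on success, or 0 if
 * fewer than batch objects are pending. Concurrent callers can never take
 * the same batch twice or drive the counter below zero.
 */
static unsigned long do_scan_atomic(atomic_ulong *dst, unsigned long batch)
{
	unsigned long nr = atomic_load(dst);

	do {
		if (nr < batch)
			return 0;
		/* On failure nr is refreshed with the current value; retry. */
	} while (!atomic_compare_exchange_weak(dst, &nr, nr - batch));

	return batch;
}
```

The weak form of compare-exchange may fail spuriously, which is why it
sits inside the retry loop.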

> +	return batch;
> +}
> +EXPORT_SYMBOL(shrinker_do_scan);
> +
>  /*
>   * Call the shrink functions to age shrinkable caches
>   *
> @@ -198,8 +228,8 @@ EXPORT_SYMBOL(unregister_shrinker);
>   *
>   * Returns the number of slab objects which we shrunk.
>   */
> -unsigned long shrink_slab(unsigned long scanned, gfp_t gfp_mask,
> -			unsigned long lru_pages)
> +static unsigned long shrink_slab(struct zone *zone, unsigned long scanned, unsigned long total,
> +			unsigned long global, gfp_t gfp_mask)
>  {
>  	struct shrinker *shrinker;
>  	unsigned long ret = 0;
> @@ -211,55 +241,25 @@ unsigned long shrink_slab(unsigned long
>  		return 1;	/* Assume we'll be able to shrink next time */
>  
>  	list_for_each_entry(shrinker, &shrinker_list, list) {
> -		unsigned long long delta;
> -		unsigned long total_scan;
> -		unsigned long max_pass = (*shrinker->shrink)(0, gfp_mask);
> -
> -		delta = (4 * scanned) / shrinker->seeks;
> -		delta *= max_pass;
> -		do_div(delta, lru_pages + 1);
> -		shrinker->nr += delta;
> -		if (shrinker->nr < 0) {
> -			printk(KERN_ERR "shrink_slab: %pF negative objects to "
> -			       "delete nr=%ld\n",
> -			       shrinker->shrink, shrinker->nr);
> -			shrinker->nr = max_pass;
> -		}
> -
> -		/*
> -		 * Avoid risking looping forever due to too large nr value:
> -		 * never try to free more than twice the estimate number of
> -		 * freeable entries.
> -		 */
> -		if (shrinker->nr > max_pass * 2)
> -			shrinker->nr = max_pass * 2;
> -
> -		total_scan = shrinker->nr;
> -		shrinker->nr = 0;
> -
> -		while (total_scan >= SHRINK_BATCH) {
> -			long this_scan = SHRINK_BATCH;
> -			int shrink_ret;
> -			int nr_before;
> -
> -			nr_before = (*shrinker->shrink)(0, gfp_mask);
> -			shrink_ret = (*shrinker->shrink)(this_scan, gfp_mask);
> -			if (shrink_ret == -1)
> -				break;
> -			if (shrink_ret < nr_before)
> -				ret += nr_before - shrink_ret;
> -			count_vm_events(SLABS_SCANNED, this_scan);
> -			total_scan -= this_scan;
> -
> -			cond_resched();

Removing this means we need cond_resched() in all shrinker loops now
to maintain the same latencies as we currently have. I note that
you've done this for most of the shrinkers, but the documentation
needs to be updated to mention this...


> -		}
> -
> -		shrinker->nr += total_scan;

And dropping this means we do not carry over the remainder of the
previous scan into the next scan. This means we could be scanning a
lot less with this new code.
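
To put rough numbers on that, here is a small userspace simulation of
batch scanning with and without carrying the remainder forward. It is
only a sketch: BATCH, scan_batches() and simulate() are made-up names,
the per-call delta is an invented workload, and none of this models the
real ratio calculation.

```c
#define BATCH 128UL

/* Scan in whole batches; return objects scanned, leave the rest pending. */
static unsigned long scan_batches(unsigned long *pending)
{
	unsigned long scanned = 0;

	while (*pending >= BATCH) {
		*pending -= BATCH;
		scanned += BATCH;
	}
	return scanned;
}

/* Total scanned over 'calls' invocations that each add 'delta' of work. */
static unsigned long simulate(unsigned long calls, unsigned long delta,
			      int carry_over)
{
	unsigned long pending = 0, total = 0;

	while (calls--) {
		pending += delta;
		total += scan_batches(&pending);
		if (!carry_over)
			pending = 0;	/* remainder is thrown away */
	}
	return total;
}
```

With ten calls that each add 100 objects of work, the carry-over version
scans 896 objects in total while the version that discards the remainder
never reaches a full batch and scans nothing.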

> +		(*shrinker->shrink)(zone, scanned, total, global, gfp_mask);
>  	}
>  	up_read(&shrinker_rwsem);
>  	return ret;
>  }
>  
> +void shrink_all_slab(void)
> +{
> +	struct zone *zone;
> +	unsigned long nr;
> +
> +again:
> +	nr = 0;
> +	for_each_zone(zone)
> +		nr += shrink_slab(zone, 1, 1, 1, GFP_KERNEL);
> +	if (nr >= 10)
> +		goto again;

	do {
		nr = 0;
		for_each_zone(zone)
			nr += shrink_slab(zone, 1, 1, 1, GFP_KERNEL);
	} while (nr >= 10);

> @@ -1705,6 +1708,23 @@ static void shrink_zone(int priority, st
>  	if (inactive_anon_is_low(zone, sc) && nr_swap_pages > 0)
>  		shrink_active_list(SWAP_CLUSTER_MAX, zone, sc, priority, 0);
>  
> +	/*
> +	 * Don't shrink slabs when reclaiming memory from
> +	 * over limit cgroups
> +	 */
> +	if (scanning_global_lru(sc)) {
> +		struct reclaim_state *reclaim_state = current->reclaim_state;
> +
> +		shrink_slab(zone, sc->nr_scanned - nr_scanned,
> +			lru_pages, global_lru_pages, sc->gfp_mask);
> +		if (reclaim_state) {
> +			nr_reclaimed += reclaim_state->reclaimed_slab;
> +			reclaim_state->reclaimed_slab = 0;
> +		}
> +	}

So effectively we are going to be calling shrink_slab() once per
zone instead of once per priority loop, right? That means we are
going to be doing a lot more concurrent shrink_slab() calls than the
current code. Combined with the removal of residual aggregation,
I think this will alter the reclaim balance somewhat. Have you tried
to quantify this?

> Index: linux-2.6/fs/dcache.c
> ===================================================================
> --- linux-2.6.orig/fs/dcache.c
> +++ linux-2.6/fs/dcache.c
> @@ -748,20 +748,26 @@ again2:
>   *
>   * This function may fail to free any resources if all the dentries are in use.
>   */
> -static void prune_dcache(int count)
> +static void prune_dcache(struct zone *zone, unsigned long scanned,
> +			unsigned long total, gfp_t gfp_mask)
> +
>  {
> +	unsigned long nr_to_scan;
>  	struct super_block *sb, *n;
>  	int w_count;
> -	int unused = dentry_stat.nr_unused;
>  	int prune_ratio;
> -	int pruned;
> +	int count, pruned;
>  
> -	if (unused == 0 || count == 0)
> +	shrinker_add_scan(&nr_to_scan, scanned, total, dentry_stat.nr_unused,
> +			DEFAULT_SEEKS * sysctl_vfs_cache_pressure / 100);
> +done:
> +	count = shrinker_do_scan(&nr_to_scan, SHRINK_BATCH);
> +	if (dentry_stat.nr_unused == 0 || count == 0)
>  		return;
> -	if (count >= unused)
> +	if (count >= dentry_stat.nr_unused)
>  		prune_ratio = 1;
>  	else
> -		prune_ratio = unused / count;
> +		prune_ratio = dentry_stat.nr_unused / count;
>  	spin_lock(&sb_lock);
>  	list_for_each_entry_safe(sb, n, &super_blocks, s_list) {
>  		if (list_empty(&sb->s_instances))
> @@ -810,6 +816,10 @@ static void prune_dcache(int count)
>  			break;
>  	}
>  	spin_unlock(&sb_lock);
> +	if (count <= 0) {
> +		cond_resched();
> +		goto done;
> +	}
>  }
>  
>  /**
> @@ -1176,19 +1186,15 @@ EXPORT_SYMBOL(shrink_dcache_parent);
>   *
>   * In this case we return -1 to tell the caller that we baled.
>   */
> -static int shrink_dcache_memory(int nr, gfp_t gfp_mask)
> +static int shrink_dcache_memory(struct zone *zone, unsigned long scanned,
> +		unsigned long total, unsigned long global, gfp_t gfp_mask)
>  {
> -	if (nr) {
> -		if (!(gfp_mask & __GFP_FS))
> -			return -1;
> -		prune_dcache(nr);
> -	}
> -	return (dentry_stat.nr_unused / 100) * sysctl_vfs_cache_pressure;
> +	prune_dcache(zone, scanned, global, gfp_mask);
> +	return 0;
>  }

I would have thought that putting the shrinker_add_scan/
shrinker_do_scan loop in shrink_dcache_memory() and leaving
prune_dcache untouched would have been a better separation.
I note that this is what you did with prune_icache(), so consistency
between the two would be good ;)

Also, this patch drops the __GFP_FS check from the dcache shrinker -
not intentional, right?

> @@ -211,28 +215,38 @@ mb_cache_shrink_fn(int nr_to_scan, gfp_t
>  			  atomic_read(&cache->c_entry_count));
>  		count += atomic_read(&cache->c_entry_count);
>  	}
> +	shrinker_add_scan(&nr_to_scan, scanned, global, count,
> +			DEFAULT_SEEKS * sysctl_vfs_cache_pressure / 100);
>  	mb_debug("trying to free %d entries", nr_to_scan);
> -	if (nr_to_scan == 0) {
> +
> +again:
> +	nr = shrinker_do_scan(&nr_to_scan, SHRINK_BATCH);
> +	if (!nr) {
>  		spin_unlock(&mb_cache_spinlock);
> -		goto out;
> +		return 0;
>  	}
> -	while (nr_to_scan-- && !list_empty(&mb_cache_lru_list)) {
> +	while (!list_empty(&mb_cache_lru_list)) {
>  		struct mb_cache_entry *ce =
>  			list_entry(mb_cache_lru_list.next,
>  				   struct mb_cache_entry, e_lru_list);
>  		list_move_tail(&ce->e_lru_list, &free_list);
>  		__mb_cache_entry_unhash(ce);
> +		cond_resched_lock(&mb_cache_spinlock);
> +		if (!--nr)
> +			break;
>  	}
>  	spin_unlock(&mb_cache_spinlock);
>  	list_for_each_safe(l, ltmp, &free_list) {
>  		__mb_cache_entry_forget(list_entry(l, struct mb_cache_entry,
>  						   e_lru_list), gfp_mask);
>  	}
> -out:
> -	return (count / 100) * sysctl_vfs_cache_pressure;
> +	if (!nr) {
> +		spin_lock(&mb_cache_spinlock);
> +		goto again;
> +	}

Another candidate for a do-while loop.

> +	return 0;
>  }
>  
> -
>  /*
>   * mb_cache_create()  create a new cache
>   *
> Index: linux-2.6/fs/nfs/dir.c
> ===================================================================
> --- linux-2.6.orig/fs/nfs/dir.c
> +++ linux-2.6/fs/nfs/dir.c
> @@ -1709,21 +1709,31 @@ static void nfs_access_free_list(struct
>  	}
>  }
>  
> -int nfs_access_cache_shrinker(int nr_to_scan, gfp_t gfp_mask)
> +int nfs_access_cache_shrinker(struct zone *zone, unsigned long scanned,
> +		unsigned long total, unsigned long global, gfp_t gfp_mask)
>  {
> +	static unsigned long nr_to_scan;
>  	LIST_HEAD(head);
> -	struct nfs_inode *nfsi;
>  	struct nfs_access_entry *cache;
> -
> -	if ((gfp_mask & GFP_KERNEL) != GFP_KERNEL)
> -		return (nr_to_scan == 0) ? 0 : -1;
> +	unsigned long nr;
>  
>  	spin_lock(&nfs_access_lru_lock);
> -	list_for_each_entry(nfsi, &nfs_access_lru_list, access_cache_inode_lru) {
> +	shrinker_add_scan(&nr_to_scan, scanned, global,
> +			atomic_long_read(&nfs_access_nr_entries),
> +			DEFAULT_SEEKS * sysctl_vfs_cache_pressure / 100);
> +	if (!(gfp_mask & __GFP_FS) || nr_to_scan < SHRINK_BATCH) {
> +		spin_unlock(&nfs_access_lru_lock);
> +		return 0;
> +	}
> +	nr = ACCESS_ONCE(nr_to_scan);
> +	nr_to_scan = 0;

That's not safe for concurrent callers. Both could get nr =
nr_to_scan rather than nr(1) = nr_to_scan and nr(2) = 0 which I
think is the intent....
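
If the read-and-clear is meant to hand the whole pending count to
exactly one caller without relying on a lock, a single atomic exchange
expresses that directly. This is a userspace sketch with C11 atomics;
take_all() is a made-up name, and an in-kernel version would presumably
use atomic_long_t with atomic_long_xchg().

```c
#include <stdatomic.h>

/*
 * Atomically fetch the pending scan count and reset it to zero. Of two
 * racing callers, one gets the full count and the other gets 0.
 */
static unsigned long take_all(atomic_ulong *nr_to_scan)
{
	return atomic_exchange(nr_to_scan, 0);
}
```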

> Index: linux-2.6/arch/x86/kvm/mmu.c
> ===================================================================
> --- linux-2.6.orig/arch/x86/kvm/mmu.c
> +++ linux-2.6/arch/x86/kvm/mmu.c
> @@ -2924,14 +2924,29 @@ static int kvm_mmu_remove_some_alloc_mmu
>  	return kvm_mmu_zap_page(kvm, page) + 1;
>  }
>  
> -static int mmu_shrink(int nr_to_scan, gfp_t gfp_mask)
> +static int mmu_shrink(struct zone *zone, unsigned long scanned,
> +                unsigned long total, unsigned long global, gfp_t gfp_mask)
>  {
> +	static unsigned long nr_to_scan;
>  	struct kvm *kvm;
>  	struct kvm *kvm_freed = NULL;
> -	int cache_count = 0;
> +	unsigned long cache_count = 0;
>  
>  	spin_lock(&kvm_lock);
> +	list_for_each_entry(kvm, &vm_list, vm_list) {
> +		cache_count += kvm->arch.n_alloc_mmu_pages -
> +			 kvm->arch.n_free_mmu_pages;
> +	}
>  
> +	shrinker_add_scan(&nr_to_scan, scanned, global, cache_count,
> +			DEFAULT_SEEKS*10);
> +
> +done:
> +	cache_count = shrinker_do_scan(&nr_to_scan, SHRINK_BATCH);
> +	if (!cache_count) {
> +		spin_unlock(&kvm_lock);
> +		return 0;
> +	}

I note that this use of a static scan count is thread safe because
all the calculations are done under the kvm_lock. That's three
different ways the shrinkers implement the same functionality
now....

> Index: linux-2.6/fs/xfs/linux-2.6/xfs_sync.c
> ===================================================================
> --- linux-2.6.orig/fs/xfs/linux-2.6/xfs_sync.c
> +++ linux-2.6/fs/xfs/linux-2.6/xfs_sync.c
> @@ -838,43 +838,52 @@ static struct rw_semaphore xfs_mount_lis
>  
>  static int
>  xfs_reclaim_inode_shrink(
> -	int		nr_to_scan,
> +	struct zone	*zone,
> +	unsigned long	scanned,
> +	unsigned long	total,
> +	unsigned long	global,
>  	gfp_t		gfp_mask)
>  {
> +	static unsigned long nr_to_scan;
> +	int		nr;
>  	struct xfs_mount *mp;
>  	struct xfs_perag *pag;
>  	xfs_agnumber_t	ag;
> -	int		reclaimable = 0;
> -
> -	if (nr_to_scan) {
> -		if (!(gfp_mask & __GFP_FS))
> -			return -1;
> -
> -		down_read(&xfs_mount_list_lock);
> -		list_for_each_entry(mp, &xfs_mount_list, m_mplist) {
> -			xfs_inode_ag_iterator(mp, xfs_reclaim_inode, 0,
> -					XFS_ICI_RECLAIM_TAG, 1, &nr_to_scan);
> -			if (nr_to_scan <= 0)
> -				break;
> -		}
> -		up_read(&xfs_mount_list_lock);
> -	}
> +	unsigned long	nr_reclaimable = 0;
>  
>  	down_read(&xfs_mount_list_lock);
>  	list_for_each_entry(mp, &xfs_mount_list, m_mplist) {
>  		for (ag = 0; ag < mp->m_sb.sb_agcount; ag++) {
>  			pag = xfs_perag_get(mp, ag);
> -			reclaimable += pag->pag_ici_reclaimable;
> +			nr_reclaimable += pag->pag_ici_reclaimable;
>  			xfs_perag_put(pag);
>  		}
>  	}
> +	shrinker_add_scan(&nr_to_scan, scanned, global, nr_reclaimable,
> +				DEFAULT_SEEKS);

That's not thread safe - it's under a read lock. This code really
needs a shrinker context....

> +	if (!(gfp_mask & __GFP_FS)) {
> +		up_read(&xfs_mount_list_lock);
> +		return 0;
> +	}
> +
> +done:
> +	nr = shrinker_do_scan(&nr_to_scan, SHRINK_BATCH);
> +	if (!nr) {
> +		up_read(&xfs_mount_list_lock);
> +		return 0;
> +	}
> +	list_for_each_entry(mp, &xfs_mount_list, m_mplist) {
> +		xfs_inode_ag_iterator(mp, xfs_reclaim_inode, 0,
> +				XFS_ICI_RECLAIM_TAG, 1, &nr);
> +		if (nr <= 0)
> +			goto done;
> +	}

That's missing conditional reschedules....

Cheers,

Dave.
-- 
Dave Chinner
david@fromorbit.com
