* [PATCH] Avoid useless inodes and dentries reclamation
  From: Tim Chen @ 2013-08-28 21:52 UTC
  To: Alexander Viro, Jan Kara, Dave Chinner
  Cc: Dave Hansen, Andi Kleen, Matthew Wilcox, linux-fsdevel, linux-kernel

This patch detects when the counts of free inodes and dentries are very
low and skips their reclamation, so that we do not contend uselessly on
the global sb_lock under memory pressure.  Otherwise we create a log jam
of threads trying to acquire the sb_lock in prune_super(), with little
or no freed memory to show for the effort.

The profile below is from a multi-threaded large-file read exerting
memory pressure through page cache usage.  The cpu-cycles profile is
dominated by sb_lock contention.  The patch eliminates the sb_lock
contention in prune_super() almost entirely.

    43.94% usemem  [kernel.kallsyms]  [k] _raw_spin_lock
           |
           --- _raw_spin_lock
              |
              |--32.44%-- grab_super_passive
              |           prune_super
              |           shrink_slab
              |           do_try_to_free_pages
              |           try_to_free_pages
              |           __alloc_pages_nodemask
              |           alloc_pages_current
              |
              |--32.18%-- put_super
              |           drop_super
              |           prune_super
              |           shrink_slab
              |           do_try_to_free_pages
              |           try_to_free_pages
              |           __alloc_pages_nodemask
              |           alloc_pages_current

Signed-off-by: Tim Chen <tim.c.chen@linux.intel.com>
---
 fs/super.c | 8 ++++++++
 1 file changed, 8 insertions(+)

diff --git a/fs/super.c b/fs/super.c
index 68307c0..70fa26c 100644
--- a/fs/super.c
+++ b/fs/super.c
@@ -53,6 +53,7 @@ static char *sb_writers_name[SB_FREEZE_LEVELS] = {
  * shrinker path and that leads to deadlock on the shrinker_rwsem. Hence we
  * take a passive reference to the superblock to avoid this from occurring.
  */
+#define SB_CACHE_LOW 5
 static int prune_super(struct shrinker *shrink, struct shrink_control *sc)
 {
 	struct super_block *sb;
@@ -68,6 +69,13 @@ static int prune_super(struct shrinker *shrink, struct shrink_control *sc)
 	if (sc->nr_to_scan && !(sc->gfp_mask & __GFP_FS))
 		return -1;
 
+	/*
+	 * Don't prune if we have few cached objects to reclaim to
+	 * avoid useless sb_lock contention
+	 */
+	if ((sb->s_nr_dentry_unused + sb->s_nr_inodes_unused) <= SB_CACHE_LOW)
+		return -1;
+
 	if (!grab_super_passive(sb))
 		return -1;

--
1.7.11.7
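
[Context: the two hot paths in the profile are the entry and exit of
prune_super()'s passive superblock reference.  The sketch below is
paraphrased from fs/super.c of roughly this era, not verbatim, and
details vary between kernel versions; it shows why every shrinker
callout touches the global sb_lock on the way in, and again via
drop_super()->put_super() on the way out:]

    /*
     * Sketch (paraphrased): take a passive reference to the superblock
     * under the global sb_lock.  This is hot-path #1 in the profile.
     */
    bool grab_super_passive(struct super_block *sb)
    {
            spin_lock(&sb_lock);                    /* global lock, hit #1 */
            if (hlist_unhashed(&sb->s_instances)) {
                    spin_unlock(&sb_lock);
                    return false;
            }
            sb->s_count++;                          /* passive ref, no s_active */
            spin_unlock(&sb_lock);

            if (down_read_trylock(&sb->s_umount)) {
                    if (sb->s_root && (sb->s_flags & MS_BORN))
                            return true;
                    up_read(&sb->s_umount);
            }

            put_super(sb);                          /* retakes sb_lock on failure */
            return false;
    }

[On success, the caller's later drop_super() calls put_super(), which
takes sb_lock again to drop s_count -- hot-path #2 in the profile.]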
* Re: [PATCH] Avoid useless inodes and dentries reclamation
  From: Kirill A. Shutemov @ 2013-08-28 21:19 UTC
  To: Tim Chen
  Cc: Alexander Viro, Jan Kara, Dave Chinner, Dave Hansen, Andi Kleen,
      Matthew Wilcox, linux-fsdevel, linux-kernel

On Wed, Aug 28, 2013 at 02:52:12PM -0700, Tim Chen wrote:
> This patch detects when the counts of free inodes and dentries are very
> low and skips their reclamation, so that we do not contend uselessly on
> the global sb_lock under memory pressure.  Otherwise we create a log jam
> of threads trying to acquire the sb_lock in prune_super(), with little
> or no freed memory to show for the effort.
>
> [... profile snipped ...]
>
> Signed-off-by: Tim Chen <tim.c.chen@linux.intel.com>
> ---
>  fs/super.c | 8 ++++++++
>  1 file changed, 8 insertions(+)
>
> diff --git a/fs/super.c b/fs/super.c
> index 68307c0..70fa26c 100644
> --- a/fs/super.c
> +++ b/fs/super.c
> @@ -53,6 +53,7 @@ static char *sb_writers_name[SB_FREEZE_LEVELS] = {
>   * shrinker path and that leads to deadlock on the shrinker_rwsem. Hence we
>   * take a passive reference to the superblock to avoid this from occurring.
>   */
> +#define SB_CACHE_LOW 5
>  static int prune_super(struct shrinker *shrink, struct shrink_control *sc)
>  {
>  	struct super_block *sb;
> @@ -68,6 +69,13 @@ static int prune_super(struct shrinker *shrink, struct shrink_control *sc)
>  	if (sc->nr_to_scan && !(sc->gfp_mask & __GFP_FS))
>  		return -1;
>
> +	/*
> +	 * Don't prune if we have few cached objects to reclaim to
> +	 * avoid useless sb_lock contention
> +	 */
> +	if ((sb->s_nr_dentry_unused + sb->s_nr_inodes_unused) <= SB_CACHE_LOW)
> +		return -1;

I don't think it's correct: you don't account fs_objects here, and
prune_icache_sb() calls invalidate_mapping_pages(), which can free a
lot of memory.  It's too naive an approach; you can easily miss a
memory hog this way.

> +
>  	if (!grab_super_passive(sb))
>  		return -1;

--
 Kirill A. Shutemov
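
[Context for Kirill's point: in this era's fs/inode.c, prune_icache_sb()
drops the page cache attached to unused inodes as it isolates them, so
even a short inode LRU can release far more memory than the raw object
count suggests.  The fragment below is paraphrased from the LRU walk in
prune_icache_sb() of that period, not verbatim:]

    /*
     * Inside prune_icache_sb()'s LRU walk (paraphrased): an unused
     * inode that still has buffers or page cache attached gets those
     * pages invalidated before the inode itself is reclaimed.
     */
    if (inode_has_buffers(inode) || inode->i_data.nrpages) {
            __iget(inode);
            spin_unlock(&inode->i_lock);
            spin_unlock(&sb->s_inode_lru_lock);
            if (remove_inode_buffers(inode))
                    reap += invalidate_mapping_pages(&inode->i_data, 0, -1);
            iput(inode);
            spin_lock(&sb->s_inode_lru_lock);
            continue;
    }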
* Re: [PATCH] Avoid useless inodes and dentries reclamation
  From: Tim Chen @ 2013-08-28 22:54 UTC
  To: Kirill A. Shutemov
  Cc: Alexander Viro, Jan Kara, Dave Chinner, Dave Hansen, Andi Kleen,
      Matthew Wilcox, linux-fsdevel, linux-kernel

On Thu, 2013-08-29 at 00:19 +0300, Kirill A. Shutemov wrote:
> > @@ -68,6 +69,13 @@ static int prune_super(struct shrinker *shrink, struct shrink_control *sc)
> >  	if (sc->nr_to_scan && !(sc->gfp_mask & __GFP_FS))
> >  		return -1;
> >
> > +	/*
> > +	 * Don't prune if we have few cached objects to reclaim to
> > +	 * avoid useless sb_lock contention
> > +	 */
> > +	if ((sb->s_nr_dentry_unused + sb->s_nr_inodes_unused) <= SB_CACHE_LOW)
> > +		return -1;
>
> I don't think it's correct: you don't account fs_objects here, and
> prune_icache_sb() calls invalidate_mapping_pages(), which can free a
> lot of memory.  It's too naive an approach; you can easily miss a
> memory hog this way.

Is it safe to call sb->s_op->nr_cached_objects(sb) (assuming a non-NULL
s_op) without holding sb_lock to take a reference count on the sb?  I
think it is safe: we hold the shrinker_rwsem, so we cannot unregister
the shrinker, and the s_op and sb structures should still be there.
However, I'm not totally sure.

Tim

Signed-off-by: Tim Chen <tim.c.chen@linux.intel.com>
---
diff --git a/fs/super.c b/fs/super.c
index 68307c0..173d0d9 100644
--- a/fs/super.c
+++ b/fs/super.c
@@ -53,6 +53,7 @@ static char *sb_writers_name[SB_FREEZE_LEVELS] = {
  * shrinker path and that leads to deadlock on the shrinker_rwsem. Hence we
  * take a passive reference to the superblock to avoid this from occurring.
  */
+#define SB_CACHE_LOW 5
 static int prune_super(struct shrinker *shrink, struct shrink_control *sc)
 {
 	struct super_block *sb;
@@ -68,6 +69,17 @@ static int prune_super(struct shrinker *shrink, struct shrink_control *sc)
 	if (sc->nr_to_scan && !(sc->gfp_mask & __GFP_FS))
 		return -1;
 
+	total_objects = sb->s_nr_dentry_unused + sb->s_nr_inodes_unused;
+	if (sb->s_op && sb->s_op->nr_cached_objects)
+		total_objects += sb->s_op->nr_cached_objects(sb);
+
+	/*
+	 * Don't prune if we have few cached objects to reclaim to
+	 * avoid useless sb_lock contention
+	 */
+	if (total_objects <= SB_CACHE_LOW)
+		return -1;
+
 	if (!grab_super_passive(sb))
 		return -1;
* Re: [PATCH] Avoid useless inodes and dentries reclamation
  From: Dave Chinner @ 2013-08-29 11:07 UTC
  To: Tim Chen
  Cc: Alexander Viro, Jan Kara, Dave Chinner, Dave Hansen, Andi Kleen,
      Matthew Wilcox, linux-fsdevel, linux-kernel

On Wed, Aug 28, 2013 at 02:52:12PM -0700, Tim Chen wrote:
> This patch detects when the counts of free inodes and dentries are very
> low and skips their reclamation, so that we do not contend uselessly on
> the global sb_lock under memory pressure.  Otherwise we create a log jam
> of threads trying to acquire the sb_lock in prune_super(), with little
> or no freed memory to show for the effort.
>
> [... profile and diff header snipped ...]
>
> @@ -68,6 +69,13 @@ static int prune_super(struct shrinker *shrink, struct shrink_control *sc)
>  	if (sc->nr_to_scan && !(sc->gfp_mask & __GFP_FS))
>  		return -1;
>
> +	/*
> +	 * Don't prune if we have few cached objects to reclaim to
> +	 * avoid useless sb_lock contention
> +	 */
> +	if ((sb->s_nr_dentry_unused + sb->s_nr_inodes_unused) <= SB_CACHE_LOW)
> +		return -1;

Those counters no longer exist in the current mmotm tree and the
shrinker infrastructure is somewhat different, so this patch isn't the
right way to solve this problem.

Given that superblock LRUs and shrinkers in mmotm are node aware, there
may even be more pressure on the sb_lock in such a workload.  I think
the right way to deal with this is to give the shrinker itself a
"minimum call count", so that we can avoid even attempting to shrink
caches that don't have enough entries in them to be worthwhile
shrinking.

That said, the memcg guys have been saying that even small numbers of
items per cache can be meaningful in terms of memory reclaim (e.g. when
there are lots of memcgs), so such a threshold might only be
appropriate for caches that are not memcg controlled.  In that case,
handling it in the shrinker infrastructure itself is a much better idea
than hacking thresholds into individual shrinker callouts.

Cheers,

Dave.
--
Dave Chinner
david@fromorbit.com
* Re: [PATCH] Avoid useless inodes and dentries reclamation
  From: Tim Chen @ 2013-08-29 18:07 UTC
  To: Dave Chinner
  Cc: Alexander Viro, Jan Kara, Dave Chinner, Dave Hansen, Andi Kleen,
      Matthew Wilcox, linux-fsdevel, linux-kernel

> > +	/*
> > +	 * Don't prune if we have few cached objects to reclaim to
> > +	 * avoid useless sb_lock contention
> > +	 */
> > +	if ((sb->s_nr_dentry_unused + sb->s_nr_inodes_unused) <= SB_CACHE_LOW)
> > +		return -1;
>
> Those counters no longer exist in the current mmotm tree and the
> shrinker infrastructure is somewhat different, so this patch isn't the
> right way to solve this problem.

These changes in the mmotm tree do complicate solutions for this
scalability issue.

> Given that superblock LRUs and shrinkers in mmotm are node aware, there
> may even be more pressure on the sb_lock in such a workload.  I think
> the right way to deal with this is to give the shrinker itself a
> "minimum call count", so that we can avoid even attempting to shrink
> caches that don't have enough entries in them to be worthwhile
> shrinking.

By "minimum call count", do you mean tracking the number of free
entries per node in the shrinker, and invoking the shrinker only when
the number of free entries exceeds the "minimum call count"?

There is some cost in list_lru_count_node to get the number of free
entries, as we need to acquire the node's lru lock.  Alternatively, we
could have list_add/list_del set a per-node flag when the count goes
above/below a threshold, and invoke the shrinker based on that flag.

Or do you mean that if we do not reap any memory in a shrink operation,
we back off a certain number of subsequent shrink operations, until the
"minimum call count" is reached?

> That said, the memcg guys have been saying that even small numbers of
> items per cache can be meaningful in terms of memory reclaim (e.g. when
> there are lots of memcgs), so such a threshold might only be
> appropriate for caches that are not memcg controlled.

I've done some experiments with the cache thresholds.  Even setting the
threshold at 0 (i.e. pruning is skipped only when there are no free
entries at all) removes almost all of the needless contention.  That
should make the memcg guys happy by not holding on to any extra free
entries.

> In that case, handling it in the shrinker infrastructure itself is a
> much better idea than hacking thresholds into individual shrinker
> callouts.

Currently the problem is mostly with the sb shrinker, due to the
sb_lock.  If we can have a general solution, that will be even better.

Thanks.

Tim
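
[To make the "minimum call count" idea concrete: one possible shape for
it in the mmotm shrink_slab_node() path is sketched below.  This is
purely illustrative -- the ->min_count field and its placement are
invented for the example and are not from any actual tree:]

    /*
     * Hypothetical sketch: skip the scan callout entirely when the
     * cheap ->count_objects() result is below a per-shrinker floor.
     * ->min_count is an invented field, shown only for illustration.
     */
    static unsigned long
    shrink_slab_node(struct shrink_control *shrinkctl,
                     struct shrinker *shrinker)
    {
            unsigned long max_pass;

            max_pass = shrinker->count_objects(shrinker, shrinkctl);
            if (max_pass <= shrinker->min_count)    /* hypothetical */
                    return 0;       /* not worth the locking cost of a scan */

            /* ... existing batching and ->scan_objects() loop ... */
    }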
* Re: [PATCH] Avoid useless inodes and dentries reclamation
  From: Dave Hansen @ 2013-08-29 18:36 UTC
  To: Tim Chen
  Cc: Dave Chinner, Alexander Viro, Jan Kara, Dave Chinner, Andi Kleen,
      Matthew Wilcox, linux-fsdevel, linux-kernel

The new shrinker infrastructure in mmotm looks like it will make this
problem worse.

Old code:

	shrink_slab()
	{
		for_each_shrinker {
			do_shrinker_shrink();	// one per batch
				prune_super()
					grab_super_passive()
		}
	}

Which means we've got at _most_ one grab_super_passive() per batch.

The new code is something like this:

	shrink_slab()
	{
		list_for_each_entry(shrinker, &shrinker_list, list) {
			for_each_node_mask(... shrinkctl->nodes_to_scan) {
				shrink_slab_node()
			}
		}
	}

	shrink_slab_node()
	{
		max_pass = shrinker->count_objects(shrinker, shrinkctl);
		// ^^ does grab_super_passive()
		...
		while (total_scan >= batch_size) {
			ret = shrinker->scan_objects(shrinker, shrinkctl);
			// ^^ does grab_super_passive()
		}
	}

We've got extra grab_super_passive()s in the case where we are actually
doing a scan, plus we've got the extra for_each_node_mask() loop.  That
means even more lock acquisitions in the multi-node NUMA case, which is
exactly where we want to get rid of global lock acquisitions.
* Re: [PATCH] Avoid useless inodes and dentries reclamation
  From: Dave Chinner @ 2013-08-30 1:56 UTC
  To: Dave Hansen
  Cc: Tim Chen, Alexander Viro, Jan Kara, Dave Chinner, Andi Kleen,
      Matthew Wilcox, linux-fsdevel, linux-kernel

On Thu, Aug 29, 2013 at 11:36:10AM -0700, Dave Hansen wrote:
> The new shrinker infrastructure in mmotm looks like it will make this
> problem worse.
>
> Old code:
>
> 	shrink_slab()
> 	{
> 		for_each_shrinker {
> 			do_shrinker_shrink();	// one per batch
> 				prune_super()
> 					grab_super_passive()
> 		}
> 	}

I think you've simplified it down too far.  The current code does:

	for_each_shrinker {
		max_pass = do_shrinker_shrink(0);
		// ^^ does grab_super_passive()

		while (total_scan >= batch_size) {
			do_shrinker_shrink(0);
			// ^^ does grab_super_passive()
			do_shrinker_shrink(batch_size);
			// ^^ does grab_super_passive()
		}
	}

> Which means we've got at _most_ one grab_super_passive() per batch.

No, there are two: one count and one scan per batch.

> The new code is something like this:
>
> 	shrink_slab()
> 	{
> 		list_for_each_entry(shrinker, &shrinker_list, list) {
> 			for_each_node_mask(... shrinkctl->nodes_to_scan) {
> 				shrink_slab_node()
> 			}
> 		}
> 	}

Right, but what you are missing here is that the nodemask passed in to
shrink_slab() only has a single node bit set during reclaim - the bit
that matches the zone being reclaimed from.  drop_slab(), OTOH, does:

	nodes_setall(shrink.nodes_to_scan);

before calling shrink_slab() in a loop, because it's trying to free
*everything*, and that's why the shrink_slab() code handles that case.

> 	shrink_slab_node()
> 	{
> 		max_pass = shrinker->count_objects(shrinker, shrinkctl);
> 		// ^^ does grab_super_passive()
> 		...
> 		while (total_scan >= batch_size) {
> 			ret = shrinker->scan_objects(shrinker, shrinkctl);
> 			// ^^ does grab_super_passive()
> 		}
> 	}
>
> We've got extra grab_super_passive()s in the case where we are actually
> doing a scan, plus we've got the extra for_each_node_mask() loop.  That
> means even more lock acquisitions in the multi-node NUMA case, which is
> exactly where we want to get rid of global lock acquisitions.

I disagree.  With direct memory reclaim, we have an identical number of
calls to shrink_slab() occurring, and each targets a single node; hence
there is typically a 1:1 call ratio of shrink_slab() to
shrink_slab_node().  And because shrink_slab_node() has one less
callout per batch iteration, there is an overall reduction in the
number of grab_super_passive() calls from the shrinker.  The worst case
is no change; the best case is a 50% reduction in the number of calls.

Cheers,

Dave.
--
Dave Chinner
david@fromorbit.com
* Re: [PATCH] Avoid useless inodes and dentries reclamation
  From: Dave Chinner @ 2013-08-30 1:40 UTC
  To: Tim Chen
  Cc: Alexander Viro, Jan Kara, Dave Chinner, Dave Hansen, Andi Kleen,
      Matthew Wilcox, linux-fsdevel, linux-kernel

On Thu, Aug 29, 2013 at 11:07:56AM -0700, Tim Chen wrote:
> > Given that superblock LRUs and shrinkers in mmotm are node aware, there
> > may even be more pressure on the sb_lock in such a workload.  I think
> > the right way to deal with this is to give the shrinker itself a
> > "minimum call count", so that we can avoid even attempting to shrink
> > caches that don't have enough entries in them to be worthwhile
> > shrinking.
>
> By "minimum call count", do you mean tracking the number of free
> entries per node in the shrinker, and invoking the shrinker only when
> the number of free entries exceeds the "minimum call count"?

The new shrinker infrastructure has a ->count_objects() callout to
specifically return the number of objects in the cache.
shrink_slab_node() can check that return value against the "minimum
call count" and determine whether it needs to call ->scan_objects() at
all.

Actually, the shrinker already behaves like this with the batch_size
variable - the shrinker has to be asking for more items to be scanned
than the batch size.  That means it is the counting callouts that are
causing the problem, not the scanning callouts.

With the new code in the mmotm tree, for counting purposes we probably
don't need to grab a passive superblock reference at all - the
superblock and LRUs are guaranteed to be valid if the shrinker is
currently running, but we don't really care if the superblock is being
shut down and the values that come back are invalid, because the
->scan_objects() callout will fail to grab the superblock to do
anything with the calculated values.

Indeed, I've made no attempt to optimise the code in the mmotm tree -
I've been concerned with correctness.  The fact that, without any
optimisation, it significantly lessens contention in my testing has
been sufficient to move forward with.

> There is some cost in list_lru_count_node to get the number of free
> entries, as we need to acquire the node's lru lock.

Right, but we really don't need the node's lru lock to get the count -
reading the count is racy from an external perspective anyway, so we
can do lockless counting here.  See this patch I proposed a while back,
for example:

https://lkml.org/lkml/2013/7/31/7

> > That said, the memcg guys have been saying that even small numbers of
> > items per cache can be meaningful in terms of memory reclaim (e.g. when
> > there are lots of memcgs), so such a threshold might only be
> > appropriate for caches that are not memcg controlled.
>
> I've done some experiments with the cache thresholds.  Even setting the
> threshold at 0 (i.e. pruning is skipped only when there are no free
> entries at all) removes almost all of the needless contention.  That
> should make the memcg guys happy by not holding on to any extra free
> entries.

Ok, that's good to know, and it further indicates that it is the
->count_objects() callout that is the issue, not the scanning/freeing.

Cheers,

Dave.
--
Dave Chinner
david@fromorbit.com
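
[The lockless counting Dave points to amounts to reading the per-node
item count without taking the node's lru lock.  A sketch in the spirit
of the linked patch -- paraphrased, not the exact code; field names
follow the list_lru code of that period:]

    /*
     * Sketch: return the per-node LRU count without taking nlru->lock.
     * The racy read is acceptable because the count is only an
     * advisory input to reclaim heuristics, and it could change the
     * instant a lock would have been released anyway.
     */
    unsigned long list_lru_count_node(struct list_lru *lru, int nid)
    {
            struct list_lru_node *nlru = &lru->node[nid];

            return ACCESS_ONCE(nlru->nr_items);
    }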
* Re: [PATCH] Avoid useless inodes and dentries reclamation
  From: Tim Chen @ 2013-08-30 16:21 UTC
  To: Dave Chinner
  Cc: Alexander Viro, Jan Kara, Dave Chinner, Dave Hansen, Andi Kleen,
      Matthew Wilcox, linux-fsdevel, linux-kernel

On Fri, 2013-08-30 at 11:40 +1000, Dave Chinner wrote:
> The new shrinker infrastructure has a ->count_objects() callout to
> specifically return the number of objects in the cache.
> shrink_slab_node() can check that return value against the "minimum
> call count" and determine whether it needs to call ->scan_objects() at
> all.
>
> Actually, the shrinker already behaves like this with the batch_size
> variable - the shrinker has to be asking for more items to be scanned
> than the batch size.  That means it is the counting callouts that are
> causing the problem, not the scanning callouts.
>
> With the new code in the mmotm tree, for counting purposes we probably
> don't need to grab a passive superblock reference at all - the
> superblock and LRUs are guaranteed to be valid if the shrinker is
> currently running, but we don't really care if the superblock is being
> shut down and the values that come back are invalid, because the
> ->scan_objects() callout will fail to grab the superblock to do
> anything with the calculated values.

If that's the case, then we should remove grab_super_passive from the
super_cache_count code.  That should remove the bottleneck in
reclamation.

Thanks for your detailed explanation.

Tim

Signed-off-by: Tim Chen <tim.c.chen@linux.intel.com>
---
diff --git a/fs/super.c b/fs/super.c
index 73d0952..4df1fab 100644
--- a/fs/super.c
+++ b/fs/super.c
@@ -112,9 +112,6 @@ static unsigned long super_cache_count(struct shrinker *shrink,
 
 	sb = container_of(shrink, struct super_block, s_shrink);
 
-	if (!grab_super_passive(sb))
-		return 0;
-
 	if (sb->s_op && sb->s_op->nr_cached_objects)
 		total_objects = sb->s_op->nr_cached_objects(sb,
 						 sc->nid);
* Re: [PATCH] Avoid useless inodes and dentries reclamation
  From: Dave Chinner @ 2013-08-31 9:00 UTC
  To: Tim Chen
  Cc: Alexander Viro, Jan Kara, Dave Chinner, Dave Hansen, Andi Kleen,
      Matthew Wilcox, linux-fsdevel, linux-kernel

On Fri, Aug 30, 2013 at 09:21:34AM -0700, Tim Chen wrote:
> If that's the case, then we should remove grab_super_passive from the
> super_cache_count code.  That should remove the bottleneck in
> reclamation.
>
> Thanks for your detailed explanation.
>
> Tim
>
> Signed-off-by: Tim Chen <tim.c.chen@linux.intel.com>
> ---
> diff --git a/fs/super.c b/fs/super.c
> index 73d0952..4df1fab 100644
> --- a/fs/super.c
> +++ b/fs/super.c
> @@ -112,9 +112,6 @@ static unsigned long super_cache_count(struct shrinker *shrink,
>  
>  	sb = container_of(shrink, struct super_block, s_shrink);
>  
> -	if (!grab_super_passive(sb))
> -		return 0;
> -

I think the function needs a comment explaining why we aren't grabbing
the sb here, otherwise people are going to read the code and ask why
it's different to the scanning callout.

>  	if (sb->s_op && sb->s_op->nr_cached_objects)
>  		total_objects = sb->s_op->nr_cached_objects(sb,
>  						 sc->nid);

But seeing this triggered further thought on my part.  Being called
during unmount means that ->nr_cached_objects() implementations need to
be robust against unmount tearing down private filesystem structures.
Right now, grab_super_passive() protects us from that, because it won't
be able to get the sb->s_umount lock while generic_shutdown_super() is
doing its work.

IOWs, the superblock-based shrinker operations are safe because the
structures don't get torn down until after the shrinker is
unregistered.  That's not true for the structures that
->nr_cached_objects() uses: ->put_super() tears them down before the
shrinker is unregistered, and only grab_super_passive() protects us
from that.

Let me have a bit more of a think about this - the solution may simply
be unregistering the shrinker before we call ->kill_sb(), so the
shrinker can't get called while we are tearing down the fs.  First,
though, I need to go back and remind myself of why I put that after
->kill_sb() in the first place.  If we unregister the shrinker before
->kill_sb() is called, then we can probably get rid of
grab_super_passive() in both shrinker callouts, because they will no
longer need to handle running concurrently with ->kill_sb()....

Cheers,

Dave.
--
Dave Chinner
david@fromorbit.com
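
[To illustrate the hazard Dave describes: ->nr_cached_objects() is a
super_operations hook whose implementation typically dereferences
per-filesystem state that ->put_super() frees.  The sketch below is a
hypothetical filesystem, not any in-tree implementation -- myfs_mount
and m_nr_reclaimable are invented names:]

    /*
     * Hypothetical ->nr_cached_objects() implementation.  If the
     * shrinker can still run after ->put_super() has freed s_fs_info,
     * this read becomes a use-after-free -- exactly the window that
     * grab_super_passive(), or unregistering the shrinker before
     * ->kill_sb(), is meant to close.
     */
    static long myfs_nr_cached_objects(struct super_block *sb, int nid)
    {
            struct myfs_mount *mp = sb->s_fs_info; /* freed by ->put_super() */

            return mp->m_nr_reclaimable;
    }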
* Re: [PATCH] Avoid useless inodes and dentries reclamation
  From: Tim Chen @ 2013-09-03 18:38 UTC
  To: Dave Chinner
  Cc: Alexander Viro, Jan Kara, Dave Chinner, Dave Hansen, Andi Kleen,
      Matthew Wilcox, linux-fsdevel, linux-kernel, Kirill A. Shutemov

On Sat, 2013-08-31 at 19:00 +1000, Dave Chinner wrote:
> I think the function needs a comment explaining why we aren't grabbing
> the sb here, otherwise people are going to read the code and ask why
> it's different to the scanning callout.

Yes, those comments are needed.  I also need to remove the
corresponding drop_super(sb).  So probably something like:

---
diff --git a/fs/super.c b/fs/super.c
index 73d0952..7b5a6e5 100644
--- a/fs/super.c
+++ b/fs/super.c
@@ -112,9 +112,14 @@ static unsigned long super_cache_count(struct shrinker *shrink,
 
 	sb = container_of(shrink, struct super_block, s_shrink);
 
-	if (!grab_super_passive(sb))
-		return 0;
-
+	/*
+	 * Don't call grab_super_passive as it is a potential
+	 * scalability bottleneck. The counts could get updated
+	 * between super_cache_count and super_cache_scan anyway.
+	 * Call to super_cache_count with shrinker_rwsem held
+	 * ensures the safety of call to list_lru_count_node() and
+	 * s_op->nr_cached_objects().
+	 */
 	if (sb->s_op && sb->s_op->nr_cached_objects)
 		total_objects = sb->s_op->nr_cached_objects(sb,
 						 sc->nid);
@@ -125,7 +130,6 @@ static unsigned long super_cache_count(struct shrinker *shrink,
 						 sc->nid);
 
 	total_objects = vfs_pressure_ratio(total_objects);
-	drop_super(sb);
 	return total_objects;
 }

> Let me have a bit more of a think about this - the solution may simply
> be unregistering the shrinker before we call ->kill_sb(), so the
> shrinker can't get called while we are tearing down the fs.  First,
> though, I need to go back and remind myself of why I put that after
> ->kill_sb() in the first place.

Seems very reasonable.  I haven't found a case where the shrinker is
touched in ->kill_sb() yet, so it looks like unregistering the shrinker
before ->kill_sb() should be okay.

> If we unregister the shrinker before ->kill_sb() is called, then we can
> probably get rid of grab_super_passive() in both shrinker callouts,
> because they will no longer need to handle running concurrently with
> ->kill_sb()....

Thanks.

Tim
* Re: [PATCH] Avoid useless inodes and dentries reclamation
  From: Dave Chinner @ 2013-09-06 0:55 UTC
  To: Tim Chen
  Cc: Alexander Viro, Jan Kara, Dave Chinner, Dave Hansen, Andi Kleen,
      Matthew Wilcox, linux-fsdevel, linux-kernel, Kirill A. Shutemov

On Tue, Sep 03, 2013 at 11:38:27AM -0700, Tim Chen wrote:
> Yes, those comments are needed.  I also need to remove the
> corresponding drop_super(sb).  So probably something like:
>
> [... diff header snipped ...]
>
> +	/*
> +	 * Don't call grab_super_passive as it is a potential
> +	 * scalability bottleneck. The counts could get updated
> +	 * between super_cache_count and super_cache_scan anyway.
> +	 * Call to super_cache_count with shrinker_rwsem held
> +	 * ensures the safety of call to list_lru_count_node() and
> +	 * s_op->nr_cached_objects().
> +	 */

Well, that's not true of s_op->nr_cached_objects() right now.  It's
only going to be true if the shrinker deregistration is moved before
->kill_sb()....

> > Let me have a bit more of a think about this - the solution may simply
> > be unregistering the shrinker before we call ->kill_sb(), so the
> > shrinker can't get called while we are tearing down the fs.  First,
> > though, I need to go back and remind myself of why I put that after
> > ->kill_sb() in the first place.
>
> Seems very reasonable.  I haven't found a case where the shrinker is
> touched in ->kill_sb() yet, so it looks like unregistering the shrinker
> before ->kill_sb() should be okay.

Having looked at it some more, I have to agree.  I think the original
reason for unregistering the shrinker there was to avoid problems with
locking - the shrinker callouts are run holding the shrinker_rwsem in
read mode, and then we take the sb->s_umount lock in read mode.  In the
unmount case, we currently take the sb->s_umount lock in write mode
(thereby locking out the shrinker), but we drop it before deregistering
the shrinker, and so there is no inverted locking order.

The thing is, grab_super_passive() does a try-lock on the sb->s_umount
now, and so if we are in the unmount process, it won't ever block.
That means what used to be a deadlock and the races we were avoiding by
using grab_super_passive() are now:

	shrinker				umount

	down_read(shrinker_rwsem)
						down_write(sb->s_umount)
						shrinker_unregister
						  down_write(shrinker_rwsem)
						  <blocks>
	grab_super_passive(sb)
	  down_read_trylock(sb->s_umount)
	  <fails>
	<shrinker aborts>
	....
	<shrinkers finish running>
	up_read(shrinker_rwsem)
						  <unblocks>
						  <removes shrinker>
						  up_write(shrinker_rwsem)
						->kill_sb()
						....

And so it appears to be safe to deregister the shrinker before
->kill_sb().

Can you do this as two patches?  The first moves the shrinker
deregistration to before ->kill_sb(); the second is the above patch
that drops the grab_super_passive() calls from the ->count_objects()
function.

Cheers,

Dave.
--
Dave Chinner
david@fromorbit.com
* Re: [PATCH] Avoid useless inodes and dentries reclamation
  From: Tim Chen @ 2013-09-06 18:26 UTC
  To: Dave Chinner
  Cc: Alexander Viro, Jan Kara, Dave Chinner, Dave Hansen, Andi Kleen,
      Matthew Wilcox, linux-fsdevel, linux-kernel, Kirill A. Shutemov

On Fri, 2013-09-06 at 10:55 +1000, Dave Chinner wrote:
> And so it appears to be safe to deregister the shrinker before
> ->kill_sb().
>
> Can you do this as two patches?  The first moves the shrinker
> deregistration to before ->kill_sb(); the second is the above patch
> that drops the grab_super_passive() calls from the ->count_objects()
> function.

I've sent the patches in a separate mail thread.  Please add your
sign-off if the patches look okay.

Thanks.

Tim