* Re: [PATCH] mm/vmscan: Do not block forever at shrink_inactive_list(). [not found] <201405192340.FCD48964.OFQHOOJLVSFFMt@I-love.SAKURA.ne.jp> @ 2014-05-20 0:44 ` Dave Chinner 2014-05-20 3:54 ` Tetsuo Handa 2014-05-20 5:59 ` Andrew Morton 0 siblings, 2 replies; 15+ messages in thread From: Dave Chinner @ 2014-05-20 0:44 UTC (permalink / raw) To: Tetsuo Handa Cc: riel, linux-kernel, xfs, kosaki.motohiro, fengguang.wu, kamezawa.hiroyu [cc xfs@oss.sgi.com] On Mon, May 19, 2014 at 11:40:46PM +0900, Tetsuo Handa wrote: > >From f016db5d7f84d6321132150b13c5888ef67d694f Mon Sep 17 00:00:00 2001 > From: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp> > Date: Mon, 19 May 2014 23:24:11 +0900 > Subject: [PATCH] mm/vmscan: Do not block forever at shrink_inactive_list(). > > I can observe that commit 35cd7815 "vmscan: throttle direct reclaim when > too many pages are isolated already" causes RHEL7 environment to stall with > 0% CPU usage when a certain type of memory pressure is given. > > Upon memory pressure, kswapd calls xfs_vm_writepage() from shrink_page_list(). > xfs_vm_writepage() eventually calls wait_for_completion() which waits for > xfs_bmapi_allocate_worker(). > > Then, a kernel worker thread calls xfs_bmapi_allocate_worker() and > xfs_bmapi_allocate_worker() eventually calls xfs_btree_lookup_get_block(). > xfs_btree_lookup_get_block() eventually calls alloc_page(). > alloc_page() eventually calls shrink_inactive_list(). > > The stack trace showed that the kernel worker thread which the kswapd is > waiting for was looping at a while loop in shrink_inactive_list(). [snip stack traces indicating XFS is deferring allocation work to the allocation workqueue on behalf of kswapd] > Since the kernel worker thread needs to escape from the while loop so that > alloc_page() can allocate memory (and eventually allow xfs_vm_writepage() > to release memory), I think that we should not block forever. 
This patch > introduces a 30 second timeout for userspace processes and a 5 second timeout > for kernel processes. That's like trying to swat a fly with a runaway train. Eventually you'll hit the fly with the train.... Q: What is missing from the workqueue context that results in the workqueue being blocked in memory reclaim on congestion? Hint: XFS is deferring allocation on behalf of kswapd to a workqueue. A: PF_KSWAPD. shrink_inactive_list() has this guard: /* * Stall direct reclaim for IO completions if underlying BDIs or zone * is congested. Allow kswapd to continue until it starts encountering * unqueued dirty pages or cycling through the LRU too quickly. */ if (!sc->hibernation_mode && !current_is_kswapd()) wait_iff_congested(zone, BLK_RW_ASYNC, HZ/10); So, XFS should be passing kswapd context to the workqueue allocation context. The patch below does this. Tetsuo-san, when it comes to problems involving XFS, you should really CC xfs@oss.sgi.com because very few people really know how XFS works and even fewer still know how it is supposed to interact with memory reclaim.... Cheers, Dave. -- Dave Chinner david@fromorbit.com xfs: block allocation work needs to be kswapd aware From: Dave Chinner <dchinner@redhat.com> Upon memory pressure, kswapd calls xfs_vm_writepage() from shrink_page_list(). This can result in delayed allocation occurring and that gets deferred to the allocation workqueue. The allocation then runs outside kswapd context, which means if it needs memory (and it does to demand page metadata from disk) it can block in shrink_inactive_list() waiting for IO congestion. These blocking waits are normally avoided in kswapd context, so under memory pressure writeback from kswapd can be arbitrarily delayed by memory reclaim. 
To avoid this, pass the kswapd context to the allocation being done by the workqueue, so that memory reclaim understands correctly that the work is being done for kswapd and therefore it is not blocked and does not delay memory reclaim. Reported-by: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp> Signed-off-by: Dave Chinner <dchinner@redhat.com> --- fs/xfs/xfs_bmap_util.c | 16 +++++++++++++--- fs/xfs/xfs_bmap_util.h | 1 + 2 files changed, 14 insertions(+), 3 deletions(-) diff --git a/fs/xfs/xfs_bmap_util.c b/fs/xfs/xfs_bmap_util.c index 057f671..703b3ec 100644 --- a/fs/xfs/xfs_bmap_util.c +++ b/fs/xfs/xfs_bmap_util.c @@ -258,14 +258,23 @@ xfs_bmapi_allocate_worker( struct xfs_bmalloca *args = container_of(work, struct xfs_bmalloca, work); unsigned long pflags; + unsigned long new_pflags = PF_FSTRANS; - /* we are in a transaction context here */ - current_set_flags_nested(&pflags, PF_FSTRANS); + /* + * we are in a transaction context here, but may also be doing work + * in kswapd context, and hence we may need to inherit that state + * temporarily to ensure that we don't block waiting for memory reclaim + * in any way. 
+ */ + if (args->kswapd) + new_pflags |= PF_MEMALLOC | PF_SWAPWRITE | PF_KSWAPD; + + current_set_flags_nested(&pflags, new_pflags); args->result = __xfs_bmapi_allocate(args); complete(args->done); - current_restore_flags_nested(&pflags, PF_FSTRANS); + current_restore_flags_nested(&pflags, new_pflags); } /* @@ -284,6 +293,7 @@ xfs_bmapi_allocate( args->done = &done; + args->kswapd = current_is_kswapd(); INIT_WORK_ONSTACK(&args->work, xfs_bmapi_allocate_worker); queue_work(xfs_alloc_wq, &args->work); wait_for_completion(&done); diff --git a/fs/xfs/xfs_bmap_util.h b/fs/xfs/xfs_bmap_util.h index 935ed2b..d227f9d 100644 --- a/fs/xfs/xfs_bmap_util.h +++ b/fs/xfs/xfs_bmap_util.h @@ -56,6 +56,7 @@ struct xfs_bmalloca { char aeof; /* allocated space at eof */ char conv; /* overwriting unwritten extents */ char stack_switch; + char kswapd; /* allocation in kswapd context */ int flags; struct completion *done; struct work_struct work; _______________________________________________ xfs mailing list xfs@oss.sgi.com http://oss.sgi.com/mailman/listinfo/xfs ^ permalink raw reply related [flat|nested] 15+ messages in thread
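Dave's patch depends on the nested set/restore helpers behaving correctly: restoring must drop only the bits the worker set, while preserving any of those bits that were already set when it entered. A minimal userspace sketch of that save/OR/restore discipline — the PF_* values are copied from 3.x-era kernel headers, but `set_flags_nested()`, `restore_flags_nested()`, `bmapi_allocate_worker()`, and `task_flags` here are illustrative stand-ins for the kernel's `current_set_flags_nested()`/`current_restore_flags_nested()` macros and `current->flags`, not the real implementation:

```c
/* PF_* bit values as in 3.x-era include/linux/sched.h */
#define PF_MEMALLOC   0x00000800u
#define PF_KSWAPD     0x00040000u
#define PF_SWAPWRITE  0x00800000u
#define PF_FSTRANS    0x02000000u

static unsigned long task_flags;   /* stand-in for current->flags */

/* Model of current_set_flags_nested(): remember the old flags word,
 * then OR in the bits this context needs. */
static void set_flags_nested(unsigned long *save, unsigned long flags)
{
    *save = task_flags;
    task_flags |= flags;
}

/* Model of current_restore_flags_nested(): clear the bits we set, but
 * put back any of them that were already set before we nested, so an
 * outer nesting level is not disturbed. */
static void restore_flags_nested(const unsigned long *save,
                                 unsigned long flags)
{
    task_flags = (task_flags & ~flags) | (*save & flags);
}

/* The shape of the patched worker: inherit kswapd state only when the
 * work was queued on kswapd's behalf. */
static void bmapi_allocate_worker(int queued_by_kswapd)
{
    unsigned long new_pflags = PF_FSTRANS;
    unsigned long pflags;

    if (queued_by_kswapd)
        new_pflags |= PF_MEMALLOC | PF_SWAPWRITE | PF_KSWAPD;

    set_flags_nested(&pflags, new_pflags);
    /* ... __xfs_bmapi_allocate() would run here, seen as kswapd ... */
    restore_flags_nested(&pflags, new_pflags);
}
```

Note the restore expression: it is what lets the pattern nest, since a bit such as PF_FSTRANS that an outer level had already set survives the inner restore.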
* Re: [PATCH] mm/vmscan: Do not block forever at shrink_inactive_list(). 2014-05-20 0:44 ` [PATCH] mm/vmscan: Do not block forever at shrink_inactive_list() Dave Chinner @ 2014-05-20 3:54 ` Tetsuo Handa 2014-05-20 5:24 ` Dave Chinner 2014-05-20 5:59 ` Andrew Morton 1 sibling, 1 reply; 15+ messages in thread From: Tetsuo Handa @ 2014-05-20 3:54 UTC (permalink / raw) To: Dave Chinner Cc: riel, linux-kernel, xfs, kosaki.motohiro, fengguang.wu, kamezawa.hiroyu Dave Chinner wrote: > So, XFS should be passing kswapd context to the workqueue allocation > context. The patch below does this. > > Tetsuo-san, when it comes to problems involving XFS, you should > really CC xfs@oss.sgi.com because very few people really know how > XFS works and even fewer still know how it is supposed to interact > with memory reclaim.... Thank you for the patch, but ... #define PF_KSWAPD 0x00040000 /* I am kswapd */ static inline int current_is_kswapd(void) { return current->flags & PF_KSWAPD; } I think ((char) (current->flags & 0x00040000)) == 0. Your patch wants -args->kswapd = current_is_kswapd(); +args->kswapd = (current_is_kswapd() != 0); _______________________________________________ xfs mailing list xfs@oss.sgi.com http://oss.sgi.com/mailman/listinfo/xfs ^ permalink raw reply [flat|nested] 15+ messages in thread
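Tetsuo's point can be demonstrated in a few lines of userspace C: PF_KSWAPD is bit 18, so storing the raw return value of the int-returning `current_is_kswapd()` of that era into a `char` field silently truncates it to 0. A sketch, with the flag value copied from the kernel header and `task_flags` standing in for `current->flags`:

```c
#define PF_KSWAPD 0x00040000u   /* bit 18: far above what a char keeps */

static unsigned long task_flags = PF_KSWAPD;  /* pretend we are kswapd */

/* The predicate as it stood in 2014: nonzero when true, but not 1. */
static int current_is_kswapd(void)
{
    return task_flags & PF_KSWAPD;   /* returns 0x40000, never 1 */
}
```

Assigning that 0x40000 to `char kswapd` keeps only the low 8 bits, all of which are zero, so the worker would never see the kswapd flag without the `!= 0` fix.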
* Re: [PATCH] mm/vmscan: Do not block forever at shrink_inactive_list(). 2014-05-20 3:54 ` Tetsuo Handa @ 2014-05-20 5:24 ` Dave Chinner 0 siblings, 0 replies; 15+ messages in thread From: Dave Chinner @ 2014-05-20 5:24 UTC (permalink / raw) To: Tetsuo Handa Cc: riel, linux-kernel, xfs, kosaki.motohiro, fengguang.wu, kamezawa.hiroyu On Tue, May 20, 2014 at 12:54:29PM +0900, Tetsuo Handa wrote: > Dave Chinner wrote: > > So, XFS should be passing kswapd context to the workqueue allocation > > context. The patch below does this. > > > > Tetsuo-san, when it comes to problems involving XFS, you should > > really CC xfs@oss.sgi.com because very few people really know how > > XFS works and even fewer still know how it is supposed to interact > > with memory reclaim.... > > Thank you for the patch, but ... > > #define PF_KSWAPD 0x00040000 /* I am kswapd */ > > static inline int current_is_kswapd(void) > { > return current->flags & PF_KSWAPD; > } > I think ((char) (current->flags & 0x00040000)) == 0. > Your patch wants > > -args->kswapd = current_is_kswapd(); > +args->kswapd = (current_is_kswapd() != 0); Thanks for pointing that out, but I think: -static inline int current_is_kswapd(void) +static inline bool current_is_kswapd(void) is a better solution. It can only be true or false. But regardless, I need to change the boolean options in that XFS structure to be, well, booleans. Cheers, dave. -- Dave Chinner david@fromorbit.com _______________________________________________ xfs mailing list xfs@oss.sgi.com http://oss.sgi.com/mailman/listinfo/xfs ^ permalink raw reply [flat|nested] 15+ messages in thread
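Dave's alternative fix works because C's `_Bool` conversion normalizes any nonzero value to 1 before a narrowing assignment can discard it. A sketch of the bool-returning predicate — `task_flags` is an illustrative stand-in for `current->flags`:

```c
#include <stdbool.h>

#define PF_KSWAPD 0x00040000u

static unsigned long task_flags = PF_KSWAPD;  /* pretend we are kswapd */

/* With a bool return type the masked value (0x40000) is converted to
 * true, i.e. 1, so storing the result in a char field is now safe. */
static bool current_is_kswapd(void)
{
    return task_flags & PF_KSWAPD;
}
```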
* Re: [PATCH] mm/vmscan: Do not block forever at shrink_inactive_list(). 2014-05-20 0:44 ` [PATCH] mm/vmscan: Do not block forever at shrink_inactive_list() Dave Chinner 2014-05-20 3:54 ` Tetsuo Handa @ 2014-05-20 5:59 ` Andrew Morton 2014-05-20 6:03 ` Andrew Morton ` (2 more replies) 1 sibling, 3 replies; 15+ messages in thread From: Andrew Morton @ 2014-05-20 5:59 UTC (permalink / raw) To: Dave Chinner Cc: riel, Tetsuo Handa, linux-kernel, xfs, kosaki.motohiro, fengguang.wu, kamezawa.hiroyu On Tue, 20 May 2014 10:44:49 +1000 Dave Chinner <david@fromorbit.com> wrote: > @@ -258,14 +258,23 @@ xfs_bmapi_allocate_worker( > struct xfs_bmalloca *args = container_of(work, > struct xfs_bmalloca, work); > unsigned long pflags; > + unsigned long new_pflags = PF_FSTRANS; > > - /* we are in a transaction context here */ > - current_set_flags_nested(&pflags, PF_FSTRANS); > + /* > + * we are in a transaction context here, but may also be doing work > + * in kswapd context, and hence we may need to inherit that state > + * temporarily to ensure that we don't block waiting for memory reclaim > + * in any way. > + */ > + if (args->kswapd) > + new_pflags |= PF_MEMALLOC | PF_SWAPWRITE | PF_KSWAPD; So current_is_kswapd() returns true for a thread which is not kswapd. That's a bit smelly. Should this thread really be incrementing KSWAPD_INODESTEAL instead of PGINODESTEAL, for example? current_is_kswapd() does a range of things, only one(?) of which you actually want. It would be cleaner to create a new PF_ flag to select just that behavior. That's a better model than telling the world "I am magic and special". But we're awfully close to running out of PF_ space and I don't know if this ugly justifies consuming a flag. _______________________________________________ xfs mailing list xfs@oss.sgi.com http://oss.sgi.com/mailman/listinfo/xfs ^ permalink raw reply [flat|nested] 15+ messages in thread
* Re: [PATCH] mm/vmscan: Do not block forever at shrink_inactive_list(). 2014-05-20 5:59 ` Andrew Morton @ 2014-05-20 6:03 ` Andrew Morton 2014-05-20 6:33 ` Dave Chinner 2014-05-20 6:30 ` Dave Chinner 2014-05-20 7:20 ` [PATCH] mm/vmscan: Do not block forever at shrink_inactive_list() Christoph Hellwig 2 siblings, 1 reply; 15+ messages in thread From: Andrew Morton @ 2014-05-20 6:03 UTC (permalink / raw) To: Dave Chinner, Tetsuo Handa, riel, kosaki.motohiro, fengguang.wu, kamezawa.hiroyu, linux-kernel, xfs On Mon, 19 May 2014 22:59:15 -0700 Andrew Morton <akpm@linux-foundation.org> wrote: > On Tue, 20 May 2014 10:44:49 +1000 Dave Chinner <david@fromorbit.com> wrote: > > > @@ -258,14 +258,23 @@ xfs_bmapi_allocate_worker( > > struct xfs_bmalloca *args = container_of(work, > > struct xfs_bmalloca, work); > > unsigned long pflags; > > + unsigned long new_pflags = PF_FSTRANS; > > > > - /* we are in a transaction context here */ > > - current_set_flags_nested(&pflags, PF_FSTRANS); > > + /* > > + * we are in a transaction context here, but may also be doing work > > + * in kswapd context, and hence we may need to inherit that state > > + * temporarily to ensure that we don't block waiting for memory reclaim > > + * in any way. > > + */ > > + if (args->kswapd) > > + new_pflags |= PF_MEMALLOC | PF_SWAPWRITE | PF_KSWAPD; > > So current_is_kswapd() returns true for a thread which is not kswapd. > That's a bit smelly. > > Should this thread really be incrementing KSWAPD_INODESTEAL instead of > PGINODESTEAL, for example? current_is_kswapd() does a range of things, > only one(?) of which you actually want. > > It would be cleaner to create a new PF_ flag to select just that > behavior. That's a better model than telling the world "I am magic and > special". Or a new __GFP_FLAG. > But we're awfully close to running out of PF_ space and I don't know if > this ugly justifies consuming a flag. 
> _______________________________________________ xfs mailing list xfs@oss.sgi.com http://oss.sgi.com/mailman/listinfo/xfs ^ permalink raw reply [flat|nested] 15+ messages in thread
* Re: [PATCH] mm/vmscan: Do not block forever at shrink_inactive_list(). 2014-05-20 6:03 ` Andrew Morton @ 2014-05-20 6:33 ` Dave Chinner 0 siblings, 0 replies; 15+ messages in thread From: Dave Chinner @ 2014-05-20 6:33 UTC (permalink / raw) To: Andrew Morton Cc: riel, Tetsuo Handa, linux-kernel, xfs, kosaki.motohiro, fengguang.wu, kamezawa.hiroyu On Mon, May 19, 2014 at 11:03:11PM -0700, Andrew Morton wrote: > On Mon, 19 May 2014 22:59:15 -0700 Andrew Morton <akpm@linux-foundation.org> wrote: > > > On Tue, 20 May 2014 10:44:49 +1000 Dave Chinner <david@fromorbit.com> wrote: > > > > > @@ -258,14 +258,23 @@ xfs_bmapi_allocate_worker( > > > struct xfs_bmalloca *args = container_of(work, > > > struct xfs_bmalloca, work); > > > unsigned long pflags; > > > + unsigned long new_pflags = PF_FSTRANS; > > > > > > - /* we are in a transaction context here */ > > > - current_set_flags_nested(&pflags, PF_FSTRANS); > > > + /* > > > + * we are in a transaction context here, but may also be doing work > > > + * in kswapd context, and hence we may need to inherit that state > > > + * temporarily to ensure that we don't block waiting for memory reclaim > > > + * in any way. > > > + */ > > > + if (args->kswapd) > > > + new_pflags |= PF_MEMALLOC | PF_SWAPWRITE | PF_KSWAPD; > > > > So current_is_kswapd() returns true for a thread which is not kswapd. > > That's a bit smelly. > > > > Should this thread really be incrementing KSWAPD_INODESTEAL instead of > > PGINODESTEAL, for example? current_is_kswapd() does a range of things, > > only one(?) of which you actually want. > > > > It would be cleaner to create a new PF_ flag to select just that > > behavior. That's a better model than telling the world "I am magic and > > special". > > Or a new __GFP_FLAG. 
Sure - and with that XFS will need another PF_ flag to tell the memory allocator to set the new __GFP_FLAG on every allocation done in that kworker task context, just like it uses PF_FSTRANS to ensure that __GFP_NOFS is set for all the allocations in that kworker context.... Cheers, Dave. -- Dave Chinner david@fromorbit.com _______________________________________________ xfs mailing list xfs@oss.sgi.com http://oss.sgi.com/mailman/listinfo/xfs ^ permalink raw reply [flat|nested] 15+ messages in thread
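The PF_FSTRANS precedent Dave mentions works by rewriting the GFP flags of every allocation made while the task flag is set. A userspace sketch of that pattern — the GFP bit values and the helper name `adjust_gfp_for_task()` are made up for illustration (in 2014 XFS did this masking in its own kmem wrappers rather than in core mm):

```c
#define __GFP_IO    0x40u        /* illustrative bit values */
#define __GFP_FS    0x80u
#define PF_FSTRANS  0x02000000u

static unsigned long task_flags;  /* stand-in for current->flags */

/* A task-wide PF_ flag adjusts the gfp mask of every allocation made
 * in that context, so callees deep in the stack cannot accidentally
 * recurse into the filesystem while a transaction is held. A new
 * __GFP_ flag for "kswapd-like" behaviour would need the same
 * treatment: a PF_ flag that ORs it in here for the whole worker. */
static unsigned int adjust_gfp_for_task(unsigned int gfp_mask)
{
    if (task_flags & PF_FSTRANS)
        gfp_mask &= ~__GFP_FS;
    return gfp_mask;
}
```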
* Re: [PATCH] mm/vmscan: Do not block forever at shrink_inactive_list(). 2014-05-20 5:59 ` Andrew Morton 2014-05-20 6:03 ` Andrew Morton @ 2014-05-20 6:30 ` Dave Chinner 2014-05-20 14:58 ` Tetsuo Handa 2014-05-20 7:20 ` [PATCH] mm/vmscan: Do not block forever at shrink_inactive_list() Christoph Hellwig 2 siblings, 1 reply; 15+ messages in thread From: Dave Chinner @ 2014-05-20 6:30 UTC (permalink / raw) To: Andrew Morton Cc: riel, Tetsuo Handa, linux-kernel, xfs, kosaki.motohiro, fengguang.wu, kamezawa.hiroyu On Mon, May 19, 2014 at 10:59:15PM -0700, Andrew Morton wrote: > On Tue, 20 May 2014 10:44:49 +1000 Dave Chinner <david@fromorbit.com> wrote: > > > @@ -258,14 +258,23 @@ xfs_bmapi_allocate_worker( > > struct xfs_bmalloca *args = container_of(work, > > struct xfs_bmalloca, work); > > unsigned long pflags; > > + unsigned long new_pflags = PF_FSTRANS; > > > > - /* we are in a transaction context here */ > > - current_set_flags_nested(&pflags, PF_FSTRANS); > > + /* > > + * we are in a transaction context here, but may also be doing work > > + * in kswapd context, and hence we may need to inherit that state > > + * temporarily to ensure that we don't block waiting for memory reclaim > > + * in any way. > > + */ > > + if (args->kswapd) > > + new_pflags |= PF_MEMALLOC | PF_SWAPWRITE | PF_KSWAPD; > > So current_is_kswapd() returns true for a thread which is not kswapd. > That's a bit smelly. > > Should this thread really be incrementing KSWAPD_INODESTEAL instead of > PGINODESTEAL, for example? current_is_kswapd() does a range of things, > only one(?) of which you actually want. It's doing work on behalf of kswapd under kswapd constraints, so it should both behave and be accounted as if kswapd was executing the work directly. > It would be cleaner to create a new PF_ flag to select just that > behavior. That's a better model than telling the world "I am magic and > special". However, it is magic and special because of who the work needs to be done for. 
> But we're awfully close to running out of PF_ space and I don't know if > this ugly justifies consuming a flag. I don't really care enough to argue over what mechanism should be used. I'll push this patch through the XFS tree, and when a new flag or generic mechanism for pushing task contexts to kworker threads is provided, I'll change the XFS code to use that.... Cheers, Dave. -- Dave Chinner david@fromorbit.com _______________________________________________ xfs mailing list xfs@oss.sgi.com http://oss.sgi.com/mailman/listinfo/xfs ^ permalink raw reply [flat|nested] 15+ messages in thread
* Re: [PATCH] mm/vmscan: Do not block forever at shrink_inactive_list(). 2014-05-20 6:30 ` Dave Chinner @ 2014-05-20 14:58 ` Tetsuo Handa 2014-05-20 16:12 ` Motohiro Kosaki 0 siblings, 1 reply; 15+ messages in thread From: Tetsuo Handa @ 2014-05-20 14:58 UTC (permalink / raw) To: david, riel Cc: linux-kernel, xfs, hch, kosaki.motohiro, akpm, fengguang.wu, kamezawa.hiroyu Today I discussed with Kosaki-san at LinuxCon Japan 2014 about this issue. He does not like the idea of adding a timeout to the throttle loop. As Dave posted a patch that fixes a bug in XFS delayed allocation, I updated my patch accordingly. Although the bug in XFS was fixed by Dave's patch, other kernel code may have bugs which would fall into this infinite throttle loop. But to keep the possibility of triggering the OOM killer to a minimum, can we agree on this updated patch (and in the future add some warning mechanism like /proc/sys/kernel/hung_task_timeout_secs for detecting memory allocation stalls)? Dave, if you are OK with this updated patch, please let me know the commit ID of your patch. Regards. ---------- >From 408e65d9025e8e24838e7bf6ac9066ba8a9391a6 Mon Sep 17 00:00:00 2001 From: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp> Date: Tue, 20 May 2014 23:34:34 +0900 Subject: [PATCH] mm/vmscan: Do not throttle kswapd at shrink_inactive_list(). I can observe that commit 35cd7815 "vmscan: throttle direct reclaim when too many pages are isolated already" causes a RHEL7 environment to stall with 0% CPU usage when a certain type of memory pressure is given. This is because nobody can reclaim memory due to the rules listed below. (a) XFS uses a kernel worker thread for delayed allocation (b) kswapd wakes up the kernel worker thread for delayed allocation (c) the kernel worker thread is throttled due to commit 35cd7815 This patch and commit XXXXXXXX "xfs: block allocation work needs to be kswapd aware" will solve rule (c). 
Signed-off-by: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp> --- mm/vmscan.c | 20 +++++++++++++++----- 1 files changed, 15 insertions(+), 5 deletions(-) diff --git a/mm/vmscan.c b/mm/vmscan.c index 32c661d..5c6960e 100644 --- a/mm/vmscan.c +++ b/mm/vmscan.c @@ -1460,12 +1460,22 @@ shrink_inactive_list(unsigned long nr_to_scan, struct lruvec *lruvec, struct zone *zone = lruvec_zone(lruvec); struct zone_reclaim_stat *reclaim_stat = &lruvec->reclaim_stat; - while (unlikely(too_many_isolated(zone, file, sc))) { - congestion_wait(BLK_RW_ASYNC, HZ/10); + /* + * Throttle only direct reclaimers. Allocations by kswapd (and + * allocation workqueue on behalf of kswapd) should not be + * throttled here; otherwise memory allocation will deadlock. + */ + if (!sc->hibernation_mode && !current_is_kswapd()) { + while (unlikely(too_many_isolated(zone, file, sc))) { + congestion_wait(BLK_RW_ASYNC, HZ/10); - /* We are about to die and free our memory. Return now. */ - if (fatal_signal_pending(current)) - return SWAP_CLUSTER_MAX; + /* + * We are about to die and free our memory. + * Return now. + */ + if (fatal_signal_pending(current)) + return SWAP_CLUSTER_MAX; + } } lru_add_drain(); -- 1.7.1 _______________________________________________ xfs mailing list xfs@oss.sgi.com http://oss.sgi.com/mailman/listinfo/xfs ^ permalink raw reply related [flat|nested] 15+ messages in thread
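The control flow Tetsuo's patch aims for can be modeled in userspace: direct reclaimers spin in the throttle loop, while kswapd (and, with Dave's XFS patch, workers carrying PF_KSWAPD on its behalf) skip it entirely. A sketch — `maybe_throttle()`, the `isolated_pressure` counter, and the `waits` counter are fakes standing in for the head of `shrink_inactive_list()`, `too_many_isolated()` state, and `congestion_wait()` calls:

```c
#define PF_KSWAPD 0x00040000u

struct scan_control { int hibernation_mode; };

static unsigned long task_flags;    /* stand-in for current->flags */
static int isolated_pressure;       /* fake too_many_isolated() state */
static int waits;                   /* congestion_wait() call count */

static int too_many_isolated(void) { return isolated_pressure > 0; }

/* Model of the reworked head of shrink_inactive_list(): only direct
 * reclaimers are throttled; kswapd context must never block here or
 * the whole reclaim pipeline can deadlock behind it. */
static void maybe_throttle(const struct scan_control *sc)
{
    if (sc->hibernation_mode || (task_flags & PF_KSWAPD))
        return;

    while (too_many_isolated()) {
        waits++;              /* congestion_wait(BLK_RW_ASYNC, HZ/10) */
        isolated_pressure--;  /* pretend reclaim catches up meanwhile */
    }
}
```

The decisive property is the second assertion below: a task flagged PF_KSWAPD never enters the loop, which is exactly why the XFS worker must inherit the flag.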
* RE: [PATCH] mm/vmscan: Do not block forever at shrink_inactive_list(). 2014-05-20 14:58 ` Tetsuo Handa @ 2014-05-20 16:12 ` Motohiro Kosaki 2014-05-26 11:45 ` [PATCH] mm/vmscan: Do not block forever atshrink_inactive_list() Tetsuo Handa 0 siblings, 1 reply; 15+ messages in thread From: Motohiro Kosaki @ 2014-05-20 16:12 UTC (permalink / raw) To: Tetsuo Handa, david@fromorbit.com, riel@redhat.com Cc: linux-kernel@vger.kernel.org, xfs@oss.sgi.com, hch@infradead.org, Motohiro Kosaki JP, akpm@linux-foundation.org, fengguang.wu@intel.com, kamezawa.hiroyu@jp.fujitsu.com > -----Original Message----- > From: Tetsuo Handa [mailto:penguin-kernel@I-love.SAKURA.ne.jp] > Sent: Tuesday, May 20, 2014 11:58 PM > To: david@fromorbit.com; riel@redhat.com > Cc: Motohiro Kosaki JP; fengguang.wu@intel.com; kamezawa.hiroyu@jp.fujitsu.com; akpm@linux-foundation.org; > hch@infradead.org; linux-kernel@vger.kernel.org; xfs@oss.sgi.com > Subject: Re: [PATCH] mm/vmscan: Do not block forever at shrink_inactive_list(). > > Today I discussed with Kosaki-san at LinuxCon Japan 2014 about this issue. > He does not like the idea of adding timeout to throttle loop. As Dave posted a patch that fixes a bug in XFS delayed allocation, I > updated my patch accordingly. > > Although the bug in XFS was fixed by Dave's patch, other kernel code would have bugs which would fall into this infinite throttle loop. > But to keep the possibility of triggering OOM killer minimum, can we agree with this updated patch (and in the future adding some > warning mechanism like /proc/sys/kernel/hung_task_timeout_secs for detecting memory allocation stall)? > > Dave, if you are OK with this updated patch, please let me know commit ID of your patch. > > Regards. > ---------- > >From 408e65d9025e8e24838e7bf6ac9066ba8a9391a6 Mon Sep 17 00:00:00 2001 > From: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp> > Date: Tue, 20 May 2014 23:34:34 +0900 > Subject: [PATCH] mm/vmscan: Do not throttle kswapd at shrink_inactive_list(). 
> > I can observe that commit 35cd7815 "vmscan: throttle direct reclaim when too many pages are isolated already" causes RHEL7 > environment to stall with 0% CPU usage when a certain type of memory pressure is given. > This is because nobody can reclaim memory due to rules listed below. > > (a) XFS uses a kernel worker thread for delayed allocation > (b) kswapd wakes up the kernel worker thread for delayed allocation > (c) the kernel worker thread is throttled due to commit 35cd7815 > > This patch and commit XXXXXXXX "xfs: block allocation work needs to be kswapd aware" will solve rule (c). > > Signed-off-by: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp> > --- > mm/vmscan.c | 20 +++++++++++++++----- > 1 files changed, 15 insertions(+), 5 deletions(-) > > diff --git a/mm/vmscan.c b/mm/vmscan.c > index 32c661d..5c6960e 100644 > --- a/mm/vmscan.c > +++ b/mm/vmscan.c > @@ -1460,12 +1460,22 @@ shrink_inactive_list(unsigned long nr_to_scan, struct lruvec *lruvec, > struct zone *zone = lruvec_zone(lruvec); > struct zone_reclaim_stat *reclaim_stat = &lruvec->reclaim_stat; > > - while (unlikely(too_many_isolated(zone, file, sc))) { > - congestion_wait(BLK_RW_ASYNC, HZ/10); > + /* > + * Throttle only direct reclaimers. Allocations by kswapd (and > + * allocation workqueue on behalf of kswapd) should not be > + * throttled here; otherwise memory allocation will deadlock. > + */ > + if (!sc->hibernation_mode && !current_is_kswapd()) { > + while (unlikely(too_many_isolated(zone, file, sc))) { > + congestion_wait(BLK_RW_ASYNC, HZ/10); > > - /* We are about to die and free our memory. Return now. */ > - if (fatal_signal_pending(current)) > - return SWAP_CLUSTER_MAX; > + /* > + * We are about to die and free our memory. > + * Return now. 
> + */ > + if (fatal_signal_pending(current)) > + return SWAP_CLUSTER_MAX; > + } > } Acked-by: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com> Dave, I don't like Tetsuo's first patch because this too_many_isolated() exists to prevent false OOM kills, so a simple timeout would resurrect them. Please let me know if you need further MM enhancements to solve the XFS issue. I'd like to join and assist with this. Thanks. _______________________________________________ xfs mailing list xfs@oss.sgi.com http://oss.sgi.com/mailman/listinfo/xfs ^ permalink raw reply [flat|nested] 15+ messages in thread
* Re: [PATCH] mm/vmscan: Do not block forever atshrink_inactive_list(). 2014-05-20 16:12 ` Motohiro Kosaki @ 2014-05-26 11:45 ` Tetsuo Handa 2014-06-03 21:43 ` David Rientjes 0 siblings, 1 reply; 15+ messages in thread From: Tetsuo Handa @ 2014-05-26 11:45 UTC (permalink / raw) To: Motohiro.Kosaki, david, riel Cc: linux-kernel, xfs, hch, kosaki.motohiro, akpm, fengguang.wu, kamezawa.hiroyu Tetsuo Handa wrote: > Dave, if you are OK with this updated patch, please let me know commit ID of your patch. I overlooked that too_many_isolated() in mm/vmscan.c already has if (current_is_kswapd()) return 0; lines. Therefore, it turned out that my patch is irrelevant to XFS issue. Does patch shown below make sense? Regards. ---------- >From e133aadd65df3d8622efced3443888ed3e60327b Mon Sep 17 00:00:00 2001 From: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp> Date: Mon, 26 May 2014 20:37:12 +0900 Subject: [PATCH] mm/vmscan: Do not throttle when hibernation_mode != 0. In shrink_inactive_list(), we do not insert delay at if (!sc->hibernation_mode && !current_is_kswapd()) wait_iff_congested(zone, BLK_RW_ASYNC, HZ/10); if sc->hibernation_mode != 0. Follow the same reason, we should not insert delay at while (unlikely(too_many_isolated(zone, file, sc))) { congestion_wait(BLK_RW_ASYNC, HZ/10); /* We are about to die and free our memory. Return now. */ if (fatal_signal_pending(current)) return SWAP_CLUSTER_MAX; } if sc->hibernation_mode != 0. 
Signed-off-by: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp> --- mm/vmscan.c | 3 +++ 1 files changed, 3 insertions(+), 0 deletions(-) diff --git a/mm/vmscan.c b/mm/vmscan.c index 32c661d..89c42ca 100644 --- a/mm/vmscan.c +++ b/mm/vmscan.c @@ -1362,6 +1362,9 @@ static int too_many_isolated(struct zone *zone, int file, if (current_is_kswapd()) return 0; + if (sc->hibernation_mode) + return 0; + if (!global_reclaim(sc)) return 0; -- 1.7.1 _______________________________________________ xfs mailing list xfs@oss.sgi.com http://oss.sgi.com/mailman/listinfo/xfs ^ permalink raw reply related [flat|nested] 15+ messages in thread
* Re: [PATCH] mm/vmscan: Do not block forever at shrink_inactive_list(). 2014-05-26 11:45 ` [PATCH] mm/vmscan: Do not block forever at shrink_inactive_list() Tetsuo Handa @ 2014-06-03 21:43 ` David Rientjes 2014-06-05 12:45 ` [PATCH] mm/vmscan: Do not block forever at shrink_inactive_list() Tetsuo Handa 0 siblings, 1 reply; 15+ messages in thread From: David Rientjes @ 2014-06-03 21:43 UTC (permalink / raw) To: Tetsuo Handa Cc: riel, linux-kernel, xfs, hch, kosaki.motohiro, Motohiro.Kosaki, fengguang.wu, akpm, kamezawa.hiroyu On Mon, 26 May 2014, Tetsuo Handa wrote: > In shrink_inactive_list(), we do not insert delay at > > if (!sc->hibernation_mode && !current_is_kswapd()) > wait_iff_congested(zone, BLK_RW_ASYNC, HZ/10); > > if sc->hibernation_mode != 0. > Follow the same reason, we should not insert delay at > > while (unlikely(too_many_isolated(zone, file, sc))) { > congestion_wait(BLK_RW_ASYNC, HZ/10); > > /* We are about to die and free our memory. Return now. */ > if (fatal_signal_pending(current)) > return SWAP_CLUSTER_MAX; > } > > if sc->hibernation_mode != 0. > > Signed-off-by: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp> > --- > mm/vmscan.c | 3 +++ > 1 files changed, 3 insertions(+), 0 deletions(-) > > diff --git a/mm/vmscan.c b/mm/vmscan.c > index 32c661d..89c42ca 100644 > --- a/mm/vmscan.c > +++ b/mm/vmscan.c > @@ -1362,6 +1362,9 @@ static int too_many_isolated(struct zone *zone, int file, > if (current_is_kswapd()) > return 0; > > + if (sc->hibernation_mode) > + return 0; > + > if (!global_reclaim(sc)) > return 0; > This isn't the only too_many_isolated() function that does a delay; how is the too_many_isolated() in mm/compaction.c different? _______________________________________________ xfs mailing list xfs@oss.sgi.com http://oss.sgi.com/mailman/listinfo/xfs ^ permalink raw reply [flat|nested] 15+ messages in thread
* Re: [PATCH] mm/vmscan: Do not block forever at shrink_inactive_list(). 2014-06-03 21:43 ` David Rientjes @ 2014-06-05 12:45 ` Tetsuo Handa 2014-06-05 13:17 ` Dave Chinner 0 siblings, 1 reply; 15+ messages in thread From: Tetsuo Handa @ 2014-06-05 12:45 UTC (permalink / raw) To: rientjes, Motohiro.Kosaki Cc: riel, linux-kernel, xfs, hch, kosaki.motohiro, akpm, fengguang.wu, kamezawa.hiroyu David Rientjes wrote: > On Mon, 26 May 2014, Tetsuo Handa wrote: > > > In shrink_inactive_list(), we do not insert delay at > > > > if (!sc->hibernation_mode && !current_is_kswapd()) > > wait_iff_congested(zone, BLK_RW_ASYNC, HZ/10); > > > > if sc->hibernation_mode != 0. > > Follow the same reason, we should not insert delay at > > > > while (unlikely(too_many_isolated(zone, file, sc))) { > > congestion_wait(BLK_RW_ASYNC, HZ/10); > > > > /* We are about to die and free our memory. Return now. */ > > if (fatal_signal_pending(current)) > > return SWAP_CLUSTER_MAX; > > } > > > > if sc->hibernation_mode != 0. > > > > Signed-off-by: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp> > > --- > > mm/vmscan.c | 3 +++ > > 1 files changed, 3 insertions(+), 0 deletions(-) > > > > diff --git a/mm/vmscan.c b/mm/vmscan.c > > index 32c661d..89c42ca 100644 > > --- a/mm/vmscan.c > > +++ b/mm/vmscan.c > > @@ -1362,6 +1362,9 @@ static int too_many_isolated(struct zone *zone, int file, > > if (current_is_kswapd()) > > return 0; > > > > + if (sc->hibernation_mode) > > + return 0; > > + > > if (!global_reclaim(sc)) > > return 0; > > > > This isn't the only too_many_isolated() functions that do a delay, how is > the too_many_isolated() in mm/compaction.c different? > I don't know. But today I realized that this patch is not sufficient. I'm trying to find why __alloc_pages_slowpath() cannot return for many minutes when a certain type of memory pressure is given on a RHEL7 environment with 4 CPU / 2GB RAM. 
Today I tried to use ftrace for examining the breakdown of time-consuming functions inside __alloc_pages_slowpath(). But on the first run, all processes were trapped in this too_many_isolated()/congestion_wait() loop while kswapd was not running; stalling forever because nobody could perform the operations needed to make too_many_isolated() return 0. This means that, under rare circumstances, it is possible that all processes other than kswapd are trapped in the too_many_isolated()/congestion_wait() loop while kswapd is sleeping, because this loop assumes that somebody else shall wake up kswapd and kswapd shall perform the operations that make too_many_isolated() return 0. However, we cannot guarantee that kswapd is woken by somebody, nor that kswapd is not blocked by blocking operations inside shrinker functions (e.g. mutex_lock()). We need some more changes. I'm thinking of a memory allocation watchdog thread. Add an "unsigned long" field to "struct task_struct", set jiffies to the field upon entry of GFP_WAIT-able memory allocation attempts, and clear the field upon returning from GFP_WAIT-able memory allocation attempts. A kernel thread periodically scans the task list and compares the field with jiffies, and (at least) prints warning messages (maybe optionally triggers the OOM killer or a kernel panic) if a single memory allocation attempt is taking too long (e.g. 60 seconds). What do you think? _______________________________________________ xfs mailing list xfs@oss.sgi.com http://oss.sgi.com/mailman/listinfo/xfs ^ permalink raw reply [flat|nested] 15+ messages in thread
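The watchdog Tetsuo proposes reduces to a per-task timestamp plus a periodic scan. A userspace sketch of just the bookkeeping, with plain integers standing in for jiffies and a small array standing in for the task list — the names `task_rec`, `alloc_enter()`, `alloc_exit()`, and `watchdog_scan()` are made up for illustration:

```c
#include <stddef.h>

/* One record per task: when its current GFP_WAIT-able allocation began,
 * or 0 if the task is not inside the allocator. */
struct task_rec {
    const char *comm;
    unsigned long alloc_start;   /* jiffies at allocation entry, 0 = idle */
};

/* Would be called at entry/exit of __alloc_pages_slowpath(). */
static void alloc_enter(struct task_rec *t, unsigned long now) { t->alloc_start = now; }
static void alloc_exit(struct task_rec *t)                     { t->alloc_start = 0; }

/* Periodic scan: count tasks whose single allocation attempt has been
 * running longer than the threshold (e.g. 60 * HZ in the real thing). */
static int watchdog_scan(struct task_rec *tasks, size_t n,
                         unsigned long now, unsigned long threshold)
{
    int stalled = 0;
    size_t i;

    for (i = 0; i < n; i++)
        if (tasks[i].alloc_start && now - tasks[i].alloc_start > threshold)
            stalled++;   /* real code would print a warning here */
    return stalled;
}
```

Clearing the field on exit is what keeps the scan cheap and false-positive-free: only tasks currently stuck inside a single allocation attempt can ever trip the threshold.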
* Re: [PATCH] mm/vmscan: Do not block forever at shrink_inactive_list().
  2014-06-05 12:45     ` [PATCH] mm/vmscan: Do not block forever at shrink_inactive_list() Tetsuo Handa
@ 2014-06-05 13:17       ` Dave Chinner
  2014-06-06 12:19         ` Tetsuo Handa
  0 siblings, 1 reply; 15+ messages in thread
From: Dave Chinner @ 2014-06-05 13:17 UTC (permalink / raw)
  To: Tetsuo Handa
  Cc: riel, linux-kernel, xfs, hch, kosaki.motohiro, rientjes,
	Motohiro.Kosaki, fengguang.wu, akpm, kamezawa.hiroyu

On Thu, Jun 05, 2014 at 09:45:26PM +0900, Tetsuo Handa wrote:
> David Rientjes wrote:
> > On Mon, 26 May 2014, Tetsuo Handa wrote:
[snip quoted patch adding an sc->hibernation_mode check to too_many_isolated()]
> > This isn't the only too_many_isolated() function that inserts a delay; how
> > is the too_many_isolated() in mm/compaction.c different?
> >
> I don't know.
> But today I realized that this patch is not sufficient.
>
> I'm trying to find out why __alloc_pages_slowpath() cannot return for many
> minutes when a certain type of memory pressure is applied to a RHEL7
> environment with 4 CPUs / 2GB RAM. Today I tried to use ftrace to examine
> the breakdown of time-consuming functions inside __alloc_pages_slowpath().
> But on the first run, all processes were trapped in this
> too_many_isolated()/congestion_wait() loop while kswapd was not running;
> the system stalled forever because nobody could perform the operations
> needed to make too_many_isolated() return 0.
>
> This means that, under rare circumstances, it is possible for all processes
> other than kswapd to be trapped in the too_many_isolated()/congestion_wait()
> loop while kswapd is sleeping, because this loop assumes that somebody else
> will wake up kswapd and that kswapd will perform the operations needed to
> make too_many_isolated() return 0. However, we can guarantee neither that
> kswapd will be woken by somebody nor that kswapd will avoid blocking
> operations inside shrinker functions (e.g. mutex_lock()).

So what you are saying is that kswapd is having problems with
getting blocked on locks held by processes in direct reclaim?

What are the stack traces that demonstrate such a dependency loop?

Cheers,

Dave.
--
Dave Chinner
david@fromorbit.com
* Re: [PATCH] mm/vmscan: Do not block forever at shrink_inactive_list().
  2014-06-05 13:17       ` Dave Chinner
@ 2014-06-06 12:19         ` Tetsuo Handa
  0 siblings, 0 replies; 15+ messages in thread
From: Tetsuo Handa @ 2014-06-06 12:19 UTC (permalink / raw)
  To: david
  Cc: riel, linux-kernel, xfs, hch, kosaki.motohiro, rientjes,
	Motohiro.Kosaki, fengguang.wu, akpm, kamezawa.hiroyu

Dave Chinner wrote:
> On Thu, Jun 05, 2014 at 09:45:26PM +0900, Tetsuo Handa wrote:
> > This means that, under rare circumstances, it is possible for all processes
> > other than kswapd to be trapped in the too_many_isolated()/congestion_wait()
> > loop while kswapd is sleeping, because this loop assumes that somebody else
> > will wake up kswapd and that kswapd will perform the operations needed to
> > make too_many_isolated() return 0. However, we can guarantee neither that
> > kswapd will be woken by somebody nor that kswapd will avoid blocking
> > operations inside shrinker functions (e.g. mutex_lock()).
>
> So what you are saying is that kswapd is having problems with
> getting blocked on locks held by processes in direct reclaim?
>
> What are the stack traces that demonstrate such a dependency loop?

If a normal task's GFP_KERNEL memory allocation calls a shrinker function,
and that shrinker function performs a GFP_WAIT-able allocation while holding
a mutex, isn't there a possibility that kswapd is woken up by the
GFP_WAIT-able allocation, calls the same shrinker function, and then blocks
trying to acquire the same mutex inside it?

Since ttm_dma_pool_shrink_count()/ttm_dma_pool_shrink_scan() hold a mutex
and ttm_dma_pool_shrink_scan() performs a GFP_WAIT-able allocation, I think
such a dependency loop is possible.
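The dependency loop described here can be spelled out step by step. The
following is pseudocode, not compilable kernel code; the ttm function names
are from the mail, and `_manager->lock` stands for whichever mutex the ttm
DMA pool shrinker actually holds:

```c
/* Task A: a normal GFP_KERNEL allocation enters direct reclaim and
 * reaches the ttm shrinker: */
ttm_dma_pool_shrink_scan()
	mutex_lock(&_manager->lock)        /* A now holds the pool mutex */
	alloc_pages(GFP_KERNEL)            /* GFP_WAIT-able allocation...*/
		wake_all_kswapds()         /* ...wakes kswapd            */
		/* A waits for reclaim progress, still holding the mutex */

/* kswapd, woken by A's allocation, reaches the same shrinker: */
balance_pgdat()
	shrink_slab()
		ttm_dma_pool_shrink_scan()
			mutex_lock(&_manager->lock)  /* blocks: held by A */

/* kswapd can make no further progress, so every direct reclaimer that
 * trips too_many_isolated() spins in congestion_wait() forever. */
```

The too_many_isolated() loop in shrink_inactive_list() exempts only tasks
for which current_is_kswapd() is true, which is why a blocked kswapd leaves
no task able to break the cycle.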
* Re: [PATCH] mm/vmscan: Do not block forever at shrink_inactive_list().
  2014-05-20  5:59 ` Andrew Morton
  2014-05-20  6:03   ` Andrew Morton
  2014-05-20  6:30   ` Dave Chinner
@ 2014-05-20  7:20 ` Christoph Hellwig
  2 siblings, 0 replies; 15+ messages in thread
From: Christoph Hellwig @ 2014-05-20 7:20 UTC (permalink / raw)
  To: Andrew Morton
  Cc: riel, Tetsuo Handa, linux-kernel, xfs, kosaki.motohiro,
	fengguang.wu, kamezawa.hiroyu

On Mon, May 19, 2014 at 10:59:15PM -0700, Andrew Morton wrote:
> So current_is_kswapd() returns true for a thread which is not kswapd.
> That's a bit smelly.
>
> Should this thread really be incrementing KSWAPD_INODESTEAL instead of
> PGINODESTEAL, for example? current_is_kswapd() does a range of things,
> only one(?) of which you actually want.

Actually we want all of them. The allocation workqueue is a workaround
for the incredible stack usage in the Linux I/O path. If it is called
by kswapd it should act as if it were kswapd for all purposes.
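This is the reasoning behind Dave's fix earlier in the thread: record
whether the task queueing the allocation work is kswapd, and have the
worker temporarily assume kswapd identity. A sketch of the shape of that
change (field names and details are illustrative, not the exact committed
patch):

```c
/* When deferring the allocation to the workqueue, capture the
 * caller's reclaim context: */
args->kswapd = current_is_kswapd();

/* In the worker, masquerade as kswapd for the duration of the
 * allocation, so that current_is_kswapd() — and with it the
 * too_many_isolated() and wait_iff_congested() guards in vmscan —
 * see kswapd context: */
static void
xfs_bmapi_allocate_worker(struct work_struct *work)
{
	struct xfs_bmalloca *args = container_of(work,
						 struct xfs_bmalloca, work);
	unsigned long pflags = 0;

	if (args->kswapd) {
		pflags = PF_KSWAPD;
		current->flags |= PF_KSWAPD;
	}

	args->result = __xfs_bmapi_allocate(args);
	complete(args->done);

	current->flags &= ~pflags;	/* restore the worker's own flags */
}
```

Because only the bit that was actually set is cleared afterwards, the
worker thread's flags are left untouched when the caller was not kswapd.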
end of thread, other threads:[~2014-06-06 12:20 UTC | newest]
Thread overview: 15+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
[not found] <201405192340.FCD48964.OFQHOOJLVSFFMt@I-love.SAKURA.ne.jp>
2014-05-20 0:44 ` [PATCH] mm/vmscan: Do not block forever at shrink_inactive_list() Dave Chinner
2014-05-20 3:54 ` Tetsuo Handa
2014-05-20 5:24 ` Dave Chinner
2014-05-20 5:59 ` Andrew Morton
2014-05-20 6:03 ` Andrew Morton
2014-05-20 6:33 ` Dave Chinner
2014-05-20 6:30 ` Dave Chinner
2014-05-20 14:58 ` Tetsuo Handa
2014-05-20 16:12 ` Motohiro Kosaki
2014-05-26 11:45 ` [PATCH] mm/vmscan: Do not block forever at shrink_inactive_list() Tetsuo Handa
2014-06-03 21:43 ` David Rientjes
2014-06-05 12:45 ` [PATCH] mm/vmscan: Do not block forever at shrink_inactive_list() Tetsuo Handa
2014-06-05 13:17 ` Dave Chinner
2014-06-06 12:19 ` [PATCH] mm/vmscan: Do not block forever at shrink_inactive_list() Tetsuo Handa
2014-05-20 7:20 ` [PATCH] mm/vmscan: Do not block forever at shrink_inactive_list() Christoph Hellwig